From Nitin.Choudhary at palm.com Thu Oct 1 02:37:29 2009
From: Nitin.Choudhary at palm.com (Nitin Choudhary)
Date: Wed, 30 Sep 2009 19:37:29 -0700
Subject: [Linux-cluster] Cluster Pre-Production testing
In-Reply-To: <1254353513.24970.19.camel@localhost.localdomain>
References: <1254353513.24970.19.camel@localhost.localdomain>
Message-ID: 

Hi!

The process I followed is similar to what we test for Veritas Clusters.

For a two-node cluster, perform these tests on both nodes:

1. Pull out the Ethernet cable; if channel bonding is implemented, then both cables.
2. Reboot the server.
3. Power cycle (forcibly).
4. Disconnect the Fiber cable to simulate fabric failures.
5. Kill the CMAN/rgmanager/qdiskd daemons.
6. Manually fence the other node.

Check the messages log files for any other errors.

Does anyone have more suggestions?

Thanks,

Nitin

-----Original Message-----
From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Steve OBrien
Sent: Wednesday, September 30, 2009 4:32 PM
To: linux-cluster at redhat.com
Subject: [Linux-cluster] Cluster Pre-Production testing

Does anyone have a reliable method for testing a cluster before putting it into production?

TIA,
Steve

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

From pollochicken at gmail.com Thu Oct 1 02:43:09 2009
From: pollochicken at gmail.com (Adalberto Rivera Laporte)
Date: Wed, 30 Sep 2009 21:43:09 -0500
Subject: [Linux-cluster] Cluster Pre-Production testing
In-Reply-To: 
References: <1254353513.24970.19.camel@localhost.localdomain>
Message-ID: <6EA04BBC-31ED-44DF-911E-D4850203134B@gmail.com>

Hi all,

First time posting to the list - another way I use to test our clusters is to kernel panic the box, in addition to what was already mentioned.

echo 1 > /proc/sys/kernel/sysrq
echo c > /proc/sysrq-trigger

---
Hope that helps.
Alberto L.

On Sep 30, 2009, at 9:37 PM, Nitin Choudhary wrote:

> Hi!
>
> The process I followed is similar to what we test for Veritas Clusters.
>
> For a two-node cluster, perform these tests on both nodes:
>
> 1. Pull out the Ethernet cable; if channel bonding is implemented,
> then both cables.
> 2. Reboot the server.
> 3. Power cycle (forcibly).
> 4. Disconnect the Fiber cable to simulate fabric failures.
> 5. Kill the CMAN/rgmanager/qdiskd daemons.
> 6. Manually fence the other node.
>
> Check the messages log files for any other errors.
>
> Does anyone have more suggestions?
>
> Thanks,
>
> Nitin
>
> -----Original Message-----
> From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Steve OBrien
> Sent: Wednesday, September 30, 2009 4:32 PM
> To: linux-cluster at redhat.com
> Subject: [Linux-cluster] Cluster Pre-Production testing
>
> Does anyone have a reliable method for testing a cluster before putting
> it into production?
>
> TIA,
> Steve
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

From frank at si.ct.upc.edu Thu Oct 1 07:15:08 2009
From: frank at si.ct.upc.edu (frank)
Date: Thu, 01 Oct 2009 09:15:08 +0200
Subject: [Linux-cluster] cluster just for GFS
Message-ID: <4AC456FC.4020009@si.ct.upc.edu>

Hi,
we have a two-node Red Hat 5.3 cluster just because we need a GFS filesystem mounted on both nodes.
We have configured the cluster with the two nodes, of course, and a fence device for each one, which is an ipmilan.
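As a rough illustration of test 6 in the checklist above ("manually fence the other node"), exercised against an ipmilan-type fence device like the one frank describes, the commands below are a minimal sketch; the BMC address, login, password and node name are placeholders, not values taken from this thread:

fence_ipmilan -a 10.0.0.50 -l admin -p password -o status   # confirm the management controller answers before relying on it
fence_node node02                                           # ask the cluster stack to fence the peer node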
Does this configuration make sense? I mean, we have configured fence devices, but if there are not really any cluster resources, do the fence devices do anything?
I forgot to mention that we have set the GFS filesystem to be mounted on boot ("/etc/init.d/gfs start" and the GFS filesystem is defined in /etc/fstab).

Is there any better approach?

Thanks for your help.

Frank

--
This message has been analysed by MailScanner for viruses and other dangerous content, and is considered to be clean.
For all your IT requirements visit: http://www.transtec.co.uk

From swhiteho at redhat.com Thu Oct 1 09:16:16 2009
From: swhiteho at redhat.com (Steven Whitehouse)
Date: Thu, 01 Oct 2009 10:16:16 +0100
Subject: [Linux-cluster] cluster just for GFS
In-Reply-To: <4AC456FC.4020009@si.ct.upc.edu>
References: <4AC456FC.4020009@si.ct.upc.edu>
Message-ID: <1254388576.2697.1.camel@localhost.localdomain>

Hi,

On Thu, 2009-10-01 at 09:15 +0200, frank wrote:
> Hi,
> we have a two-node Red Hat 5.3 cluster just because we need a GFS
> filesystem mounted on both nodes.
> We have configured the cluster with the two nodes, of course, and a
> fence device for each one, which is an ipmilan.
>
> Does this configuration make sense? I mean, we have configured fence
> devices, but if there are not really any cluster resources, do the
> fence devices do anything?
> I forgot to mention that we have set the GFS filesystem to be mounted on
> boot ("/etc/init.d/gfs start" and the GFS filesystem is defined in
> /etc/fstab).
>
> Is there any better approach?
>
> Thanks for your help.
>
> Frank
>
It sounds like you have done all the right things to me. Fencing is required to prevent a node which is thought to have failed (and thus whose journal is being recovered by another node) from coming back to life in the middle of the recovery process and thus corrupting the filesystem,

Steve.

From nicolas.ferre at univ-provence.fr Thu Oct 1 09:36:28 2009
From: nicolas.ferre at univ-provence.fr (=?ISO-8859-1?Q?Nicolas_Ferr=E9?=)
Date: Thu, 01 Oct 2009 11:36:28 +0200
Subject: [Linux-cluster] Too large load on the login node
Message-ID: <4AC4781C.5080307@univ-provence.fr>

Hi,

We recently installed a new cluster composed of 1 login node and several computing nodes running CentOS. These nodes share a GFS2 fs made of two partitions.

A strange thing is that on each node, the activity load (as monitored by the 'top' command) is always larger than 1.
After some googling, it > seems someone already reported this problem but I can't see any solution. > Moreover, on the login node, the load is even larger: > top - 11:34:17 up 1 day, 22:07, 1 user, load average: 16.19, 16.20, 16.12 > while there is no cpu-intensive running processes. > > Do you have an explanation? Processes in uninterruptible sleep are counted in the load average. One of the gfs2 daemons (in early versions) was set to sleep in this way. I suggest that you should upgrade to a more recent version (simply because a number of other bugs have been fixed since then) although the uninterruptible sleep is harmless aside from its effect on the load average, Steve. From robejrm at gmail.com Thu Oct 1 09:46:37 2009 From: robejrm at gmail.com (Juan Ramon Martin Blanco) Date: Thu, 1 Oct 2009 11:46:37 +0200 Subject: [Linux-cluster] Too large load on the login node In-Reply-To: <4AC4781C.5080307@univ-provence.fr> References: <4AC4781C.5080307@univ-provence.fr> Message-ID: <8a5668960910010246l6f9ab4ccw4c96d5a36ed11e02@mail.gmail.com> On Thu, Oct 1, 2009 at 11:36 AM, Nicolas Ferr? wrote: > Hi, > > We recently installed a new cluster composed of 1 login node and several > computing nodes running CentOS. These nodes share a GFS2 fs made of two > partitions. > > A strange thing is that on each node, the activity load (as monitored by the > 'top' command) is always larger than 1. After some googling, it seems > someone already reported this problem but I can't see any solution. > Moreover, on the login node, the load is even larger: > top - 11:34:17 up 1 day, 22:07, 1 user, load average: 16.19, 16.20, 16.12 > while there is no cpu-intensive running processes. > Which cluster-suite versions are you using? Look for processes in state D or R (ps aux) and paste them here. Greetings, Juanra > Do you have an explanation? > -- > Nicolas Ferre' > Laboratoire Chimie Provence > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From r.rosenberger at netbiscuits.com Thu Oct 1 12:12:27 2009 From: r.rosenberger at netbiscuits.com (Rene Rosenberger) Date: Thu, 1 Oct 2009 14:12:27 +0200 Subject: AW: AW: [Linux-cluster] Problems starting a VM Service In-Reply-To: <1254323288.14760.14.camel@localhost.localdomain> References: <22a301ca41c7$7e40e540$7ac2afc0$@rosenberger@netbiscuits.com> <1254316836.16878.18.camel@localhost.localdomain> <22d801ca41d1$6e5ed740$4b1c85c0$@rosenberger@netbiscuits.com> <1254323288.14760.14.camel@localhost.localdomain> Message-ID: <017401ca4290$82aab5b0$88002110$@rosenberger@netbiscuits.com> Hi again, here is the debug output when i try to start Log-Server: [root at cluster-node01 tmp]# cat DEBUG + PATH=/bin:/sbin:/usr/bin:/usr/sbin + export PATH ++ dirname /usr/share/cluster/vm.sh + . /usr/share/cluster/ocf-shellfuncs +++ basename /usr/share/cluster/vm.sh ++ __SCRIPT_NAME=vm.sh ++ consoletype ++ '[' 1 -eq 1 ']' ++ __SERIAL=yes ++ __LOG_PID=7780 ++++ readlink /proc/7780/exe +++ basename /usr/sbin/clurgmgrd ++ __LOG_NAME=clurgmgrd ++ __ocf_set_defaults stop ++ __OCF_ACTION=stop ++ unset LANG ++ LC_ALL=C ++ export LC_ALL ++ OCF_SUCCESS=0 ++ OCF_ERR_GENERIC=1 ++ OCF_ERR_ARGS=2 ++ OCF_ERR_UNIMPLEMENTED=3 ++ OCF_ERR_PERM=4 ++ OCF_ERR_INSTALLED=5 ++ OCF_ERR_CONFIGURED=6 ++ OCF_NOT_RUNNING=7 ++ '[' -z vm ']' ++ '[' -z 1 ']' ++ '[' -z /usr/share/cluster ']' ++ '[' '!' 
-d /usr/share/cluster ']' ++ '[' x1 '!=' x1 ']' ++ '[' -z 0 ']' ++ '[' xstop = xmeta-data ']' ++ '[' -z vm:Log-Server ']' + export OCF_APP_ERR_INDETERMINATE=150 + OCF_APP_ERR_INDETERMINATE=150 + case $1 in + validate_all ++ id -u + '[' 0 = 0 ']' + '[' -z auto ']' + '[' auto = auto ']' ++ virsh version ++ grep 'Running hypervisor:' ++ tr A-Z a-z ++ awk '{print $3}' + export OCF_RESKEY_hypervisor=xen + OCF_RESKEY_hypervisor=xen + '[' -z xen ']' + echo Hypervisor: xen Hypervisor: xen + '[' 1 = 0 ']' + '[' -z '' ']' + echo 'Management tool: virsh' Management tool: virsh + export OCF_RESKEY_use_virsh=1 + OCF_RESKEY_use_virsh=1 + '[' -z auto -o auto = auto ']' + '[' 1 = 1 ']' + '[' xen = qemu ']' + '[' xen = xen ']' + OCF_RESKEY_hypervisor_uri=xen:/// + echo Hypervisor URI: xen:/// Hypervisor URI: xen:/// + '[' -z auto -o auto = auto ']' + '[' 1 = 1 ']' + '[' xen = qemu ']' + '[' xen = xen ']' + export OCF_RESKEY_migration_uri=xenmigr://%s/ + OCF_RESKEY_migration_uri=xenmigr://%s/ + '[' -n xenmigr://%s/ ']' ++ printf xenmigr://%s/ target_host + echo Migration URI format: xenmigr://target_host/ Migration URI format: xenmigr://target_host/ + '[' -z Log-Server ']' + return 0 + do_stop shutdown destroy + declare domstate rv ++ do_status ++ '[' 1 = 1 ']' ++ virsh_status ++ declare state pid ++ '[' xen = xen ']' ++ service xend status ++ '[' 0 -ne 0 ']' +++ pidof libvirtd ++ pid=7002 ++ '[' -z 7002 ']' +++ virsh domstate Log-Server ++ state='shut off' ++ echo shut off ++ '[' 'shut off' = running ']' ++ '[' 'shut off' = paused ']' ++ '[' 'shut off' = 'no state' ']' ++ '[' 'shut off' = idle ']' ++ return 1 ++ return 1 + domstate='shut off' + rv=1 + ocf_log debug 'Virtual machine Log-Server is shut off' + '[' 2 -lt 2 ']' + declare __OCF_PRIO=debug + declare -i __OCF_PRIO_N + shift + declare '__OCF_MSG=Virtual machine Log-Server is shut off' + case "${__OCF_PRIO}" in + __OCF_PRIO_N=7 + pretty_echo debug 'Virtual machine Log-Server is shut off' + declare pretty + declare 'n=' + declare __OCF_PRIO=debug + shift + declare '__OCF_MSG=Virtual machine Log-Server is shut off' + '[' -n yes ']' + echo ' Virtual machine Log-Server is shut off' Virtual machine Log-Server is shut off + return 0 ++ which clulog + '[' -z /usr/sbin/clulog ']' + clulog -p 7780 -n clurgmgrd -s 7 'Virtual machine Log-Server is shut off' + '[' 1 -eq 150 ']' + '[' 1 = 1 ']' + do_virsh_stop shutdown destroy + declare -i timeout=60 + declare -i ret=1 + declare state ++ do_status ++ '[' 1 = 1 ']' ++ virsh_status ++ declare state pid ++ '[' xen = xen ']' ++ service xend status ++ '[' 0 -ne 0 ']' +++ pidof libvirtd ++ pid=7002 ++ '[' -z 7002 ']' +++ virsh domstate Log-Server ++ state='shut off' ++ echo shut off ++ '[' 'shut off' = running ']' ++ '[' 'shut off' = paused ']' ++ '[' 'shut off' = 'no state' ']' ++ '[' 'shut off' = idle ']' ++ return 1 ++ return 1 + state='shut off' + '[' 1 -eq 0 ']' + return 0 + return 0 + exit 0 Please, it is very important to get this running as it should! Regards, rene -----Urspr?ngliche Nachricht----- Von: Lon Hohberger [mailto:lhh at redhat.com] Gesendet: Mittwoch, 30. 
September 2009 17:08 An: r.rosenberger at netbiscuits.com Cc: 'linux clustering' Betreff: Re: AW: [Linux-cluster] Problems starting a VM Service On Wed, 2009-09-30 at 15:25 +0200, Rene Rosenberger wrote: > Hi, > > rgmanager-2.0.52-1 > > [root at cluster-node02 ~]# cat /etc/cluster/cluster.conf > > > post_join_delay="3"/> > > nodeid="1" votes="1"> > > > > > > > nodeid="2" votes="1"> > > > > > > > > > > login="root" name="Fence_Device_01" passwd="emoveo11wap"/> > login="root" name="Fence_Device_02" passwd="emoveo11wap"/> > > > > nofailback="0" ordered="0" restricted="0"> > name="cluster-node01.netbiscuits.com" priority="1"/> > name="cluster-node02.netbiscuits.com" priority="1"/> > > > > migrate="live" name="Nagios" path="/rootfs/vm/" recovery="relocate"/> > migrate="live" name="Log-Server" path="/rootfs/vm/" recovery="relocate"/> > > > > Regards, rene Ok, so it's not the one fixed here: http://git.fedorahosted.org/git/?p=cluster.git;a=commit;h=a9ac1e635c559b4651 2cf4251fe71c015bb6d70f I don't recall if this will matter much, but get rid of the trailing slash on /rootfs/vm/. Also, ensure /rootfs/vm/Nagios and /root/fs/vm/Log-Server file names match the names contained within the respective config files. (e.g. name = "Nagios" / name = "Log-Server" ) This is because rgmanager wants a vm "name" but xm wants a "config file" - so they have to match. -- Lon From merhar at arlut.utexas.edu Thu Oct 1 12:39:59 2009 From: merhar at arlut.utexas.edu (David Merhar) Date: Thu, 1 Oct 2009 07:39:59 -0500 Subject: [Linux-cluster] Cluster 3.0.3: compile error Message-ID: RHEL 5.4 kernel 2.6.31.1 corosync 1.1.0 openais 1.1.0 corosync and openais install without issue. ... /root/cluster-3.0.3/gfs/gfs_mkfs/device_geometry.c: In function 'device_geometry': /root/cluster-3.0.3/gfs/gfs_mkfs/device_geometry.c:33: error: 'O_CLOEXEC' undeclared (first use in this function) /root/cluster-3.0.3/gfs/gfs_mkfs/device_geometry.c:33: error: (Each undeclared identifier is reported only once /root/cluster-3.0.3/gfs/gfs_mkfs/device_geometry.c:33: error: for each function it appears in.) My testing environment is vm. Any place to start looking? Thanks djm From brem.belguebli at gmail.com Thu Oct 1 14:52:51 2009 From: brem.belguebli at gmail.com (brem belguebli) Date: Thu, 1 Oct 2009 16:52:51 +0200 Subject: [Linux-cluster] Fwd: CLVM exclusive mode In-Reply-To: <29ae894c0908171146u5130bb71t1e8a573eda5f3311@mail.gmail.com> References: <4A7AA7DE02000027000536E3@lucius.provo.novell.com> <4A7A8240.7000206@redhat.com> <29ae894c0908160459m34dd02dcp8601e63d7bf62c3a@mail.gmail.com> <1cafab770908170417n16919354h38e22bcb97d31bf7@mail.gmail.com> <29ae894c0908170710t14071b3ai59f793b9390b9b38@mail.gmail.com> <20090817181205.GK24960@edu.joroinen.fi> <29ae894c0908171146u5130bb71t1e8a573eda5f3311@mail.gmail.com> Message-ID: <29ae894c0910010752x52e2c01en6aa3868e756546c0@mail.gmail.com> Hi , For those concerned by this topic, Redhat made an update that will hopefully be operational. https://bugzilla.redhat.com/show_bug.cgi?id=517900 Thx Regards 2009/8/17, brem belguebli : > done, > > Bug 517900 > > > 2009/8/17, Pasi K?rkk?inen : > > On Mon, Aug 17, 2009 at 04:10:01PM +0200, brem belguebli wrote: > > > Hi, > > > > > > Thanks a lot. > > > > > > I'll try it, but would be enjoyed if RH could implement it. > > > > > > > Did you already open bugzilla entry about it? > > > > Quote from this same thread: > > > > "I think it makes no sense at all, and have already said so on this list. 
> > As far as I know there is no bugzilla for this problem and therefore it > > isn't being worked on. > > > > So ... if you care about this ... you know what to do ;-) > > > > Chrissie" > > > > -- Pasi > > > > > Regards > > > > > > > > > 2009/8/17, Xinwei Hu : > > > > > > > > Hi, > > > > > > > > Attached a very naive try to solve the issue you have. > > > > > > > > Would you give it a test ? > > > > > > > > Thanks. > > > > > > > > 2009/8/16 brem belguebli : > > > > > Hi, > > > > > > > > > > I don't think adding more security can be considered as pointless, > > > > > especially when this has no impact on performance or behaviour. > > > > > The question is, what's the point in allowing the clustered active > > > > > exclusive lock to be bypassed ? > > > > > > > > > > In comparison to other volume management solutions (on various > unices) > > > > where > > > > > these barriers are already implemented, the lack of them on Linux > can be > > > > > seen as a weakness. > > > > > Regards > > > > > > > > > > > > > > > 2009/8/6, Christine Caulfield : > > > > >> > > > > >> On 06/08/09 02:52, Jia Ju Zhang wrote: > > > > >>> > > > > >>> Just RFC: > > > > >>> I noticed that 'vgchange -ay' can convert the lock which locked by > > > > >>> 'vgchange -aey' > > > > >>> from EX to CR. Is that acceptable to change the logic into always > > > > >>> allocating a new lock > > > > >>> rather than converting an existing lock? > > > > >>> In that case, 'vgchange -ay' won't change the result of 'vgchange > > > > -aey'. > > > > >>> But if we really > > > > >>> want to convert the lock, we can firstly invoke 'vgchange -aen' to > > > > >>> release the EX lock, > > > > >>> then invoke the 'vgchange -ay'. > > > > >>> > > > > >>> Does this make sense? Or what side effect it may introduce? > > > > >> > > > > >> > > > > >> I think it makes no sense at all, and have already said so on this > list. > > > > >> As far as I know there is no bugzilla for this problem and > therefore it > > > > >> isn't being worked on. > > > > >> > > > > >> So ... if you care about this ... you know what to do ;-) > > > > >> > > > > >> Chrissie > > > > >> > > > > >>>>>> On 8/6/2009 at 9:39 AM, in message<4A7A346B.A94 : 39 : 18251>, > Jia > > > > Ju > > > > >>>>>> Zhang > > > > >>> > > > > >>> wrote: > > > > >>>> > > > > >>>> On Fri, 2009-07-31 at 21:29 +0200, brem belguebli wrote: > > > > >>>>> > > > > >>>>> Hi, > > > > >>>>> > > > > >>>>> Same behaviour as the one from Rafael. > > > > >>>>> > > > > >>>>> Everything is coherent as long as you use the exclusive flag > from the > > > > >>>>> rogue node, the locking does the job. Deactivating an already > opened > > > > >>>>> VG (mounted lvol) is not possible either. How could this behave > in > > > > >>>>> case one used raw devices instead of FS ? > > > > >>>>> > > > > >>>>> But when you come to ignore the exclusive flag on the rogue node > > > > >>>>> (vgchange -a y vgXX) the locking is completely bypassed. It's > > > > >>>>> definitely here that the watchdog has to be (within the tools > > > > >>>>> lvchange, vgchange, or at dlm level). > > > > >>>> > > > > >>>> Is there an open bugzilla # for this? Would like to follow this > issue. 
> > > > >>>> > > > > >>> > > > > >>> > > > > >>> > > > > >>> -- > > > > >>> Linux-cluster mailing list > > > > >>> Linux-cluster at redhat.com > > > > >>> > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > >> > > > > >> -- > > > > >> Linux-cluster mailing list > > > > >> Linux-cluster at redhat.com > > > > >> > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > > > > > > > > > > > > -- > > > > > Linux-cluster mailing list > > > > > Linux-cluster at redhat.com > > > > > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > > > > > > > > > > -- > > > > Linux-cluster mailing list > > > > Linux-cluster at redhat.com > > > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > > > > > > > > > > -- > > > Linux-cluster mailing list > > > Linux-cluster at redhat.com > > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > From rmicmirregs at gmail.com Thu Oct 1 20:57:03 2009 From: rmicmirregs at gmail.com (Rafael =?ISO-8859-1?Q?Mic=F3?= Miranda) Date: Thu, 01 Oct 2009 22:57:03 +0200 Subject: [Linux-cluster] Fwd: CLVM exclusive mode In-Reply-To: <29ae894c0910010752x52e2c01en6aa3868e756546c0@mail.gmail.com> References: <4A7AA7DE02000027000536E3@lucius.provo.novell.com> <4A7A8240.7000206@redhat.com> <29ae894c0908160459m34dd02dcp8601e63d7bf62c3a@mail.gmail.com> <1cafab770908170417n16919354h38e22bcb97d31bf7@mail.gmail.com> <29ae894c0908170710t14071b3ai59f793b9390b9b38@mail.gmail.com> <20090817181205.GK24960@edu.joroinen.fi> <29ae894c0908171146u5130bb71t1e8a573eda5f3311@mail.gmail.com> <29ae894c0910010752x52e2c01en6aa3868e756546c0@mail.gmail.com> Message-ID: <1254430623.7174.3.camel@mecatol> Hi Brem, El jue, 01-10-2009 a las 16:52 +0200, brem belguebli escribi?: > Hi , > > For those concerned by this topic, > Redhat made an update that will hopefully be operational. > > https://bugzilla.redhat.com/show_bug.cgi?id=517900 > > Thx > > Regards > Thanks a lot for the report, I just checked the bugzilla a couple days ago. If RedHat does fix this "bug", i'll try again to upload the lvm-cluster.sh resource script, with some kind of readme to make it into the project. Cheers and thanks again, Rafael -- Rafael Mic? Miranda From criley at erad.com Thu Oct 1 21:25:50 2009 From: criley at erad.com (Charles Riley) Date: Thu, 1 Oct 2009 17:25:50 -0400 (EDT) Subject: [Linux-cluster] fsid in RHEL 4 cluster suite Message-ID: <19203585.2631254432350603.JavaMail.root@boardwalk2.erad.com> Hi, When you add an ext3 filesystem in system-config-cluster on rhel 4, there is a "filesystem id" property that is automatically populated by some means if left blank. How does this fsid get generated? Charles Riley eRAD, Inc. From nicolas.ferre at univ-provence.fr Fri Oct 2 07:25:55 2009 From: nicolas.ferre at univ-provence.fr (=?ISO-8859-1?Q?Nicolas_Ferr=E9?=) Date: Fri, 02 Oct 2009 09:25:55 +0200 Subject: [Linux-cluster] Too large load on the login node In-Reply-To: <8a5668960910010246l6f9ab4ccw4c96d5a36ed11e02@mail.gmail.com> References: <4AC4781C.5080307@univ-provence.fr> <8a5668960910010246l6f9ab4ccw4c96d5a36ed11e02@mail.gmail.com> Message-ID: <4AC5AB03.8060500@univ-provence.fr> Juan Ramon Martin Blanco a ?crit : > On Thu, Oct 1, 2009 at 11:36 AM, Nicolas Ferr? > wrote: >> Hi, >> >> We recently installed a new cluster composed of 1 login node and several >> computing nodes running CentOS. These nodes share a GFS2 fs made of two >> partitions. 
>> >> A strange thing is that on each node, the activity load (as monitored by the >> 'top' command) is always larger than 1. After some googling, it seems >> someone already reported this problem but I can't see any solution. >> Moreover, on the login node, the load is even larger: >> top - 11:34:17 up 1 day, 22:07, 1 user, load average: 16.19, 16.20, 16.12 >> while there is no cpu-intensive running processes. >> > Which cluster-suite versions are you using? > Look for processes in state D or R (ps aux) and paste them here. > Hi, We are running the red hat cluster manager, cman-2.0.115-1.el5. Also installed gfs2-utils-0.1.62-1.el5, luci-0.12.1-7.3.el5.centos.1, ricci-0.12.1-7.3.el5.centos.1. About the processes, I still have to waut because we rebooted the cluster. The high load appears after some time. From nicolas.ferre at univ-provence.fr Fri Oct 2 07:28:14 2009 From: nicolas.ferre at univ-provence.fr (=?UTF-8?B?Tmljb2xhcyBGZXJyw6k=?=) Date: Fri, 02 Oct 2009 09:28:14 +0200 Subject: [Linux-cluster] Too large load on the login node In-Reply-To: <1254390405.2697.8.camel@localhost.localdomain> References: <4AC4781C.5080307@univ-provence.fr> <1254390405.2697.8.camel@localhost.localdomain> Message-ID: <4AC5AB8E.4050806@univ-provence.fr> Steven Whitehouse a ?crit : > Hi, > > On Thu, 2009-10-01 at 11:36 +0200, Nicolas Ferr? wrote: >> Hi, >> >> We recently installed a new cluster composed of 1 login node and several >> computing nodes running CentOS. These nodes share a GFS2 fs made of two >> partitions. >> >> A strange thing is that on each node, the activity load (as monitored by >> the 'top' command) is always larger than 1. After some googling, it >> seems someone already reported this problem but I can't see any solution. >> Moreover, on the login node, the load is even larger: >> top - 11:34:17 up 1 day, 22:07, 1 user, load average: 16.19, 16.20, 16.12 >> while there is no cpu-intensive running processes. >> >> Do you have an explanation? > > Processes in uninterruptible sleep are counted in the load average. One > of the gfs2 daemons (in early versions) was set to sleep in this way. I > suggest that you should upgrade to a more recent version (simply because > a number of other bugs have been fixed since then) although the > uninterruptible sleep is harmless aside from its effect on the load > average, > As far as I know, our system is up-to-date. > uname -a Linux slater.up.univ-mrs.fr 2.6.18-164.el5 #1 SMP Thu Sep 3 03:28:30 EDT 2009 x86_64 x86_64 x86_64 GNU/Linu > rpm -qa|grep gfs kmod-gfs-0.1.31-3.el5_3.1 gfs2-utils-0.1.62-1.el5 gfs-utils-0.1.18-1.el5 From mythtv at logic-q.nl Fri Oct 2 12:23:11 2009 From: mythtv at logic-q.nl (Hansa) Date: Fri, 2 Oct 2009 14:23:11 +0200 Subject: [Linux-cluster] cman_tool: aisexec daemon didn't start Message-ID: Hi, I'm trying to set up a virtual storage cluster (Xen) for testing reasons. For some reason the aisexec daemon won't start when executing cman: # service cman start Starting cluster: Loading modules... done Mounting configfs... done Starting ccsd... done Starting cman... failed /usr/sbin/cman_tool: aisexec daemon didn't start [FAILED] Also see attached logs and config files I'm a cluster noob so some help is greatly appreciated. Tnx. -------------- next part -------------- A non-text attachment was scrubbed... Name: messages.log Type: application/octet-stream Size: 2558 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: openais.conf Type: application/octet-stream Size: 260 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: cluster.conf Type: application/octet-stream Size: 490 bytes Desc: not available URL: From JRFlores at INNODATA-ISOGEN.COM Fri Oct 2 14:52:20 2009 From: JRFlores at INNODATA-ISOGEN.COM (Flores, John Robert) Date: Fri, 2 Oct 2009 22:52:20 +0800 Subject: [Linux-cluster] GFS Volume Failover problem in 2 node cluster Message-ID: <80DC14582C212946AFC6F8DF49FF085B0465236E@VIRMDEEXC01.MANDAUE.INNODATA.NET> Hi, I'm setting up a 2 node cluster using GFS on a SAN and everything is ok until one node is forcibly shutdown. The shutdown is done for testing the failover process. Initially it the service and resources fails over on the other node but once node 1 shutdowns and tries to umount the FS the active node 2 FS suddenly hangs. here's the output using the group_tool command. type level name id state fence 0 default 00010001 FAIL_START_WAIT [2] dlm 1 rgmanager 00020001 none [2] dlm 1 GFS 00040001 FAIL_ALL_STOPPED [1 2] dlm 1 clvmd 00050001 none [2] gfs 2 GFS 00030001 FAIL_ALL_STOPPED [1 2] Thanks, John Disclaimer: ----------- "This e-mail and any files transmitted with it are for the sole use of the intended recipient(s) and may contain confidential and privileged information. If you are not the intended recipient, please contact the sender by reply e-mail and destroy all copies of the original message. Any unauthorized review, use, disclosure, dissemination, forwarding, printing or copying of this e-mail or any action taken in reliance on this e-mail is strictly prohibited and may be unlawful." -------------- next part -------------- An HTML attachment was scrubbed... URL: From sean.clark at twcable.com Fri Oct 2 15:44:06 2009 From: sean.clark at twcable.com (Clark, Sean) Date: Fri, 2 Oct 2009 11:44:06 -0400 Subject: [Linux-cluster] cman_tool: aisexec daemon didn't start In-Reply-To: References: Message-ID: <5B2EC65098246C4B93F38754D4C739BC164848D956@PRVPEXVS11.corp.twcable.com> Update to openais-0.80.6-8.i386.rpm Search web for SRC rpm and build it I had the same problem - I am guessing you just recently did a yum update and rebooted a node? -Sean -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Hansa Sent: Friday, October 02, 2009 8:23 AM To: linux-cluster at redhat.com Subject: [Linux-cluster] cman_tool: aisexec daemon didn't start Hi, I'm trying to set up a virtual storage cluster (Xen) for testing reasons. For some reason the aisexec daemon won't start when executing cman: # service cman start Starting cluster: Loading modules... done Mounting configfs... done Starting ccsd... done Starting cman... failed /usr/sbin/cman_tool: aisexec daemon didn't start [FAILED] Also see attached logs and config files I'm a cluster noob so some help is greatly appreciated. Tnx. This E-mail and any of its attachments may contain Time Warner Cable proprietary information, which is privileged, confidential, or subject to copyright belonging to Time Warner Cable. This E-mail is intended solely for the use of the individual or entity to which it is addressed. If you are not the intended recipient of this E-mail, you are hereby notified that any dissemination, distribution, copying, or action taken in relation to the contents of and attachments to this E-mail is strictly prohibited and may be unlawful. 
If you have received this E-mail in error, please notify the sender immediately and permanently delete the original and any copy of this E-mail and any printout. From allen at isye.gatech.edu Fri Oct 2 16:27:17 2009 From: allen at isye.gatech.edu (Allen Belletti) Date: Fri, 02 Oct 2009 12:27:17 -0400 Subject: [Linux-cluster] GFS2 read only node for backups Message-ID: <4AC629E5.2050100@isye.gatech.edu> Hi All, So, as I've mentioned here before, I run GFS2 on a two node mail cluster, generally with good success. One issue which I am trying to sort out is the backups. Currently we use rsync each night to create a backup, and we're storing 60 days' worth that way, using "--link-dest" to avoid creating 60 copies of each identical file. This works well, but slowly (7 hours per night), and the backups have quite a lot of performance impact on the production servers. Further, it is my *suspicion* that the very large amount locking traffic contributes to the fairly frequent "file stuck locked" issues which come up several times per month, requiring a reboot. Right now I'm in the process of migrating the cluster nodes to new hardware, which means I've got an "extra" node capable of mounting GFS2 and being experimented with. Browsing the man pages turned up the "spectator" mount option which seemed like exactly what I wanted -- the ability to do a read only mount that doesn't interfere with the rest of the cluster. To my surprise, it does indeed mount read-only but it still generates a huge amount of locking traffic on the back end network. Although this does keep our "production" nodes from accumulating hundreds of thousands of locks, and thus perhaps improves their reliability, I was hoping for more. Btw, "spectator" does not work in conjunction with "lockproto=lock_nolock". So next, I tried mounting with "ro,lockproto=lock_nolock" thinking that it would give me a purely non-interfering mount. This failed for two reasons. One, these startup messages scared me into thinking that the "recovery" process might corrupt the filesystem. Apparently "ro" doesn't quite mean "ro": > Oct 1 18:34:12 post2-new kernel: GFS2: fsid=: Trying to join cluster > "lock_nolock", "mail_cluster:mail_fac" > Oct 1 18:34:12 post2-new kernel: GFS2: fsid=mail_cluster:mail_fac.0: > Joined cluster. Now mounting FS... > Oct 1 18:34:12 post2-new kernel: GFS2: fsid=mail_cluster:mail_fac.0: > jid=0, already locked for use > Oct 1 18:34:12 post2-new kernel: GFS2: fsid=mail_cluster:mail_fac.0: > jid=0: Looking at journal... > Oct 1 18:34:12 post2-new kernel: GFS2: fsid=mail_cluster:mail_fac.0: > jid=0: Acquiring the transaction lock... > Oct 1 18:34:12 post2-new kernel: GFS2: fsid=mail_cluster:mail_fac.0: > recovery required on read-only filesystem. > Oct 1 18:34:12 post2-new kernel: GFS2: fsid=mail_cluster:mail_fac.0: > write access will be enabled during recovery. > Oct 1 18:34:12 post2-new kernel: GFS2: fsid=mail_cluster:mail_fac.0: > jid=0: Replaying journal... > Oct 1 18:34:12 post2-new kernel: GFS2: fsid=mail_cluster:mail_fac.0: > jid=0: Replayed 26 of 27 blocks > Oct 1 18:34:12 post2-new kernel: GFS2: fsid=mail_cluster:mail_fac.0: > jid=0: Found 1 revoke tags > Oct 1 18:34:12 post2-new kernel: GFS2: fsid=mail_cluster:mail_fac.0: > jid=0: Journal replayed in 1s > Oct 1 18:34:12 post2-new kernel: GFS2: fsid=mail_cluster:mail_fac.0: > jid=0: Done > Oct 1 18:34:12 post2-new kernel: GFS2: fsid=mail_cluster:mail_fac.0: > jid=1: Trying to acquire journal lock... 
> Oct 1 18:34:12 post2-new kernel: GFS2: fsid=mail_cluster:mail_fac.0: > jid=1: Looking at journal... > Oct 1 18:34:12 post2-new kernel: GFS2: fsid=mail_cluster:mail_fac.0: > jid=1: Acquiring the transaction lock... > Oct 1 18:34:12 post2-new kernel: GFS2: fsid=mail_cluster:mail_fac.0: > recovery required on read-only filesystem. > Oct 1 18:34:12 post2-new kernel: GFS2: fsid=mail_cluster:mail_fac.0: > write access will be enabled during recovery. > Oct 1 18:34:12 post2-new kernel: GFS2: fsid=mail_cluster:mail_fac.0: > jid=1: Replaying journal... > Oct 1 18:34:12 post2-new kernel: GFS2: fsid=mail_cluster:mail_fac.0: > jid=1: Replayed 28 of 34 blocks > Oct 1 18:34:12 post2-new kernel: GFS2: fsid=mail_cluster:mail_fac.0: > jid=1: Found 6 revoke tags > Oct 1 18:34:12 post2-new kernel: GFS2: fsid=mail_cluster:mail_fac.0: > jid=1: Journal replayed in 1s > Oct 1 18:34:12 post2-new kernel: GFS2: fsid=mail_cluster:mail_fac.0: > jid=1: Done > Oct 1 18:34:12 post2-new kernel: GFS2: fsid=mail_cluster:mail_fac.0: > jid=2: Trying to acquire journal lock... > Oct 1 18:34:12 post2-new kernel: GFS2: fsid=mail_cluster:mail_fac.0: > jid=2: Looking at journal... > Oct 1 18:34:13 post2-new kernel: GFS2: fsid=mail_cluster:mail_fac.0: > jid=2: Done > Oct 1 18:34:13 post2-new kernel: GFS2: fsid=mail_cluster:mail_fac.0: > jid=3: Trying to acquire journal lock... > Oct 1 18:34:13 post2-new kernel: GFS2: fsid=mail_cluster:mail_fac.0: > jid=3: Looking at journal... > Oct 1 18:34:13 post2-new kernel: GFS2: fsid=mail_cluster:mail_fac.0: > jid=3: Done The second reason it failed is that after a couple of hours, the mount failed as follows: > Oct 1 19:19:24 post2-new kernel: GFS2: fsid=mail_cluster:mail_fac.0: > fatal: invalid metadata block > Oct 1 19:19:24 post2-new kernel: GFS2: > fsid=mail_cluster:mail_fac.0: bh = 54432241 (magic number) > Oct 1 19:19:24 post2-new kernel: GFS2: > fsid=mail_cluster:mail_fac.0: function = gfs2_meta_indirect_buffer, > file = /builddir/build/B > UILD/gfs2-kmod-1.92/_kmod_build_/meta_io.c, line = 334 > Oct 1 19:19:24 post2-new kernel: GFS2: fsid=mail_cluster:mail_fac.0: > about to withdraw this file system > Oct 1 19:19:24 post2-new kernel: GFS2: fsid=mail_cluster:mail_fac.0: > telling LM to withdraw > Oct 1 19:19:24 post2-new kernel: GFS2: fsid=mail_cluster:mail_fac.0: > withdrawn (I have the call trace as well, if anybody's interested.) Thinking about this, it seems clear that the failure occurred because some other node changed things while my poor, confused read-only & no locking node was reading them. This makes sense. So I'm wondering two things: 1. What does spectator mode do exactly? Is it just the same as specifying "ro" or are there other optimizations? 2. Would it be possible to have a mount mode that's strictly read-only, no locking, and incorporates tolerance for errors? After all, I'm backing up Maildirs (a few million individual files) every night. If I miss a few messages one night, it's unlike to matter. So if we could return an i/o error for a particular file without withdrawing from the cluster, that would be wonderful. Better yet, why not purge the cached data relating to this particular file and read it from disk again. Most likely, that'll fetch valid data and the file will be accessible again. Thanks in advance for any thoughts that you might have! 
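For readers unfamiliar with the nightly --link-dest rotation described at the top of this message, a minimal sketch of that kind of rsync invocation looks something like the following; the backup paths and date-based directory layout are hypothetical, not the actual setup discussed here:

TODAY=$(date +%F)
YESTERDAY=$(date -d yesterday +%F)
# unchanged files are hard-linked against yesterday's snapshot, so each
# nightly copy only consumes extra space for files that actually changed
rsync -a --delete --link-dest=/backup/$YESTERDAY/ /mnt/gfs2/mail/ /backup/$TODAY/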
Allen -- Allen Belletti allen at isye.gatech.edu 404-894-6221 Phone Industrial and Systems Engineering 404-385-2988 Fax Georgia Institute of Technology From jeff.sturm at eprize.com Fri Oct 2 18:50:56 2009 From: jeff.sturm at eprize.com (Jeff Sturm) Date: Fri, 2 Oct 2009 14:50:56 -0400 Subject: [Linux-cluster] GFS2 read only node for backups In-Reply-To: <4AC629E5.2050100@isye.gatech.edu> References: <4AC629E5.2050100@isye.gatech.edu> Message-ID: <64D0546C5EBBD147B75DE133D798665F03F3EC03@hugo.eprize.local> > -----Original Message----- > From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] > On Behalf Of Allen Belletti > Sent: Friday, October 02, 2009 12:27 PM > To: linux clustering > Subject: [Linux-cluster] GFS2 read only node for backups > > 1. What does spectator mode do exactly? Is it just the same as > specifying "ro" or are there other optimizations? A GFS spectator mount cannot successfully mount a read-only block device. It requires read-write access to the device. So it behaves a little differently than an ordinary "ro" mount. Plus, with an ordinary "ro" mount the files will all be unchanging, whereas a spectator mount allows files and directories to be modified (by another node) while it is mounted, even though the node with the spectator mount cannot write to the filesystem. Thus the spectator mount is not strictly a read-only or read-write mount, but sort of a hybrid of the two. We don't use GFS2, but I am guessing it is similar in this regard. > 2. Would it be possible to have a mount mode that's strictly read-only, > no locking, and incorporates tolerance for errors? I don't know for certain, but I don't think the lock_nolock approach will work, unless you can snapshot the volume. We were faced with the same questions you have. Here's what we did: a) run "gfs_tool freeze" on the mounted GFS filesystem to suspend activity. b) take a volume snapshot on our SAN block storage device. c) run "gfs_tool unfreeze" to resume filesystem activity. d) repeat a-c for each mounted GFS filesystem. Once you have a snapshot, it of course can be mounted exclusively on a single node with lockproto=lock_nolock as the filesystem contents will be unchanging, and you can take as much time as necessary to rsync its contents to another storage device without impact to the active GFS filesystem (assuming sufficient disk/network bandwidth are available for the transfer). -Jeff From Nitin.Choudhary at palm.com Fri Oct 2 19:10:11 2009 From: Nitin.Choudhary at palm.com (Nitin Choudhary) Date: Fri, 2 Oct 2009 12:10:11 -0700 Subject: [Linux-cluster] Dell iDRAC 6 Support for fencing device In-Reply-To: <4AC3E007.8000405@ntsg.umt.edu> References: <8b711df40909161052i41f9c007na85cf2ef2919ae0f@mail.gmail.com><29ae894c0909251422h584b6f44hbb90e45fabe689d8@mail.gmail.com><8b711df40909251455m34ceb268q4ca54f65a9a0bdd3@mail.gmail.com><29ae894c0909251507g3c6bd665j8cb379125f0f83b8@mail.gmail.com><8b711df40909251524w3dba10ddr9a8fbce6544f5c42@mail.gmail.com><29ae894c0909251553u209abddane9b48a4f5390c6b7@mail.gmail.com><8b711df40909280803i25cc920aq8a1819ffbaa5aaa6@mail.gmail.com><29ae894c0909281528u9cb9497h8fa7a2377468abff@mail.gmail.com><29ae894c0909281546j69186706t61399a5cd5d4c130@mail.gmail.com><8b711df40909281620r32163721j60ea5d75028d92de@mail.gmail.com><29ae894c0909281641n3479f380ge68b2077ab6b0665@mail.gmail.com> <6683EA6E3C0A4B0E8881E698DDA3AD3D@homeuser> <4AC3E007.8000405@ntsg.umt.edu> Message-ID: Hi! I have pasted the modified version of fence script for Dell iDRAC6. 
Thanks, Nitin #!/usr/bin/perl # The following agent has been tested on: # # Model DRAC Version Firmware # ------------------- -------------- ---------------------- # PowerEdge 750 DRAC III/XT 3.20 (Build 10.25) # Dell Remote Access Controller - ERA and DRAC III/XT, v.3.20, A00 # # PowerEdge 1855 DRAC/MC 1.1 (Build 03.03) # PowerEdge 1855 DRAC/MC 1.2 (Build 03.03) # PowerEdge 1855 DRAC/MC 1.3 (Build 06.12) # PowerEdge 1850 DRAC 4/I 1.35 (Build 09.27) # PowerEdge 1850 DRAC 4/I 1.40 (Build 08.24) # PowerEdge 1950 DRAC 5 1.0 (Build 06.05.12) # use Getopt::Std; use Net::Telnet (); # Get the program name from $0 and strip directory names $_=$0; s/.*\///; my $pname = $_; my $telnet_timeout = 10; # Seconds to wait for matching telent response my $power_timeout = 20; # time to wait in seconds for power state changes $action = 'reboot'; # Default fence action. my $logged_in = 0; my $quiet = 0; my $t = new Net::Telnet; my $DRAC_VERSION_UNKNOWN = '__unknown__'; my $DRAC_VERSION_III_XT = 'DRAC III/XT'; my $DRAC_VERSION_MC = 'DRAC/MC'; my $DRAC_VERSION_4I = 'DRAC 4/I'; my $DRAC_VERSION_4P = 'DRAC 4/P'; my $DRAC_VERSION_5 = 'DRAC 5'; my $DRAC_VERSION_6 = 'DRAC 6'; my $PWR_CMD_SUCCESS = "/^OK/"; my $PWR_CMD_SUCCESS_DRAC5 = "/^Server power operation successful$/"; # WARNING!! Do not add code bewteen "#BEGIN_VERSION_GENERATION" and # "#END_VERSION_GENERATION" It is generated by the Makefile #BEGIN_VERSION_GENERATION $FENCE_RELEASE_NAME="2.0.115"; $REDHAT_COPYRIGHT=("Copyright (C) Red Hat, Inc. 2004 All rights reserved."); $BUILD_DATE="(built Wed Sep 2 11:45:31 EDT 2009)"; #END_VERSION_GENERATION sub usage { print "Usage:\n"; print "\n"; print "$pname [options]\n"; print "\n"; print "Options:\n"; print " -a IP address or hostname of DRAC\n"; print " -c force DRAC command prompt\n"; print " -d force DRAC version to use\n"; print " -D debugging output file\n"; print " -h usage\n"; print " -l Login name\n"; print " -m DRAC/MC module name\n"; print " -o Action: reboot (default), off or on\n"; print " -p Login password\n"; print " -S Script to run to retrieve password\n"; print " -q quiet mode\n"; print " -V version\n"; print "\n"; print "CCS Options:\n"; print " action = \"string\" Action: reboot (default), off or on\n"; print " debug = \"debugfile\" debugging output file\n"; print " ipaddr = \"ip\" IP address or hostname of DRAC\n"; print " login = \"name\" Login name\n"; print " passwd = \"string\" Login password\n"; print " passwd_script = \"path\" Script to run to retrieve password\n"; exit 0; } sub msg { ($msg)=@_; print $msg."\n" unless $quiet; } sub fail { ($msg)=@_; print $msg."\n" unless $quiet; if (defined $t) { # make sure we don't get stuck in a loop due to errors $t->errmode('return'); logout() if $logged_in; $t->close } exit 1; } sub fail_usage { ($msg)=@_; print STDERR $msg."\n" if $msg; print STDERR "Please use '-h' for usage.\n"; exit 1; } sub version { print "$pname $FENCE_RELEASE_NAME $BUILD_DATE\n"; print "$REDHAT_COPYRIGHT\n" if ( $REDHAT_COPYRIGHT ); exit 0; } sub login { $t->open($address) or fail "failed: telnet open failed: ". $t->errmsg."\n"; # Expect 'Login: ' ($_) = $t->waitfor(Match => "/[Ll]ogin: /", Timeout=>15) or fail "failed: telnet failed: ". 
$t->errmsg."\n" ; # Determine DRAC version if (/Dell Embedded Remote Access Controller \(ERA\)\nFirmware Version/m) { $drac_version = $DRAC_VERSION_III_XT; } else { if (/.*\((DRAC[^)]*)\)/m) { print "detected drac version '$1'\n" if $verbose; $drac_version = $1 unless defined $drac_version; print "WARNING: detected drac version '$1' but using " . "user defined version '$drac_version'\n" if ($drac_version ne $1); } else { $drac_version = $DRAC_VERSION_UNKNOWN; } } # Setup prompt if ($drac_version =~ /$DRAC_VERSION_III_XT/) { $cmd_prompt = "/\\[$login\\]# /" unless defined $cmd_prompt; } elsif ($drac_version =~ /$DRAC_VERSION_MC/) { $cmd_prompt = "/DRAC\\/MC:/" unless defined $cmd_prompt; } elsif ($drac_version =~ /$DRAC_VERSION_4I/) { $cmd_prompt = "/\\[$login\\]# /" unless defined $cmd_prompt; } elsif ($drac_version =~ /$DRAC_VERSION_4P/) { $cmd_prompt = "/\\[$login\\]# /" unless defined $cmd_prompt; } else { $drac_version = $DRAC_VERSION_UNKNOWN; } # Take a guess as to what the prompt might be if not already defined $cmd_prompt="/(\\[$login\\]# |DRAC\\/MC:|\\\$ )/" unless defined $cmd_prompt; # Send login $t->print($login); # Expect 'Password: ' $t->waitfor("/Password: /") or fail "failed: timeout waiting for password"; # Send password $t->print($passwd); # DRAC5 prints version controller version info # only after you've logged in. if ($drac_version eq $DRAC_VERSION_UNKNOWN) { if ($t->waitfor(Match => "/.*\($DRAC_VERSION_5\)/m")) { $drac_version = $DRAC_VERSION_5; $cmd_prompt = "/\\\$ /"; $PWR_CMD_SUCCESS = $PWR_CMD_SUCCESS_DRAC5; } elsif ($t->waitfor(Match => "/.*\(admin\)/m")) { $drac_version = $DRAC_VERSION_5; $cmd_prompt = '/> $/'; $PWR_CMD_SUCCESS = $PWR_CMD_SUCCESS_DRAC5; } else { print "WARNING: unable to detect DRAC version '$_'\n"; } } $t->waitfor($cmd_prompt) or fail "failed: invalid username or password"; if ($drac_version eq $DRAC_VERSION_UNKNOWN) { print "WARNING: unsupported DRAC version '$drac_version'\n"; } $logged_in = 1; } # # Set the power status of the node # sub set_power_status { my ($state,$dummy) = @_; my $cmd,$svr_action; if ( $state =~ /^on$/) { $svr_action = "powerup" } elsif( $state =~ /^off$/) { $svr_action = "powerdown" } if ($drac_version eq $DRAC_VERSION_MC) { $cmd = "serveraction -m $modulename -d 0 $svr_action"; } elsif ($drac_version eq $DRAC_VERSION_5) { $cmd = "racadm serveraction $svr_action"; } else { $cmd = "serveraction -d 0 $svr_action"; } $t->print($cmd); # Expect /$cmd_prompt/ ($_) = $t->waitfor($cmd_prompt) or fail "failed: unexpected serveraction response"; my @cmd_out = split /\n/; # discard command sent to DRAC $_ = shift @cmd_out; s/\e\[(([0-9]+;)*[0-9]+)*[ABCDfHJKmsu]//g; #strip ansi chars s/^.*\x0D//; fail "failed: unkown dialog exception: '$_'" unless (/^$cmd$/); # Additional lines of output probably means an error. # Aborting to be safe. 
Note: additional user debugging will be # necessary, run with -D and -v flags my $err; while (@cmd_out) { $_ = shift @cmd_out; #firmware vers 1.2 on DRAC/MC sends ansi chars - evil s/\e\[(([0-9]+;)*[0-9]+)*[ABCDfHJKmsu]//g; s/^.*\x0D//; next if (/^\s*$/); # skip empty lines if (defined $err) { $err = $err."\n$_"; } else { next if ($PWR_CMD_SUCCESS); $err = $_; } } fail "failed: unexpected response: '$err'" if defined $err; } # # get the power status of the node and return it in $status and $_ # sub get_power_status { my $status; my $modname = $modulename; my $cmd; if ($drac_version eq $DRAC_VERSION_5) { $cmd = "racadm serveraction powerstatus"; } else { $cmd = "getmodinfo"; } $t->print($cmd); ($_) = $t->waitfor($cmd_prompt); my $found_header = 0; my $found_module = 0; my @cmd_out = split /\n/; # discard command sent to DRAC $_ = shift @cmd_out; #strip ansi control chars s/\e\[(([0-9]+;)*[0-9]+)*[ABCDfHJKmsu]//g; s/^.*\x0D//; fail "failed: unkown dialog exception: '$_'" unless (/^$cmd$/); if ($drac_version ne $DRAC_VERSION_5) { #Expect: # # # 1 ----> chassis Present ON Normal CQXYV61 # # Note: DRAC/MC has many entries in the table whereas DRAC III has only # a single table entry. while (1) { $_ = shift @cmd_out; if (/^#\s*\s*\s*\s*\s*/) { $found_header = 1; last; } } fail "failed: invalid 'getmodinfo' header: '$_'" unless $found_header; } foreach (@cmd_out) { s/^\s+//g; #strip leading space s/\s+$//g; #strip training space if ($drac_version eq $DRAC_VERSION_5) { if(m/^Server power status: (\w+)/) { $status = lc($1); } } else { my ($group,$arrow,$module,$presence,$pwrstate,$health, $svctag,$junk) = split /\s+/; if ($drac_version eq $DRAC_VERSION_III_XT || $drac_version eq $DRAC_VERSION_4I || $drac_version eq $DRAC_VERSION_4P) { fail "failed: extraneous output detected from 'getmodinfo'" if $found_module; $found_module = 1; $modname = $module; } if ($modname eq $module) { fail "failed: duplicate module names detected" if $status; $found_module = 1; fail "failed: module not reported present" unless ($presence =~ /Present/); $status = $pwrstate; } } } if ($drac_version eq $DRAC_VERSION_MC) { fail "failed: module '$modulename' not detected" unless $found_module; } $_=$status; if(/^(on|off)$/i) { # valid power states } elsif ($status) { fail "failed: unknown power state '$status'"; } else { fail "failed: unable to determine power state"; } } # Wait upto $power_timeout seconds for node power state to change to # $state before erroring out. # # return 1 on success # return 0 on failure # sub wait_power_status { my ($state,$dummy) = @_; my $status; $state = lc $state; for (my $i=0; $i<$power_timeout ; $i++) { get_power_status; $status = $_; my $check = lc $status; if ($state eq $check ) { return 1 } sleep 1; } $_ = "timed out waiting to power $state"; return 0; } # # logout of the telnet session # sub logout { $t->print(""); $t->print("exit"); } # # error routine for Net::Telnet instance # sub telnet_error { fail "failed: telnet returned: ".$t->errmsg."\n"; } # # execute the action. Valid actions are 'on' 'off' 'reboot' and 'status'. 
# TODO: add 'configure' that uses racadm rpm to enable telnet on the drac # sub do_action { get_power_status; my $status = $_; if ($action =~ /^on$/i) { if ($status =~ /^on$/i) { msg "success: already on"; return; } set_power_status on; fail "failed: $_" unless wait_power_status on; msg "success: powered on"; } elsif ($action =~ /^off$/i) { if ($status =~ /^off$/i) { msg "success: already off"; return; } set_power_status off; fail "failed: $_" unless wait_power_status off; msg "success: powered off"; } elsif ($action =~ /^reboot$/i) { if ( !($status =~ /^off$/i) ) { set_power_status off; } fail "failed: $_" unless wait_power_status off; set_power_status on; fail "failed: $_" unless wait_power_status on; msg "success: rebooted"; } elsif ($action =~ /^status$/i) { msg "status: $status"; return; } else { fail "failed: unrecognised action: '$action'"; } } # # Decipher STDIN parameters # sub get_options_stdin { my $opt; my $line = 0; while( defined($in = <>) ) { $_ = $in; chomp; # strip leading and trailing whitespace s/^\s*//; s/\s*$//; # skip comments next if /^#/; $line+=1; $opt=$_; next unless $opt; ($name,$val)=split /\s*=\s*/, $opt; if ( $name eq "" ) { print STDERR "parse error: illegal name in option $line\n"; exit 2; } # DO NOTHING -- this field is used by fenced elsif ($name eq "agent" ) { } elsif ($name eq "ipaddr" ) { $address = $val; } elsif ($name eq "login" ) { $login = $val; } elsif ($name eq "action" ) { $action = $val; } elsif ($name eq "passwd" ) { $passwd = $val; } elsif ($name eq "passwd_script" ) { $passwd_script = $val; } elsif ($name eq "debug" ) { $debug = $val; } elsif ($name eq "modulename" ) { $modulename = $val; } elsif ($name eq "drac_version" ) { $drac_version = $val; } elsif ($name eq "cmd_prompt" ) { $cmd_prompt = $val; } # Excess name/vals will fail else { fail "parse error: unknown option \"$opt\""; } } } ### MAIN ####################################################### # # Check parameters # if (@ARGV > 0) { getopts("a:c:d:D:hl:m:o:p:S:qVv") || fail_usage ; usage if defined $opt_h; version if defined $opt_V; $quiet = 1 if defined $opt_q; $debug = $opt_D; fail_usage "Unknown parameter." if (@ARGV > 0); fail_usage "No '-a' flag specified." unless defined $opt_a; $address = $opt_a; fail_usage "No '-l' flag specified." unless defined $opt_l; $login = $opt_l; $modulename = $opt_m if defined $opt_m; if (defined $opt_S) { $pwd_script_out = `$opt_S`; chomp($pwd_script_out); if ($pwd_script_out) { $opt_p = $pwd_script_out; } } fail_usage "No '-p' or '-S' flag specified." 
unless defined $opt_p; $passwd = $opt_p; $verbose = $opt_v if defined $opt_v; $cmd_prompt = $opt_c if defined $opt_c; $drac_version = $opt_d if defined $opt_d; if ($opt_o) { fail_usage "Unrecognised action '$opt_o' for '-o' flag" unless $opt_o =~ /^(Off|On|Reboot|status)$/i; $action = $opt_o; } } else { get_options_stdin(); fail "failed: no IP address" unless defined $address; fail "failed: no login name" unless defined $login; if (defined $passwd_script) { $pwd_script_out = `$passwd_script`; chomp($pwd_script_out); if ($pwd_script_out) { $passwd = $pwd_script_out; } } fail "failed: no password" unless defined $passwd; fail "failed: unrecognised action: $action" unless $action =~ /^(Off|On|Reboot|status)$/i; } $t->timeout($telnet_timeout); $t->input_log($debug) if $debug; $t->errmode('return'); login; # Abort on failure beyond here $t->errmode(\&telnet_error); if ($drac_version eq $DRAC_VERSION_III_XT) { fail "failed: option 'modulename' not compatilble with DRAC version '$drac_version'" if defined $modulename; } elsif ($drac_version eq $DRAC_VERSION_MC) { fail "failed: option 'modulename' required for DRAC version '$drac_version'" unless defined $modulename; } do_action; logout; exit 0; -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Andrew A. Neuschwander Sent: Wednesday, September 30, 2009 3:48 PM To: linux clustering Subject: Re: [Linux-cluster] Dell iDRAC 6 Support for fencing device Could you post your modified fence_drac for iDRAC 6? Thanks, -A -- Andrew A. Neuschwander, RHCE Systems/Software Engineer College of Forestry and Conservation The University of Montana http://www.ntsg.umt.edu andrew at ntsg.umt.edu - 406.243.6310 Nitin Choudhary wrote: > Hi! > > With small modification to fence_drac script it is working now. > > Thanks, > > Nitin > > > -----Original Message----- > From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Louis > Sent: Tuesday, September 29, 2009 6:16 PM > To: linux clustering > Subject: Re: [Linux-cluster] Dell iDRAC 6 Support for fencing device > > Hi, > > I used ipmilan to bypass the iDREC6 fencing. > > name="xxxxx" passwd="yyyy"> > > > Regards > Louis > ----- Original Message ----- > From: "Nitin Choudhary" > To: "linux clustering" > Sent: Tuesday, September 29, 2009 1:18 PM > Subject: [Linux-cluster] Dell iDRAC 6 Support for fencing device > > >> Hi! >> >> It seems that iDREC6 is not supported as fencing devices. >> >> Has anyone setup this before. Is there any workaround for this. 
>> >> Thanks, >> >> Nitin >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> >> > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From kbphillips80 at gmail.com Fri Oct 2 21:12:29 2009 From: kbphillips80 at gmail.com (Kaerka Phillips) Date: Fri, 2 Oct 2009 17:12:29 -0400 Subject: [Linux-cluster] Dell iDRAC 6 Support for fencing device In-Reply-To: References: <8b711df40909161052i41f9c007na85cf2ef2919ae0f@mail.gmail.com> <29ae894c0909281528u9cb9497h8fa7a2377468abff@mail.gmail.com> <29ae894c0909281546j69186706t61399a5cd5d4c130@mail.gmail.com> <8b711df40909281620r32163721j60ea5d75028d92de@mail.gmail.com> <29ae894c0909281641n3479f380ge68b2077ab6b0665@mail.gmail.com> <6683EA6E3C0A4B0E8881E698DDA3AD3D@homeuser> <4AC3E007.8000405@ntsg.umt.edu> Message-ID: Seeing as there is already a separate script for the DRAC 5 cards with beyond 1.20 firmware, perhaps this should be there or as a separate fence_drac moule? # file /sbin/fence_drac /sbin/fence_drac: perl script text executable # file /sbin/fence_drac5 /sbin/fence_drac5: python script text executable # I've found that the fence_drac no longer works with ver 1.34 and above firmware, but does work with 1.20 and below firmware. Any Dell PE 1950, 2950, or R series server with a DRAC5 will have newer than 1.20 firmware though if the server was manufactured in the last year or so. On Fri, Oct 2, 2009 at 3:10 PM, Nitin Choudhary wrote: > Hi! > > I have pasted the modified version of fence script for Dell iDRAC6. > Thanks, > Nitin > > #!/usr/bin/perl > > # The following agent has been tested on: > # > # Model DRAC Version Firmware > # ------------------- -------------- ---------------------- > # PowerEdge 750 DRAC III/XT 3.20 (Build 10.25) > # Dell Remote Access Controller - ERA and DRAC III/XT, v.3.20, A00 # > # PowerEdge 1855 DRAC/MC 1.1 (Build 03.03) > # PowerEdge 1855 DRAC/MC 1.2 (Build 03.03) > # PowerEdge 1855 DRAC/MC 1.3 (Build 06.12) > # PowerEdge 1850 DRAC 4/I 1.35 (Build 09.27) > # PowerEdge 1850 DRAC 4/I 1.40 (Build 08.24) > # PowerEdge 1950 DRAC 5 1.0 (Build 06.05.12) > # > > use Getopt::Std; > use Net::Telnet (); > > # Get the program name from $0 and strip directory names $_=$0; s/.*\///; > my $pname = $_; > > my $telnet_timeout = 10; # Seconds to wait for matching telent > response > my $power_timeout = 20; # time to wait in seconds for power state > changes > $action = 'reboot'; # Default fence action. > > my $logged_in = 0; > my $quiet = 0; > > my $t = new Net::Telnet; > > my $DRAC_VERSION_UNKNOWN = '__unknown__'; > my $DRAC_VERSION_III_XT = 'DRAC III/XT'; > my $DRAC_VERSION_MC = 'DRAC/MC'; > my $DRAC_VERSION_4I = 'DRAC 4/I'; > my $DRAC_VERSION_4P = 'DRAC 4/P'; > my $DRAC_VERSION_5 = 'DRAC 5'; > my $DRAC_VERSION_6 = 'DRAC 6'; > > my $PWR_CMD_SUCCESS = "/^OK/"; > my $PWR_CMD_SUCCESS_DRAC5 = "/^Server power operation successful$/"; > > # WARNING!! Do not add code bewteen "#BEGIN_VERSION_GENERATION" and # > "#END_VERSION_GENERATION" It is generated by the Makefile > > #BEGIN_VERSION_GENERATION > $FENCE_RELEASE_NAME="2.0.115"; > $REDHAT_COPYRIGHT=("Copyright (C) Red Hat, Inc. 
2004 All rights > reserved."); $BUILD_DATE="(built Wed Sep 2 11:45:31 EDT 2009)"; > #END_VERSION_GENERATION > > sub usage > { > print "Usage:\n"; > print "\n"; > print "$pname [options]\n"; > print "\n"; > print "Options:\n"; > print " -a IP address or hostname of DRAC\n"; > print " -c force DRAC command prompt\n"; > print " -d force DRAC version to use\n"; > print " -D debugging output file\n"; > print " -h usage\n"; > print " -l Login name\n"; > print " -m DRAC/MC module name\n"; > print " -o Action: reboot (default), off or on\n"; > print " -p Login password\n"; > print " -S Script to run to retrieve password\n"; > print " -q quiet mode\n"; > print " -V version\n"; > print "\n"; > print "CCS Options:\n"; > print " action = \"string\" Action: reboot (default), off or > on\n"; > print " debug = \"debugfile\" debugging output file\n"; > print " ipaddr = \"ip\" IP address or hostname of DRAC\n"; > print " login = \"name\" Login name\n"; > print " passwd = \"string\" Login password\n"; > print " passwd_script = \"path\" Script to run to retrieve > password\n"; > > exit 0; > } > > sub msg > { > ($msg)=@_; > print $msg."\n" unless $quiet; > } > > sub fail > { > ($msg)=@_; > print $msg."\n" unless $quiet; > > if (defined $t) > { > # make sure we don't get stuck in a loop due to errors > $t->errmode('return'); > > logout() if $logged_in; > $t->close > } > exit 1; > } > > sub fail_usage > { > ($msg)=@_; > print STDERR $msg."\n" if $msg; > print STDERR "Please use '-h' for usage.\n"; > exit 1; > } > > sub version > { > print "$pname $FENCE_RELEASE_NAME $BUILD_DATE\n"; > print "$REDHAT_COPYRIGHT\n" if ( $REDHAT_COPYRIGHT ); > exit 0; > } > > > sub login > { > $t->open($address) or > fail "failed: telnet open failed: ". $t->errmsg."\n"; > > # Expect 'Login: ' > ($_) = $t->waitfor(Match => "/[Ll]ogin: /", Timeout=>15) or > fail "failed: telnet failed: ". $t->errmsg."\n" ; > > # Determine DRAC version > if (/Dell Embedded Remote Access Controller \(ERA\)\nFirmware Version/m) > { > $drac_version = $DRAC_VERSION_III_XT; > } else { > if (/.*\((DRAC[^)]*)\)/m) > { > print "detected drac version '$1'\n" if $verbose; > $drac_version = $1 unless defined $drac_version; > > print "WARNING: detected drac version '$1' but using " > . "user defined version '$drac_version'\n" > if ($drac_version ne $1); > } > else > { > $drac_version = $DRAC_VERSION_UNKNOWN; > } > } > > # Setup prompt > if ($drac_version =~ /$DRAC_VERSION_III_XT/) > { > $cmd_prompt = "/\\[$login\\]# /" > unless defined $cmd_prompt; > } > elsif ($drac_version =~ /$DRAC_VERSION_MC/) > { > $cmd_prompt = "/DRAC\\/MC:/" > unless defined $cmd_prompt; > } > elsif ($drac_version =~ /$DRAC_VERSION_4I/) > { > $cmd_prompt = "/\\[$login\\]# /" > unless defined $cmd_prompt; > } > elsif ($drac_version =~ /$DRAC_VERSION_4P/) > { > $cmd_prompt = "/\\[$login\\]# /" > unless defined $cmd_prompt; > } > else > { > $drac_version = $DRAC_VERSION_UNKNOWN; > } > > # Take a guess as to what the prompt might be if not already defined > $cmd_prompt="/(\\[$login\\]# |DRAC\\/MC:|\\\$ )/" unless defined > $cmd_prompt; > > > # Send login > $t->print($login); > > # Expect 'Password: ' > $t->waitfor("/Password: /") or > fail "failed: timeout waiting for password"; > > # Send password > $t->print($passwd); > > # DRAC5 prints version controller version info > # only after you've logged in. 
> if ($drac_version eq $DRAC_VERSION_UNKNOWN) { > if ($t->waitfor(Match => "/.*\($DRAC_VERSION_5\)/m")) { > $drac_version = $DRAC_VERSION_5; > $cmd_prompt = "/\\\$ /"; > $PWR_CMD_SUCCESS = $PWR_CMD_SUCCESS_DRAC5; > } elsif ($t->waitfor(Match => "/.*\(admin\)/m")) { > $drac_version = $DRAC_VERSION_5; > $cmd_prompt = '/> $/'; > $PWR_CMD_SUCCESS = $PWR_CMD_SUCCESS_DRAC5; > } else { > print "WARNING: unable to detect DRAC version > '$_'\n"; > } > } > > $t->waitfor($cmd_prompt) or > fail "failed: invalid username or password"; > > if ($drac_version eq $DRAC_VERSION_UNKNOWN) { > print "WARNING: unsupported DRAC version '$drac_version'\n"; > } > > $logged_in = 1; > } > > # > # Set the power status of the node > # > sub set_power_status > { > my ($state,$dummy) = @_; > my $cmd,$svr_action; > > if ( $state =~ /^on$/) { $svr_action = "powerup" } > elsif( $state =~ /^off$/) { $svr_action = "powerdown" } > > if ($drac_version eq $DRAC_VERSION_MC) > { > $cmd = "serveraction -m $modulename -d 0 $svr_action"; > } > elsif ($drac_version eq $DRAC_VERSION_5) { > $cmd = "racadm serveraction $svr_action"; > } else > { > $cmd = "serveraction -d 0 $svr_action"; > } > > $t->print($cmd); > > # Expect /$cmd_prompt/ > ($_) = $t->waitfor($cmd_prompt) or > fail "failed: unexpected serveraction response"; > > my @cmd_out = split /\n/; > > # discard command sent to DRAC > $_ = shift @cmd_out; > s/\e\[(([0-9]+;)*[0-9]+)*[ABCDfHJKmsu]//g; #strip ansi chars > s/^.*\x0D//; > > fail "failed: unkown dialog exception: '$_'" unless (/^$cmd$/); > > # Additional lines of output probably means an error. > # Aborting to be safe. Note: additional user debugging will be > # necessary, run with -D and -v flags > my $err; > while (@cmd_out) > { > $_ = shift @cmd_out; > #firmware vers 1.2 on DRAC/MC sends ansi chars - evil > s/\e\[(([0-9]+;)*[0-9]+)*[ABCDfHJKmsu]//g; > s/^.*\x0D//; > > next if (/^\s*$/); # skip empty lines > if (defined $err) > { > $err = $err."\n$_"; > } > else > { > next if ($PWR_CMD_SUCCESS); > $err = $_; > } > } > fail "failed: unexpected response: '$err'" if defined $err; } > > > # > # get the power status of the node and return it in $status and $_ # sub > get_power_status { > my $status; > my $modname = $modulename; > my $cmd; > > if ($drac_version eq $DRAC_VERSION_5) { > $cmd = "racadm serveraction powerstatus"; > } else { > $cmd = "getmodinfo"; > } > > $t->print($cmd); > > ($_) = $t->waitfor($cmd_prompt); > > my $found_header = 0; > my $found_module = 0; > > my @cmd_out = split /\n/; > > # discard command sent to DRAC > $_ = shift @cmd_out; > #strip ansi control chars > s/\e\[(([0-9]+;)*[0-9]+)*[ABCDfHJKmsu]//g; > s/^.*\x0D//; > > fail "failed: unkown dialog exception: '$_'" unless (/^$cmd$/); > > if ($drac_version ne $DRAC_VERSION_5) { > #Expect: > # # > > # 1 ----> chassis Present ON Normal > CQXYV61 > # > # Note: DRAC/MC has many entries in the table whereas DRAC > III has only > # a single table entry. 
> > while (1) > { > $_ = shift @cmd_out; > if > (/^#\s*\s*\s*\s*\s*/) > { > $found_header = 1; > last; > } > } > fail "failed: invalid 'getmodinfo' header: '$_'" unless > $found_header; > } > > foreach (@cmd_out) > { > s/^\s+//g; #strip leading space > s/\s+$//g; #strip training space > > if ($drac_version eq $DRAC_VERSION_5) { > if(m/^Server power status: (\w+)/) { > $status = lc($1); > } > } else { > my > ($group,$arrow,$module,$presence,$pwrstate,$health, > $svctag,$junk) = split /\s+/; > > if ($drac_version eq $DRAC_VERSION_III_XT || > $drac_version eq $DRAC_VERSION_4I || $drac_version eq $DRAC_VERSION_4P) > { > fail "failed: extraneous output detected > from 'getmodinfo'" if $found_module; > $found_module = 1; > $modname = $module; > } > > if ($modname eq $module) > { > fail "failed: duplicate module names > detected" if $status; > $found_module = 1; > > fail "failed: module not reported present" > unless ($presence =~ /Present/); > $status = $pwrstate; > } > > } > } > > if ($drac_version eq $DRAC_VERSION_MC) > { > fail "failed: module '$modulename' not detected" unless > $found_module; > } > > $_=$status; > if(/^(on|off)$/i) > { > # valid power states > } > elsif ($status) > { > fail "failed: unknown power state '$status'"; > } > else > { > fail "failed: unable to determine power state"; > } > } > > > # Wait upto $power_timeout seconds for node power state to change to # > $state before erroring out. > # > # return 1 on success > # return 0 on failure > # > sub wait_power_status > { > my ($state,$dummy) = @_; > my $status; > > $state = lc $state; > > for (my $i=0; $i<$power_timeout ; $i++) > { > get_power_status; > $status = $_; > my $check = lc $status; > > if ($state eq $check ) { return 1 } > sleep 1; > } > $_ = "timed out waiting to power $state"; > return 0; > } > > # > # logout of the telnet session > # > sub logout > { > $t->print(""); > $t->print("exit"); > } > > # > # error routine for Net::Telnet instance # sub telnet_error { > fail "failed: telnet returned: ".$t->errmsg."\n"; } > > # > # execute the action. Valid actions are 'on' 'off' 'reboot' and 'status'. 
> # TODO: add 'configure' that uses racadm rpm to enable telnet on the drac # > sub do_action { > get_power_status; > my $status = $_; > > if ($action =~ /^on$/i) > { > if ($status =~ /^on$/i) > { > msg "success: already on"; > return; > } > > set_power_status on; > fail "failed: $_" unless wait_power_status on; > > msg "success: powered on"; > } > elsif ($action =~ /^off$/i) > { > if ($status =~ /^off$/i) > { > msg "success: already off"; > return; > } > > set_power_status off; > fail "failed: $_" unless wait_power_status off; > > msg "success: powered off"; > } > elsif ($action =~ /^reboot$/i) > { > if ( !($status =~ /^off$/i) ) > { > set_power_status off; > } > fail "failed: $_" unless wait_power_status off; > > set_power_status on; > fail "failed: $_" unless wait_power_status on; > > msg "success: rebooted"; > } > elsif ($action =~ /^status$/i) > { > msg "status: $status"; > return; > } > else > { > fail "failed: unrecognised action: '$action'"; > } > } > > # > # Decipher STDIN parameters > # > sub get_options_stdin > { > my $opt; > my $line = 0; > while( defined($in = <>) ) > { > $_ = $in; > chomp; > > # strip leading and trailing whitespace > s/^\s*//; > s/\s*$//; > > # skip comments > next if /^#/; > > $line+=1; > $opt=$_; > next unless $opt; > > ($name,$val)=split /\s*=\s*/, $opt; > > if ( $name eq "" ) > { > print STDERR "parse error: illegal name in option > $line\n"; > exit 2; > } > # DO NOTHING -- this field is used by fenced > elsif ($name eq "agent" ) > { > } > elsif ($name eq "ipaddr" ) > { > $address = $val; > } > elsif ($name eq "login" ) > { > $login = $val; > } > elsif ($name eq "action" ) > { > $action = $val; > } > elsif ($name eq "passwd" ) > { > $passwd = $val; > } > elsif ($name eq "passwd_script" ) > { > $passwd_script = $val; > } > elsif ($name eq "debug" ) > { > $debug = $val; > } > elsif ($name eq "modulename" ) > { > $modulename = $val; > } > elsif ($name eq "drac_version" ) > { > $drac_version = $val; > } > elsif ($name eq "cmd_prompt" ) > { > $cmd_prompt = $val; > } > # Excess name/vals will fail > else > { > fail "parse error: unknown option \"$opt\""; > } > } > } > > > ### MAIN ####################################################### > > # > # Check parameters > # > if (@ARGV > 0) { > getopts("a:c:d:D:hl:m:o:p:S:qVv") || fail_usage ; > > usage if defined $opt_h; > version if defined $opt_V; > > $quiet = 1 if defined $opt_q; > $debug = $opt_D; > > fail_usage "Unknown parameter." if (@ARGV > 0); > > fail_usage "No '-a' flag specified." unless defined $opt_a; > $address = $opt_a; > > fail_usage "No '-l' flag specified." unless defined $opt_l; > $login = $opt_l; > > $modulename = $opt_m if defined $opt_m; > > if (defined $opt_S) { > $pwd_script_out = `$opt_S`; > chomp($pwd_script_out); > if ($pwd_script_out) { > $opt_p = $pwd_script_out; > } > } > > fail_usage "No '-p' or '-S' flag specified." 
unless defined $opt_p; > $passwd = $opt_p; > > $verbose = $opt_v if defined $opt_v; > > $cmd_prompt = $opt_c if defined $opt_c; > $drac_version = $opt_d if defined $opt_d; > > if ($opt_o) > { > fail_usage "Unrecognised action '$opt_o' for '-o' flag" > unless $opt_o =~ /^(Off|On|Reboot|status)$/i; > $action = $opt_o; > } > > } else { > get_options_stdin(); > > fail "failed: no IP address" unless defined $address; > fail "failed: no login name" unless defined $login; > > if (defined $passwd_script) { > $pwd_script_out = `$passwd_script`; > chomp($pwd_script_out); > if ($pwd_script_out) { > $passwd = $pwd_script_out; > } > } > > fail "failed: no password" unless defined $passwd; > fail "failed: unrecognised action: $action" > unless $action =~ /^(Off|On|Reboot|status)$/i; } > > > $t->timeout($telnet_timeout); > $t->input_log($debug) if $debug; > $t->errmode('return'); > > login; > > # Abort on failure beyond here > $t->errmode(\&telnet_error); > > if ($drac_version eq $DRAC_VERSION_III_XT) { > fail "failed: option 'modulename' not compatilble with DRAC version > '$drac_version'" > if defined $modulename; > } > elsif ($drac_version eq $DRAC_VERSION_MC) { > fail "failed: option 'modulename' required for DRAC version > '$drac_version'" > unless defined $modulename; > } > > do_action; > > logout; > > exit 0; > > > -----Original Message----- > From: linux-cluster-bounces at redhat.com [mailto: > linux-cluster-bounces at redhat.com] On Behalf Of Andrew A. Neuschwander > Sent: Wednesday, September 30, 2009 3:48 PM > To: linux clustering > Subject: Re: [Linux-cluster] Dell iDRAC 6 Support for fencing device > > Could you post your modified fence_drac for iDRAC 6? > > Thanks, > -A > -- > Andrew A. Neuschwander, RHCE > Systems/Software Engineer > College of Forestry and Conservation > The University of Montana > http://www.ntsg.umt.edu > andrew at ntsg.umt.edu - 406.243.6310 > > > Nitin Choudhary wrote: > > Hi! > > > > With small modification to fence_drac script it is working now. > > > > Thanks, > > > > Nitin > > > > > > -----Original Message----- > > From: linux-cluster-bounces at redhat.com [mailto: > linux-cluster-bounces at redhat.com] On Behalf Of Louis > > Sent: Tuesday, September 29, 2009 6:16 PM > > To: linux clustering > > Subject: Re: [Linux-cluster] Dell iDRAC 6 Support for fencing device > > > > Hi, > > > > I used ipmilan to bypass the iDREC6 fencing. > > > > > name="xxxxx" passwd="yyyy"> > > > > > > Regards > > Louis > > ----- Original Message ----- > > From: "Nitin Choudhary" > > To: "linux clustering" > > Sent: Tuesday, September 29, 2009 1:18 PM > > Subject: [Linux-cluster] Dell iDRAC 6 Support for fencing device > > > > > >> Hi! > >> > >> It seems that iDREC6 is not supported as fencing devices. > >> > >> Has anyone setup this before. Is there any workaround for this. 
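Before letting fenced loose on an agent modified like this, it can be sanity-checked by hand, both the way an administrator would call it (command-line flags) and the way fenced actually invokes it (key=value pairs on stdin). A rough test, with the script path, address and credentials below standing in for the real ones:

  # command-line style, capturing the telnet session in a debug file
  ./fence_drac -a 10.0.0.50 -l root -p calvin -o status -D /tmp/drac.debug

  # stdin style, mimicking how fenced drives the agent
  printf 'ipaddr=10.0.0.50\nlogin=root\npasswd=calvin\naction=status\n' | ./fence_drac

If "status" reports correctly, an off/on cycle against a test machine is the next step; the -D log is the place to look when the prompt or DRAC version detection goes wrong.
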
> >> > >> Thanks, > >> > >> Nitin > >> > >> -- > >> Linux-cluster mailing list > >> Linux-cluster at redhat.com > >> https://www.redhat.com/mailman/listinfo/linux-cluster > >> > >> > > > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mythtv at logic-q.nl Fri Oct 2 23:32:59 2009 From: mythtv at logic-q.nl (Hansa) Date: Sat, 3 Oct 2009 01:32:59 +0200 Subject: [Linux-cluster] cman_tool: aisexec daemon didn't start In-Reply-To: <5B2EC65098246C4B93F38754D4C739BC164848D956@PRVPEXVS11.corp.twcable.com> Message-ID: > Update to openais-0.80.6-8.i386.rpm > > Search web for SRC rpm and build it > > > I had the same problem - I am guessing you just recently did a > yum update and rebooted a node? That did the job! Thanks Sean > -----Original Message----- > From: linux-cluster-bounces at redhat.com > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Hansa > Sent: Friday, October 02, 2009 8:23 AM > To: linux-cluster at redhat.com > Subject: [Linux-cluster] cman_tool: aisexec daemon didn't start > > Hi, > > I'm trying to set up a virtual storage cluster (Xen) for testing reasons. > For some reason the aisexec daemon won't start when executing cman: > > # service cman start > Starting cluster: > Loading modules... done > Mounting configfs... done > Starting ccsd... done > Starting cman... failed > /usr/sbin/cman_tool: aisexec daemon didn't start [FAILED] > > Also see attached logs and config files > > I'm a cluster noob so some help is greatly appreciated. > Tnx. > This E-mail and any of its attachments may contain Time Warner > Cable proprietary information, which is privileged, confidential, > or subject to copyright belonging to Time Warner Cable. This E-mail > is intended solely for the use of the individual or entity to which > it is addressed. If you are not the intended recipient of this > E-mail, you are hereby notified that any dissemination, > distribution, copying, or action taken in relation to the contents > of and attachments to this E-mail is strictly prohibited and may be > unlawful. If you have received this E-mail in error, please notify > the sender immediately and permanently delete the original and any > copy of this E-mail and any printout. > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From nicolas.ferre at univ-provence.fr Sat Oct 3 11:13:51 2009 From: nicolas.ferre at univ-provence.fr (=?ISO-8859-1?Q?Nicolas_Ferr=E9?=) Date: Sat, 03 Oct 2009 13:13:51 +0200 Subject: [Linux-cluster] gfs2 partition withdrawn Message-ID: <4AC731EF.7080307@univ-provence.fr> Hi, We have a problem with our cluster, a gfs2 fs cannot be accessed some times after the system reboot. I have to manually umount/mount it. 
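A withdraw triggered by "fatal: invalid metadata block", as in the log excerpt that follows, usually points at on-disk inconsistency rather than a transient locking problem, so remounting tends only to postpone the next occurrence. The common advice is to unmount the filesystem on every node and run fsck.gfs2 (from gfs2-utils) against the shared device before mounting it again; a sketch, with the mount point and device path as placeholders:

  # on every node
  umount /home
  # on one node only, once the filesystem is unmounted cluster-wide
  fsck.gfs2 -y /dev/mapper/vg_home-lv_home
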
Here is the relevant part of /var/log/messages: Oct 3 11:46:14 slater kernel: GFS2: fsid=crcmm:home.1: fatal: invalid metadata block Oct 3 11:46:14 slater kernel: GFS2: fsid=crcmm:home.1: bh = 114419123 (magic number) Oct 3 11:46:14 slater kernel: GFS2: fsid=crcmm:home.1: function = gfs2_meta_indirect_buffer, file = fs/gfs2/meta_io.c, line = 334 Oct 3 11:46:14 slater kernel: GFS2: fsid=crcmm:home.1: about to withdraw this file system Oct 3 11:46:14 slater kernel: GFS2: fsid=crcmm:home.1: telling LM to withdraw Oct 3 11:46:16 slater kernel: VFS:Filesystem freeze failed Oct 3 11:46:22 slater snmpd[7344]: Connection from UDP: [127.0.0.1]:58640 Oct 3 11:46:22 slater snmpd[7344]: Received SNMP packet(s) from UDP: [127.0.0.1]:58640 Oct 3 11:46:37 slater snmpd[7344]: Connection from UDP: [127.0.0.1]:47125 Oct 3 11:46:37 slater snmpd[7344]: Received SNMP packet(s) from UDP: [127.0.0.1]:47125 Oct 3 11:46:53 slater snmpd[7344]: Connection from UDP: [127.0.0.1]:33910 Oct 3 11:46:53 slater snmpd[7344]: Received SNMP packet(s) from UDP: [127.0.0.1]:33910 Oct 3 11:46:53 slater kernel: dlm: home: group leave failed -512 0 Oct 3 11:46:53 slater kernel: GFS2: fsid=crcmm:home.1: withdrawn Oct 3 11:46:53 slater kernel: Oct 3 11:46:53 slater kernel: Call Trace: Oct 3 11:46:53 slater kernel: [] :gfs2:gfs2_lm_withdraw+0xc1/0xd0 Oct 3 11:46:53 slater kernel: [] __wait_on_bit+0x60/0x6e Oct 3 11:46:53 slater kernel: [] sync_buffer+0x0/0x3f Oct 3 11:46:53 slater kernel: [] out_of_line_wait_on_bit+0x6c/0x78 Oct 3 11:46:53 slater kernel: [] wake_bit_function+0x0/0x23 Oct 3 11:46:53 slater kernel: [] submit_bh+0x10a/0x111 Oct 3 11:46:53 slater kernel: [] :gfs2:gfs2_meta_check_ii+0x2c/0x38 Oct 3 11:46:53 slater kernel: [] :gfs2:gfs2_meta_indirect_buffer+0x104/0x15f Oct 3 11:46:53 slater kernel: [] :gfs2:gfs2_getbuf+0x106/0x115 Oct 3 11:46:53 slater kernel: [] :gfs2:recursive_scan+0x96/0x175 Oct 3 11:46:53 slater kernel: [] :gfs2:recursive_scan+0x13c/0x175 Oct 3 11:46:53 slater kernel: [] :gfs2:do_strip+0x0/0x349 Oct 3 11:46:53 slater kernel: [] :gfs2:trunc_dealloc+0x99/0xe7 Oct 3 11:46:53 slater kernel: [] :gfs2:do_strip+0x0/0x349 Oct 3 11:46:53 slater kernel: [] :gfs2:gfs2_delete_inode+0xdd/0x191 Oct 3 11:46:53 slater kernel: [] :gfs2:gfs2_delete_inode+0x46/0x191 Oct 3 11:46:53 slater kernel: [] :gfs2:gfs2_glock_schedule_for_reclaim+0x5d/0x9a Oct 3 11:46:53 slater kernel: [] :gfs2:gfs2_delete_inode+0x0/0x191 Oct 3 11:46:53 slater kernel: [] generic_delete_inode+0xc6/0x143 Oct 3 11:46:53 slater kernel: [] :gfs2:gfs2_inplace_reserve_i+0x63b/0x691 Oct 3 11:46:53 slater kernel: [] __up_read+0x19/0x7f Oct 3 11:46:53 slater kernel: [] :gfs2:do_promote+0xf5/0x137 Oct 3 11:46:53 slater kernel: [] :gfs2:gfs2_write_begin+0x16c/0x339 Oct 3 11:46:53 slater kernel: [] :gfs2:gfs2_file_buffered_write+0xf3/0x26c Oct 3 11:46:53 slater kernel: [] :gfs2:__gfs2_file_aio_write_nolock+0x258/0x28f Oct 3 11:46:53 slater kernel: [] :gfs2:gfs2_file_write_nolock+0xaa/0x10f Oct 3 11:46:54 slater kernel: [] generic_file_read+0xac/0xc5 Oct 3 11:46:54 slater kernel: [] autoremove_wake_function+0x0/0x2e Oct 3 11:46:54 slater kernel: [] :gfs2:gfs2_glock_schedule_for_reclaim+0x5d/0x9a Oct 3 11:46:54 slater kernel: [] autoremove_wake_function+0x0/0x2e Oct 3 11:46:54 slater kernel: [] :gfs2:gfs2_file_write+0x49/0xa7 Oct 3 11:46:54 slater kernel: [] vfs_write+0xce/0x174 Oct 3 11:46:54 slater kernel: [] sys_write+0x45/0x6e Oct 3 11:46:54 slater kernel: [] sysenter_do_call+0x1e/0x6a Oct 3 11:46:54 slater kernel: Oct 3 11:46:54 slater kernel: GFS2: 
fsid=crcmm:home.1: gfs2_delete_inode: -5 Can someone explain the meaning of such messages? And how to cure the problem ... Regards, -- Nicolas Ferre' Laboratoire Chimie Provence Universite' de Provence - France Tel: +33 491282733 http://sites.univ-provence.fr/lcp-ct From johannes.russek at io-consulting.net Sun Oct 4 12:02:39 2009 From: johannes.russek at io-consulting.net (jr) Date: Sun, 04 Oct 2009 14:02:39 +0200 Subject: [Linux-cluster] vm.sh with and without virsh Message-ID: <4AC88EDF.8040408@io-consulting.net> Hello everybody, I'm having trouble using the current vm.sh that uses virsh instead of xm commands. When i'm using the new one, the dom0s are unable to start any of my virtual machines. This is what i get: [root at PhySrv07 ~]# rg_test test /etc/cluster/cluster.conf start vm LVSDirector01 Running in test mode. Starting LVSDirector01... Hypervisor: xen Management tool: virsh Hypervisor URI: xen:/// Migration URI format: xenmigr://target_host/ Virtual machine LVSDirector01 is libvir: Xen error : Domain not found: xenUnifiedDomainLookupByName error: failed to get domain 'LVSDirector01' virsh -c xen:/// start LVSDirector01 libvir: Xen error : Domain not found: xenUnifiedDomainLookupByName error: failed to get domain 'LVSDirector01' Failed to start LVSDirector01 using the old non-virsh /usr/share/cluster/vm.sh agent i can successfully start the virtual machines: [root at PhySrv07 ~]# rg_test test /etc/cluster/cluster.conf start vm LVSDirector01 Running in test mode. Starting LVSDirector01... # xm command line: LVSDirector01 on_shutdown="destroy" on_reboot="destroy" on_crash="destroy" RGMANAGER_meta_refcnt="0" depend_mode="hard" exclusive="0" hardrecovery="0" max_restarts="0" --path="/cluster/XenDomains" restart_expire_time="0" Using config file "/cluster/XenDomains/LVSDirector01". Started domain LVSDirector01 Start of LVSDirector01 complete I get a feeling that the way I have my domUs prepared is not suited for virsh or I am missing some deps I should have met. What is required for vm.sh successfully using virsh? (I'd really rather have that for the synchronous/asynchronous nature of virsh vs xm). Thanks and best regards, Johannes From ntadmin at fi.upm.es Sun Oct 4 20:26:39 2009 From: ntadmin at fi.upm.es (Miguel Sanchez) Date: Sun, 04 Oct 2009 22:26:39 +0200 Subject: [Linux-cluster] vm.sh with and without virsh In-Reply-To: <4AC88EDF.8040408@io-consulting.net> References: <4AC88EDF.8040408@io-consulting.net> Message-ID: <4AC904FF.5040507@fi.upm.es> jr escribi?: > Hello everybody, > I'm having trouble using the current vm.sh that uses virsh instead of > xm commands. > When i'm using the new one, the dom0s are unable to start any of my > virtual machines. > This is what i get: > > [root at PhySrv07 ~]# rg_test test /etc/cluster/cluster.conf start vm > LVSDirector01 > Running in test mode. > Starting LVSDirector01... > Hypervisor: xen > Management tool: virsh > Hypervisor URI: xen:/// > Migration URI format: xenmigr://target_host/ > Virtual machine LVSDirector01 is libvir: Xen error : Domain not found: > xenUnifiedDomainLookupByName > error: failed to get domain 'LVSDirector01' > > virsh -c xen:/// start LVSDirector01 > libvir: Xen error : Domain not found: xenUnifiedDomainLookupByName > error: failed to get domain 'LVSDirector01' > > Failed to start LVSDirector01 > > using the old non-virsh /usr/share/cluster/vm.sh agent i can > successfully start the virtual machines: > > > [root at PhySrv07 ~]# rg_test test /etc/cluster/cluster.conf start vm > LVSDirector01 > Running in test mode. 
> Starting LVSDirector01... > # xm command line: LVSDirector01 on_shutdown="destroy" > on_reboot="destroy" on_crash="destroy" RGMANAGER_meta_refcnt="0" > depend_mode="hard" exclusive="0" hardrecovery="0" max_restarts="0" > --path="/cluster/XenDomains" restart_expire_time="0" > Using config file "/cluster/XenDomains/LVSDirector01". > Started domain LVSDirector01 > Start of LVSDirector01 complete > > I get a feeling that the way I have my domUs prepared is not suited > for virsh or I am missing some deps I should have met. > What is required for vm.sh successfully using virsh? (I'd really > rather have that for the synchronous/asynchronous nature of virsh vs xm). > Thanks and best regards, > Johannes > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster Use in your cluster configuration file. It will use xm instead of virsh. -- Miguel From johannes.russek at io-consulting.net Sun Oct 4 22:42:08 2009 From: johannes.russek at io-consulting.net (jr) Date: Mon, 05 Oct 2009 00:42:08 +0200 Subject: [Linux-cluster] vm.sh with and without virsh In-Reply-To: <4AC904FF.5040507@fi.upm.es> References: <4AC88EDF.8040408@io-consulting.net> <4AC904FF.5040507@fi.upm.es> Message-ID: <4AC924C0.10108@io-consulting.net> Miguel Sanchez wrote: >> https://www.redhat.com/mailman/listinfo/linux-cluster > Use in your cluster configuration file. It > will use xm instead of virsh. > Hello Miguel, I'm aware of that, however i would prefer to use virsh instead of xm especially due to the asynchronous nature of xm and the problems (with failed migrations for example) this creates.. Johannes > -- > Miguel > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From fdinitto at redhat.com Mon Oct 5 04:41:30 2009 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Mon, 05 Oct 2009 06:41:30 +0200 Subject: [Linux-cluster] Cluster 3.0.3: compile error In-Reply-To: References: Message-ID: <1254717690.28555.128.camel@cerberus.int.fabbione.net> On Thu, 2009-10-01 at 07:39 -0500, David Merhar wrote: > RHEL 5.4 > kernel 2.6.31.1 > corosync 1.1.0 > openais 1.1.0 > > corosync and openais install without issue. > > ... > /root/cluster-3.0.3/gfs/gfs_mkfs/device_geometry.c: In function > 'device_geometry': > /root/cluster-3.0.3/gfs/gfs_mkfs/device_geometry.c:33: error: > 'O_CLOEXEC' undeclared (first use in this function) > /root/cluster-3.0.3/gfs/gfs_mkfs/device_geometry.c:33: error: (Each > undeclared identifier is reported only once > /root/cluster-3.0.3/gfs/gfs_mkfs/device_geometry.c:33: error: for each > function it appears in.) > > > My testing environment is vm. > > Any place to start looking? O_CLOEXEC attribute for open(2) has been introduced only in more recent glibc versions. Either you update glibc or use a more recent distribution for testing. Fabio From jakov.sosic at srce.hr Mon Oct 5 08:29:11 2009 From: jakov.sosic at srce.hr (Jakov Sosic) Date: Mon, 5 Oct 2009 10:29:11 +0200 Subject: [Linux-cluster] vm.sh with and without virsh In-Reply-To: <4AC924C0.10108@io-consulting.net> References: <4AC88EDF.8040408@io-consulting.net> <4AC904FF.5040507@fi.upm.es> <4AC924C0.10108@io-consulting.net> Message-ID: <20091005102911.06881491@nb-jsosic> On Mon, 05 Oct 2009 00:42:08 +0200 jr wrote: > Hello Miguel, > I'm aware of that, however i would prefer to use virsh instead of xm > especially due to the asynchronous nature of xm and the problems > (with failed migrations for example) this creates.. 
What distribution are you using, and what is the version of rgmanager & cman? I'm interested in testing virsh too, because I also noticed those problems with xm (stalling while migrating). I'm on RHEL 5.3. -- | Jakov Sosic | ICQ: 28410271 | PGP: 0x965CAE2D | ================================================================= | start fighting cancer -> http://www.worldcommunitygrid.org/ | From gianluca.cecchi at gmail.com Mon Oct 5 10:08:43 2009 From: gianluca.cecchi at gmail.com (Gianluca Cecchi) Date: Mon, 5 Oct 2009 12:08:43 +0200 Subject: [Linux-cluster] info on "A processor failed" message and fencing when going to single user mode Message-ID: <561c252c0910050308h5e315dd6yfc595beaaedafd28@mail.gmail.com> Hello, 2 nodes cluster (virtfed and virtfedbis their names) with F11 x86_64 up2date as of today and without qdisk cman-3.0.2-1.fc11.x86_64 openais-1.0.1-1.fc11.x86_64 corosync-1.0.0-1.fc11.x86_64 and kernel 2.6.30.8-64.fc11.x86_64 I was in a situation where both nodes up, after virtfedbis hust restarted and starting a service Inside one of its resources there is a loop where it tests availability of a file and so it was in starting of this service, but infra ws up, as of this messages: Oct 5 11:44:39 virtfed corosync[4684]: [CLM ] CLM CONFIGURATION CHANGE Oct 5 11:44:39 virtfed corosync[4684]: [CLM ] New Configuration: Oct 5 11:44:39 virtfed corosync[4684]: [CLM ] #011r(0) ip(192.168.16.101) Oct 5 11:44:39 virtfed corosync[4684]: [CLM ] Members Left: Oct 5 11:44:39 virtfed corosync[4684]: [CLM ] #011r(0) ip(192.168.16.102) Oct 5 11:44:39 virtfed corosync[4684]: [CLM ] Members Joined: Oct 5 11:44:39 virtfed corosync[4684]: [QUORUM] This node is within the primary component and will provide service. Oct 5 11:44:39 virtfed corosync[4684]: [QUORUM] Members[1]: Oct 5 11:44:39 virtfed corosync[4684]: [QUORUM] 1 Oct 5 11:44:39 virtfed corosync[4684]: [CLM ] CLM CONFIGURATION CHANGE Oct 5 11:44:39 virtfed corosync[4684]: [CLM ] New Configuration: Oct 5 11:44:39 virtfed corosync[4684]: [CLM ] #011r(0) ip(192.168.16.101) Oct 5 11:44:39 virtfed corosync[4684]: [CLM ] Members Left: Oct 5 11:44:39 virtfed corosync[4684]: [CLM ] Members Joined: Oct 5 11:44:39 virtfed corosync[4684]: [TOTEM ] A processor joined or left the membership and a new membership was formed. Oct 5 11:44:39 virtfed kernel: dlm: closing connection to node 2 Oct 5 11:44:39 virtfed corosync[4684]: [MAIN ] Completed service synchronization, ready to provide service. So now they are at this condition, reported by virtfedbis [root at virtfedbis ~]# clustat Cluster Status for kvm @ Mon Oct 5 11:49:27 2009 Member Status: Quorate Member Name ID Status ------ ---- ---- ------ kvm1 1 Online, rgmanager kvm2 2 Online, Local, rgmanager Service Name Owner (Last) State ------- ---- ----- ------ ----- service:DRBDNODE1 kvm1 started service:DRBDNODE2 kvm2 starting I realize that I forgot a thing so that after 10 attempts DRBDNODE2 service would not come up and so I decide to put virtfedbis in single user mode, so that I run on it shutdown 0 I would expect virtfedbis to leave cleanly the cluster, instead it is fenced and rebooted (via fence_ilo agent) On virtfed these are the messages: Oct 5 11:49:49 virtfed corosync[4684]: [TOTEM ] A processor failed, forming new configuration. 
Oct 5 11:49:54 virtfed corosync[4684]: [CLM ] CLM CONFIGURATION CHANGE Oct 5 11:49:54 virtfed corosync[4684]: [CLM ] New Configuration: Oct 5 11:49:54 virtfed corosync[4684]: [CLM ] #011r(0) ip(192.168.16.101) Oct 5 11:49:54 virtfed corosync[4684]: [CLM ] Members Left: Oct 5 11:49:54 virtfed corosync[4684]: [CLM ] #011r(0) ip(192.168.16.102) Oct 5 11:49:54 virtfed corosync[4684]: [CLM ] Members Joined: Oct 5 11:49:54 virtfed corosync[4684]: [QUORUM] This node is within the primary component and will provide service. Oct 5 11:49:54 virtfed corosync[4684]: [QUORUM] Members[1]: Oct 5 11:49:54 virtfed corosync[4684]: [QUORUM] 1 Oct 5 11:49:54 virtfed corosync[4684]: [CLM ] CLM CONFIGURATION CHANGE Oct 5 11:49:54 virtfed corosync[4684]: [CLM ] New Configuration: Oct 5 11:49:54 virtfed corosync[4684]: [CLM ] #011r(0) ip(192.168.16.101) Oct 5 11:49:54 virtfed corosync[4684]: [CLM ] Members Left: Oct 5 11:49:54 virtfed corosync[4684]: [CLM ] Members Joined: Oct 5 11:49:54 virtfed corosync[4684]: [TOTEM ] A processor joined or left the membership and a new membership was formed. Oct 5 11:49:54 virtfed corosync[4684]: [MAIN ] Completed service synchronization, ready to provide service. Oct 5 11:49:54 virtfed kernel: dlm: closing connection to node 2 Oct 5 11:49:54 virtfed fenced[4742]: fencing node kvm2 Oct 5 11:49:54 virtfed rgmanager[5496]: State change: kvm2 DOWN Oct 5 11:50:26 virtfed fenced[4742]: fence kvm2 success What I find on virtfedbis after restart in /var/log/cluster directory is this: corosync.log Oct 05 11:49:49 corosync [TOTEM ] A processor failed, forming new configuration. Oct 05 11:49:49 corosync [TOTEM ] The network interface is down. Oct 05 11:49:54 corosync [CLM ] CLM CONFIGURATION CHANGE Oct 05 11:49:54 corosync [CLM ] New Configuration: Oct 05 11:49:54 corosync [CLM ] r(0) ip(127.0.0.1) Oct 05 11:49:54 corosync [CLM ] Members Left: Oct 05 11:49:54 corosync [CLM ] r(0) ip(192.168.16.102) Oct 05 11:49:54 corosync [CLM ] Members Joined: Oct 05 11:49:54 corosync [QUORUM] This node is within the primary component and will provide service. Oct 05 11:49:54 corosync [QUORUM] Members[1]: Oct 05 11:49:54 corosync [QUORUM] 1 Oct 05 11:49:54 corosync [CLM ] CLM CONFIGURATION CHANGE Oct 05 11:49:54 corosync [CLM ] New Configuration: Oct 05 11:49:54 corosync [CLM ] r(0) ip(127.0.0.1) Oct 05 11:49:54 corosync [CLM ] Members Left: Oct 05 11:49:54 corosync [CLM ] Members Joined: Oct 05 11:49:54 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed. Oct 05 11:49:54 corosync [CMAN ] Killing node kvm2 because it has rejoined the cluster with existing state I think there is something wrong in this behaviour.... This is a test cluster so I have no qdisk ..... Is this the cause inherent with my config that has: In general, if I do a shutdown -r now an one of the two nodes I have not thsi kind of problems..... Thanks for any insight, Gianluca -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From edsonmarquezani at gmail.com Mon Oct 5 12:10:11 2009 From: edsonmarquezani at gmail.com (Edson Marquezani Filho) Date: Mon, 5 Oct 2009 09:10:11 -0300 Subject: [Linux-cluster] vm.sh with and without virsh In-Reply-To: <20091005102911.06881491@nb-jsosic> References: <4AC88EDF.8040408@io-consulting.net> <4AC904FF.5040507@fi.upm.es> <4AC924C0.10108@io-consulting.net> <20091005102911.06881491@nb-jsosic> Message-ID: <2fc5f090910050510wd514a13m8739c0920938c3c0@mail.gmail.com> > > What distribution are you using, and what is the version of rgmanager & > cman? > > I'm interested in testing virsh too, because I also noticed those > problems with xm (stalling while migrating). I'm on RHEL 5.3. > I would like to hear from those who have some experience with vm.sh: can it be used as a resource on a service configuration for Rgmanager? I say this because all examples I have seen use it as a separated statement on cluster.conf, not within a service declaration. But, in order to put VMs up, I need to satisfy some requisites first, like LVs activations on the node. I wrote a minimal LSB-compliant script to be used as a service script, and handle VMs, then I configure a service for each VM I have, with a script and lvm-cluster resources. But, I'm considering to use vm.sh instead. Can you explain to me how it works, and how do you use it? Thank you. From daniela.anzellotti at roma1.infn.it Mon Oct 5 12:19:17 2009 From: daniela.anzellotti at roma1.infn.it (Daniela Anzellotti) Date: Mon, 05 Oct 2009 14:19:17 +0200 Subject: [Linux-cluster] openais issue In-Reply-To: <8b711df40909300811q1724aa68hfea589bbb32b4ce5@mail.gmail.com> References: <8b711df40909161052i41f9c007na85cf2ef2919ae0f@mail.gmail.com> <29ae894c0909281641n3479f380ge68b2077ab6b0665@mail.gmail.com> <8b711df40909290845x45e2a09aif37d6a3dd301de11@mail.gmail.com> <29ae894c0909290951v11a958e2k3a1aadce7f3b88e7@mail.gmail.com> <8b711df40909291337n2f26908dt363944c6238eb9f5@mail.gmail.com> <29ae894c0909291344l49a2a810t33582eb6c3932810@mail.gmail.com> <8b711df40909291354w55f92097wcdef691d0b239dee@mail.gmail.com> <4AC2986F.8050100@io-consulting.net> <8b711df40909300749i2af6a711v2f866c55a046a388@mail.gmail.com> <29ae894c0909300802oa8d72c5k892115a3e2f67db9@mail.gmail.com> <8b711df40909300811q1724aa68hfea589bbb32b4ce5@mail.gmail.com> Message-ID: <4AC9E445.3050702@roma1.infn.it> Hi all, I had a problem similar to Paras's one today: yum updated the following rpms last week and today (I had to restart the cluster) the cluster was not able to start vm: services. Oct 02 05:31:05 Updated: openais-0.80.6-8.el5.x86_64 Oct 02 05:31:07 Updated: cman-2.0.115-1.el5.x86_64 Oct 02 05:31:10 Updated: rgmanager-2.0.52-1.el5.x86_64 Oct 03 04:03:12 Updated: xen-libs-3.0.3-94.el5_4.1.x86_64 Oct 03 04:03:12 Updated: xen-libs-3.0.3-94.el5_4.1.i386 Oct 03 04:03:16 Updated: xen-3.0.3-94.el5_4.1.x86_64 So, after checked the vm.sh script, I add the declaration use_virsh="0" in the VM definition in the cluster.conf (as suggested by Brem, thanks!) and everything is now working again. BTW I didn't understand if the problem was caused by the new XEN version or the new openais one, thus I disabled automatic updates for both. I hope I'll not have any other bad surprise... Thank you, cheers, Daniela Paras pradhan wrote: > Yes this is very strange. I don't know what to do now. May be re > create the cluster? But not a good solution actually. 
> > Packages : > > Kernel: kernel-xen-2.6.18-164.el5 > OS: Full updated of CentOS 5.3 except CMAN downgraded to cman-2.0.98-1.el5 > > Other packages related to cluster suite: > > rgmanager-2.0.52-1.el5.centos > cman-2.0.98-1.el5 > xen-3.0.3-80.el5_3.3 > xen-libs-3.0.3-80.el5_3.3 > kmod-gfs-xen-0.1.31-3.el5_3.1 > kmod-gfs-xen-0.1.31-3.el5_3.1 > kmod-gfs-0.1.31-3.el5_3.1 > gfs-utils-0.1.18-1.el5 > gfs2-utils-0.1.62-1.el5 > lvm2-2.02.40-6.el5 > lvm2-cluster-2.02.40-7.el5 > openais-0.80.3-22.el5_3.9 > > Thanks! > Paras. > > > > > On Wed, Sep 30, 2009 at 10:02 AM, brem belguebli > wrote: >> Hi Paras, >> >> Your cluster.conf file seems correct. If it is not a ntp issue, I >> don't see anything except a bug that causes this, or some prerequisite >> that is not respected. >> >> May be you could post the versions (os, kernel, packages etc...) you >> are using, someone may have hit the same issue with your versions. >> >> Brem >> >> 2009/9/30, Paras pradhan : >>> All of the nodes are synced with ntp server. So this is not the case with me. >>> >>> Thanks >>> Paras. >>> >>> On Tue, Sep 29, 2009 at 6:29 PM, Johannes Ru?ek >>> wrote: >>>> make sure the time on the nodes is in sync, apparently when a node has too >>>> much offset, you won't see rgmanager (even though the process is running). >>>> this happened today and setting the time fixed it for me. afaicr there was >>>> no sign of this in the logs though. >>>> johannes >>>> >>>> Paras pradhan schrieb: >>>>> I don't see rgmanager . >>>>> >>>>> Here is the o/p from clustat >>>>> >>>>> [root at cvtst1 cluster]# clustat >>>>> Cluster Status for test @ Tue Sep 29 15:53:33 2009 >>>>> Member Status: Quorate >>>>> >>>>> Member Name ID >>>>> Status >>>>> ------ ---- ---- >>>>> ------ >>>>> cvtst2 1 Online >>>>> cvtst1 2 Online, >>>>> Local >>>>> cvtst3 3 Online >>>>> >>>>> >>>>> Thanks >>>>> Paras. >>>>> >>>>> On Tue, Sep 29, 2009 at 3:44 PM, brem belguebli >>>>> wrote: >>>>> >>>>>> It looks correct, rgmanager seems to start on all nodes >>>>>> >>>>>> what gives you clustat ? >>>>>> >>>>>> If rgmanager doesn't show, check out the logs something may have gone >>>>>> wrong. >>>>>> >>>>>> >>>>>> 2009/9/29 Paras pradhan : >>>>>> >>>>>>> Change to 7 and i got this log >>>>>>> >>>>>>> Sep 29 15:33:50 cvtst1 rgmanager: [23295]: Shutting down >>>>>>> Cluster Service Manager... >>>>>>> Sep 29 15:33:50 cvtst1 clurgmgrd[22869]: Shutting down >>>>>>> Sep 29 15:33:50 cvtst1 clurgmgrd[22869]: Shutting down >>>>>>> Sep 29 15:33:50 cvtst1 clurgmgrd[22869]: Shutdown complete, >>>>>>> exiting >>>>>>> Sep 29 15:33:50 cvtst1 rgmanager: [23295]: Cluster Service >>>>>>> Manager is stopped. 
>>>>>>> Sep 29 15:33:51 cvtst1 clurgmgrd[23324]: Resource Group >>>>>>> Manager Starting >>>>>>> Sep 29 15:33:51 cvtst1 clurgmgrd[23324]: Loading Service Data >>>>>>> Sep 29 15:33:51 cvtst1 clurgmgrd[23324]: Loading Resource Rules >>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: 21 rules loaded >>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: Building Resource Trees >>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: 0 resources defined >>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: Loading Failover >>>>>>> Domains >>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: 1 domains defined >>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: 1 events defined >>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: Initializing Services >>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: Services Initialized >>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: Event: Port Opened >>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: State change: Local UP >>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: State change: cvtst2 UP >>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: State change: cvtst3 UP >>>>>>> Sep 29 15:33:57 cvtst1 clurgmgrd[23324]: Event (1:2:1) Processed >>>>>>> Sep 29 15:33:57 cvtst1 clurgmgrd[23324]: Event (0:1:1) Processed >>>>>>> Sep 29 15:33:57 cvtst1 clurgmgrd[23324]: Event (0:3:1) Processed >>>>>>> Sep 29 15:34:02 cvtst1 clurgmgrd[23324]: 3 events processed >>>>>>> >>>>>>> >>>>>>> Anything unusual here? >>>>>>> >>>>>>> Paras. >>>>>>> >>>>>>> On Tue, Sep 29, 2009 at 11:51 AM, brem belguebli >>>>>>> wrote: >>>>>>> >>>>>>>> I use log_level=7 to have more debugging info. >>>>>>>> >>>>>>>> It seems 4 is not enough. >>>>>>>> >>>>>>>> Brem >>>>>>>> >>>>>>>> >>>>>>>> 2009/9/29, Paras pradhan : >>>>>>>> >>>>>>>>> Withe log_level of 3 I got only this >>>>>>>>> >>>>>>>>> Sep 29 10:31:31 cvtst1 rgmanager: [7170]: Shutting down >>>>>>>>> Cluster Service Manager... >>>>>>>>> Sep 29 10:31:31 cvtst1 clurgmgrd[6673]: Shutting down >>>>>>>>> Sep 29 10:31:41 cvtst1 clurgmgrd[6673]: Shutdown complete, >>>>>>>>> exiting >>>>>>>>> Sep 29 10:31:41 cvtst1 rgmanager: [7170]: Cluster Service >>>>>>>>> Manager is stopped. >>>>>>>>> Sep 29 10:31:42 cvtst1 clurgmgrd[7224]: Resource Group >>>>>>>>> Manager Starting >>>>>>>>> Sep 29 10:39:06 cvtst1 rgmanager: [10327]: Shutting down >>>>>>>>> Cluster Service Manager... >>>>>>>>> Sep 29 10:39:16 cvtst1 rgmanager: [10327]: Cluster Service >>>>>>>>> Manager is stopped. >>>>>>>>> Sep 29 10:39:16 cvtst1 clurgmgrd[10380]: Resource Group >>>>>>>>> Manager Starting >>>>>>>>> Sep 29 10:39:52 cvtst1 clurgmgrd[10380]: Member 1 shutting >>>>>>>>> down >>>>>>>>> >>>>>>>>> I do not know what the last line means. >>>>>>>>> >>>>>>>>> rgmanager version I am running is: >>>>>>>>> rgmanager-2.0.52-1.el5.centos >>>>>>>>> >>>>>>>>> I don't what has gone wrong. >>>>>>>>> >>>>>>>>> Thanks >>>>>>>>> Paras. >>>>>>>>> >>>>>>>>> >>>>>>>>> On Mon, Sep 28, 2009 at 6:41 PM, brem belguebli >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> you mean it stopped successfully on all the nodes but it is failing >>>>>>>>>> to >>>>>>>>>> start only on node cvtst1 ? >>>>>>>>>> >>>>>>>>>> look at the following page to make rgmanager more verbose. It 'll >>>>>>>>>> help debug.... 
>>>>>>>>>> >>>>>>>>>> http://sources.redhat.com/cluster/wiki/RGManager >>>>>>>>>> >>>>>>>>>> at Logging Configuration section >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> 2009/9/29 Paras pradhan : >>>>>>>>>> >>>>>>>>>>> Brem, >>>>>>>>>>> >>>>>>>>>>> When I try to restart rgmanager on all the nodes, this time i do not >>>>>>>>>>> see rgmanager running on the first node. But I do see on other 2 >>>>>>>>>>> nodes. >>>>>>>>>>> >>>>>>>>>>> Log on the first node: >>>>>>>>>>> >>>>>>>>>>> Sep 28 18:13:58 cvtst1 clurgmgrd[24099]: Resource Group >>>>>>>>>>> Manager Starting >>>>>>>>>>> Sep 28 18:17:29 cvtst1 rgmanager: [24627]: Shutting down >>>>>>>>>>> Cluster Service Manager... >>>>>>>>>>> Sep 28 18:17:29 cvtst1 clurgmgrd[24099]: Shutting down >>>>>>>>>>> Sep 28 18:17:39 cvtst1 clurgmgrd[24099]: Shutdown complete, >>>>>>>>>>> exiting >>>>>>>>>>> Sep 28 18:17:39 cvtst1 rgmanager: [24627]: Cluster Service >>>>>>>>>>> Manager is stopped. >>>>>>>>>>> Sep 28 18:17:40 cvtst1 clurgmgrd[24679]: Resource Group >>>>>>>>>>> Manager Starting >>>>>>>>>>> >>>>>>>>>>> - >>>>>>>>>>> It seems service is running , but I do not see rgmanger running >>>>>>>>>>> using clustat >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> Don't know what is going on. >>>>>>>>>>> >>>>>>>>>>> Thanks >>>>>>>>>>> Paras. >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On Mon, Sep 28, 2009 at 5:46 PM, brem belguebli >>>>>>>>>>> wrote: >>>>>>>>>>> >>>>>>>>>>>> Paras, >>>>>>>>>>>> >>>>>>>>>>>> Another thing, it would have been more interesting to have a start >>>>>>>>>>>> DEBUG not a stop. >>>>>>>>>>>> >>>>>>>>>>>> That's why I was asking you to first stop the vm manually on all >>>>>>>>>>>> your >>>>>>>>>>>> nodes, stop eventually rgmanager on all the nodes to reset the >>>>>>>>>>>> potential wrong states you may have, restart rgmanager. >>>>>>>>>>>> >>>>>>>>>>>> If your VM is configured to autostart, this will make it start. >>>>>>>>>>>> >>>>>>>>>>>> It should normally fail (as it does now). Send out your newly >>>>>>>>>>>> created >>>>>>>>>>>> DEBUG file. >>>>>>>>>>>> >>>>>>>>>>>> 2009/9/29 brem belguebli : >>>>>>>>>>>> >>>>>>>>>>>>> Hi Paras, >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> I don't know the xen/cluster combination well, but if I do >>>>>>>>>>>>> remember >>>>>>>>>>>>> well, I think I've read somewhere that when using xen you have to >>>>>>>>>>>>> declare the use_virsh=0 key in the VM definition in the >>>>>>>>>>>>> cluster.conf. >>>>>>>>>>>>> >>>>>>>>>>>>> This would make rgmanager use xm commands instead of virsh >>>>>>>>>>>>> The DEBUG output shows clearly that you are using virsh to manage >>>>>>>>>>>>> your >>>>>>>>>>>>> VM instead of xm commands. >>>>>>>>>>>>> Check out the RH docs about virtualization >>>>>>>>>>>>> >>>>>>>>>>>>> I'm not a 100% sure about that, I may be completely wrong. >>>>>>>>>>>>> >>>>>>>>>>>>> Brem >>>>>>>>>>>>> >>>>>>>>>>>>> 2009/9/28 Paras pradhan : >>>>>>>>>>>>> >>>>>>>>>>>>>> The only thing I noticed is the message after stopping the vm >>>>>>>>>>>>>> using xm >>>>>>>>>>>>>> in all nodes and starting using clusvcadm is >>>>>>>>>>>>>> >>>>>>>>>>>>>> "Virtual machine guest1 is blocked" >>>>>>>>>>>>>> >>>>>>>>>>>>>> The whole DEBUG file is attached. >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> Thanks >>>>>>>>>>>>>> Paras. 
>>>>>>>>>>>>>> >>>>>>>>>>>>>> On Fri, Sep 25, 2009 at 5:53 PM, brem belguebli >>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> There's a problem with the script that is called by rgmanager to >>>>>>>>>>>>>>> start >>>>>>>>>>>>>>> the VM, I don't know what causes it >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> May be you should try something like : >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 1) stop the VM on all nodes with xm commands >>>>>>>>>>>>>>> 2) edit the /usr/share/cluster/vm.sh script and add the >>>>>>>>>>>>>>> following >>>>>>>>>>>>>>> lines (after the #!/bin/bash ): >>>>>>>>>>>>>>> exec >/tmp/DEBUG 2>&1 >>>>>>>>>>>>>>> set -x >>>>>>>>>>>>>>> 3) start the VM with clusvcadm -e vm:guest1 >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> It should fail as it did before. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> edit the the /tmp/DEBUG file and you will be able to see where >>>>>>>>>>>>>>> it >>>>>>>>>>>>>>> fails (it may generate a lot of debug) >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 4) remove the debug lines from /usr/share/cluster/vm.sh >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Post the DEBUG file if you're not able to see where it fails. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Brem >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 2009/9/26 Paras pradhan : >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> No I am not manually starting not using automatic init scripts. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I started the vm using: clusvcadm -e vm:guest1 >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I have just stopped using clusvcadm -s vm:guest1. For few >>>>>>>>>>>>>>>> seconds it >>>>>>>>>>>>>>>> says guest1 started . But after a while I can see the guest1 on >>>>>>>>>>>>>>>> all >>>>>>>>>>>>>>>> three nodes. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> clustat says: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Service Name Owner >>>>>>>>>>>>>>>> (Last) >>>>>>>>>>>>>>>> State >>>>>>>>>>>>>>>> ------- ---- ----- >>>>>>>>>>>>>>>> ------ >>>>>>>>>>>>>>>> ----- >>>>>>>>>>>>>>>> vm:guest1 (none) >>>>>>>>>>>>>>>> stopped >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> But I can see the vm from xm li. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> This is what I can see from the log: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Sep 25 17:19:01 cvtst1 clurgmgrd[4298]: start on vm >>>>>>>>>>>>>>>> "guest1" >>>>>>>>>>>>>>>> returned 1 (generic error) >>>>>>>>>>>>>>>> Sep 25 17:19:01 cvtst1 clurgmgrd[4298]: #68: Failed >>>>>>>>>>>>>>>> to start >>>>>>>>>>>>>>>> vm:guest1; return value: 1 >>>>>>>>>>>>>>>> Sep 25 17:19:01 cvtst1 clurgmgrd[4298]: Stopping >>>>>>>>>>>>>>>> service vm:guest1 >>>>>>>>>>>>>>>> Sep 25 17:19:02 cvtst1 clurgmgrd[4298]: Service >>>>>>>>>>>>>>>> vm:guest1 is >>>>>>>>>>>>>>>> recovering >>>>>>>>>>>>>>>> Sep 25 17:19:15 cvtst1 clurgmgrd[4298]: Recovering >>>>>>>>>>>>>>>> failed >>>>>>>>>>>>>>>> service vm:guest1 >>>>>>>>>>>>>>>> Sep 25 17:19:16 cvtst1 clurgmgrd[4298]: start on vm >>>>>>>>>>>>>>>> "guest1" >>>>>>>>>>>>>>>> returned 1 (generic error) >>>>>>>>>>>>>>>> Sep 25 17:19:16 cvtst1 clurgmgrd[4298]: #68: Failed >>>>>>>>>>>>>>>> to start >>>>>>>>>>>>>>>> vm:guest1; return value: 1 >>>>>>>>>>>>>>>> Sep 25 17:19:16 cvtst1 clurgmgrd[4298]: Stopping >>>>>>>>>>>>>>>> service vm:guest1 >>>>>>>>>>>>>>>> Sep 25 17:19:17 cvtst1 clurgmgrd[4298]: Service >>>>>>>>>>>>>>>> vm:guest1 is >>>>>>>>>>>>>>>> recovering >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Paras. 
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On Fri, Sep 25, 2009 at 5:07 PM, brem belguebli >>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Have you started your VM via rgmanager (clusvcadm -e >>>>>>>>>>>>>>>>> vm:guest1) or >>>>>>>>>>>>>>>>> using xm commands out of cluster control (or maybe a thru an >>>>>>>>>>>>>>>>> automatic init script ?) >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> When clustered, you should never be starting services >>>>>>>>>>>>>>>>> (manually or >>>>>>>>>>>>>>>>> thru automatic init script) out of cluster control >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> The thing would be to stop your vm on all the nodes with the >>>>>>>>>>>>>>>>> adequate >>>>>>>>>>>>>>>>> xm command (not using xen myself) and try to start it with >>>>>>>>>>>>>>>>> clusvcadm. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Then see if it is started on all nodes (send clustat output) >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> 2009/9/25 Paras pradhan : >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Ok. Please see below. my vm is running on all nodes though >>>>>>>>>>>>>>>>>> clustat >>>>>>>>>>>>>>>>>> says it is stopped. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>> [root at cvtst1 ~]# clustat >>>>>>>>>>>>>>>>>> Cluster Status for test @ Fri Sep 25 16:52:34 2009 >>>>>>>>>>>>>>>>>> Member Status: Quorate >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Member Name >>>>>>>>>>>>>>>>>> ID Status >>>>>>>>>>>>>>>>>> ------ ---- >>>>>>>>>>>>>>>>>> ---- ------ >>>>>>>>>>>>>>>>>> cvtst2 1 >>>>>>>>>>>>>>>>>> Online, rgmanager >>>>>>>>>>>>>>>>>> cvtst1 2 >>>>>>>>>>>>>>>>>> Online, >>>>>>>>>>>>>>>>>> Local, rgmanager >>>>>>>>>>>>>>>>>> cvtst3 3 >>>>>>>>>>>>>>>>>> Online, rgmanager >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Service Name >>>>>>>>>>>>>>>>>> Owner (Last) >>>>>>>>>>>>>>>>>> State >>>>>>>>>>>>>>>>>> ------- ---- >>>>>>>>>>>>>>>>>> ----- ------ >>>>>>>>>>>>>>>>>> ----- >>>>>>>>>>>>>>>>>> vm:guest1 >>>>>>>>>>>>>>>>>> (none) >>>>>>>>>>>>>>>>>> stopped >>>>>>>>>>>>>>>>>> [root at cvtst1 ~]# >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> --- >>>>>>>>>>>>>>>>>> o/p of xm li on cvtst1 >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>> [root at cvtst1 ~]# xm li >>>>>>>>>>>>>>>>>> Name ID Mem(MiB) VCPUs >>>>>>>>>>>>>>>>>> State Time(s) >>>>>>>>>>>>>>>>>> Domain-0 0 3470 2 >>>>>>>>>>>>>>>>>> r----- 28939.4 >>>>>>>>>>>>>>>>>> guest1 7 511 1 >>>>>>>>>>>>>>>>>> -b---- 7727.8 >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> o/p of xm li on cvtst2 >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>> [root at cvtst2 ~]# xm li >>>>>>>>>>>>>>>>>> Name ID Mem(MiB) VCPUs >>>>>>>>>>>>>>>>>> State Time(s) >>>>>>>>>>>>>>>>>> Domain-0 0 3470 2 >>>>>>>>>>>>>>>>>> r----- 31558.9 >>>>>>>>>>>>>>>>>> guest1 21 511 1 >>>>>>>>>>>>>>>>>> -b---- 7558.2 >>>>>>>>>>>>>>>>>> --- >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Thanks >>>>>>>>>>>>>>>>>> Paras. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> On Fri, Sep 25, 2009 at 4:22 PM, brem belguebli >>>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> It looks like no. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> can you send an output of clustat of when the VM is running >>>>>>>>>>>>>>>>>>> on >>>>>>>>>>>>>>>>>>> multiple nodes at the same time? >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> And by the way, another one after having stopped (clusvcadm >>>>>>>>>>>>>>>>>>> -s vm:guest1) ? 
>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> 2009/9/25 Paras pradhan : >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Anyone having issue as mine? Virtual machine service is not >>>>>>>>>>>>>>>>>>>> being >>>>>>>>>>>>>>>>>>>> properly handled by the cluster. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Thanks >>>>>>>>>>>>>>>>>>>> Paras. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> On Mon, Sep 21, 2009 at 9:55 AM, Paras pradhan >>>>>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> Ok.. here is my cluster.conf file >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>>>> [root at cvtst1 cluster]# more cluster.conf >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> post_join_delay="3"/> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> votes="1"> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> votes="1"> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> votes="1"> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> nofailback="0" ordered="1" restricted="0"> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> name="cvtst2" priority="3"/> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> name="cvtst1" priority="1"/> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> name="cvtst3" priority="2"/> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> exclusive="0" max_restarts="0" >>>>>>>>>>>>>>>>>>>>> name="guest1" path="/vms" recovery="r >>>>>>>>>>>>>>>>>>>>> estart" restart_expire_time="0"/> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> [root at cvtst1 cluster]# >>>>>>>>>>>>>>>>>>>>> ------ >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> Thanks! >>>>>>>>>>>>>>>>>>>>> Paras. >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> On Sun, Sep 20, 2009 at 9:44 AM, Volker Dormeyer >>>>>>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> On Fri, Sep 18, 2009 at 05:08:57PM -0500, >>>>>>>>>>>>>>>>>>>>>> Paras pradhan wrote: >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> I am using cluster suite for HA of xen virtual machines. >>>>>>>>>>>>>>>>>>>>>>> Now I am >>>>>>>>>>>>>>>>>>>>>>> having another problem. When I start the my xen vm in >>>>>>>>>>>>>>>>>>>>>>> one node, it >>>>>>>>>>>>>>>>>>>>>>> also starts on other nodes. Which daemon controls this? >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> This is usually done bei clurgmgrd (which is part of the >>>>>>>>>>>>>>>>>>>>>> rgmanager >>>>>>>>>>>>>>>>>>>>>> package). To me, this sounds like a configuration >>>>>>>>>>>>>>>>>>>>>> problem. Maybe, >>>>>>>>>>>>>>>>>>>>>> you can post your cluster.conf? 
>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> Regards, >>>>>>>>>>>>>>>>>>>>>> Volker >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>>>>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>>>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>> -- >>>>>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>> -- >>>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>> -- >>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> -- >>>>>>>>>> Linux-cluster mailing list >>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>> >>>>>>>>>> >>>>>>>>> -- >>>>>>>>> Linux-cluster mailing list >>>>>>>>> Linux-cluster at redhat.com >>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>> >>>>>>>>> >>>>>>>> -- >>>>>>>> Linux-cluster mailing list >>>>>>>> Linux-cluster at redhat.com >>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>> >>>>>>>> >>>>>>> -- >>>>>>> Linux-cluster mailing list >>>>>>> Linux-cluster at redhat.com >>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>> >>>>>>> >>>>>> -- >>>>>> Linux-cluster mailing list >>>>>> Linux-cluster at redhat.com >>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>> >>>>>> >>>>> -- >>>>> Linux-cluster mailing list >>>>> Linux-cluster at redhat.com >>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>> >>>> -- >>>> Linux-cluster mailing list >>>> Linux-cluster at redhat.com >>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>> >>> -- >>> Linux-cluster mailing list >>> Linux-cluster at redhat.com >>> https://www.redhat.com/mailman/listinfo/linux-cluster >>> >> -- >> Linux-cluster mailing list >> 
Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- - Daniela Anzellotti ------------------------------------ INFN Roma - tel.: +39.06.49914282 - fax: +39.06.490354 e-mail: daniela.anzellotti at roma1.infn.it --------------------------------------------------------- From jakov.sosic at srce.hr Mon Oct 5 13:06:59 2009 From: jakov.sosic at srce.hr (Jakov Sosic) Date: Mon, 5 Oct 2009 15:06:59 +0200 Subject: [Linux-cluster] vm.sh with and without virsh In-Reply-To: <2fc5f090910050510wd514a13m8739c0920938c3c0@mail.gmail.com> References: <4AC88EDF.8040408@io-consulting.net> <4AC904FF.5040507@fi.upm.es> <4AC924C0.10108@io-consulting.net> <20091005102911.06881491@nb-jsosic> <2fc5f090910050510wd514a13m8739c0920938c3c0@mail.gmail.com> Message-ID: <20091005150659.3b9496cf@pc-jsosic.srce.hr> On Mon, 5 Oct 2009 09:10:11 -0300 Edson Marquezani Filho wrote: > I would like to hear from those who have some experience with vm.sh: > can it be used as a resource on a service configuration for Rgmanager? > I say this because all examples I have seen use it as a separated > statement on cluster.conf, not within a service declaration. But, in > order to put VMs up, I need to satisfy some requisites first, like LVs > activations on the node. > > I wrote a minimal LSB-compliant script to be used as a service script, > and handle VMs, then I configure a service for each VM I have, with a > script and lvm-cluster resources. But, I'm considering to use vm.sh > instead. > > Can you explain to me how it works, and how do you use it? Currently I use in my config. I use CLVM, so that LVM uses clustered locking, so I don't need to activate the LV's, it's done automatically for me. I haven't tried inside of a service... -- | Jakov Sosic | ICQ: 28410271 | PGP: 0x965CAE2D | ================================================================= | start fighting cancer -> http://www.worldcommunitygrid.org/ | From lhh at redhat.com Mon Oct 5 13:47:54 2009 From: lhh at redhat.com (Lon Hohberger) Date: Mon, 05 Oct 2009 09:47:54 -0400 Subject: [Linux-cluster] vm.sh with and without virsh In-Reply-To: <4AC924C0.10108@io-consulting.net> References: <4AC88EDF.8040408@io-consulting.net> <4AC904FF.5040507@fi.upm.es> <4AC924C0.10108@io-consulting.net> Message-ID: <1254750474.14760.4138.camel@localhost.localdomain> On Mon, 2009-10-05 at 00:42 +0200, jr wrote: > Miguel Sanchez wrote: > >> https://www.redhat.com/mailman/listinfo/linux-cluster > > Use in your cluster configuration file. It > > will use xm instead of virsh. > > > > Hello Miguel, > I'm aware of that, however i would prefer to use virsh instead of xm > especially due to the asynchronous nature of xm and the problems (with > failed migrations for example) this creates.. > Johannes Assuming you have the most current vm.sh (which unbreaks "path" attribute support for 'xm' mode), there are two problems: 1. 'Virsh' does not have a --path option, so, in order for your path (/cluster/XenDomains) to work, we need to add path searching to the vm.sh file and do manual searching for [name].xml in the specified pathspec. This isn't hard to do; I have code to split up the path and check for the files, but it's not integrated yet. However... 2. 'virsh' (and/or libvirt) does not support loading Xen domain configuration files from any location except /etc/xen as far as I know. 
This is because virsh doesn't actually know about the underlying config format; it just sends requests to libvirtd via libvirt API. So, even if we solve (1), you will have to generate XML files for all your domains. So, the current solution as it exists today is: * Generate XML files for your domains, and * Use 'xmlfile' parameter[1]: -- Lon [1] Assumes you are using STABLE3 branch. From lhh at redhat.com Mon Oct 5 13:51:36 2009 From: lhh at redhat.com (Lon Hohberger) Date: Mon, 05 Oct 2009 09:51:36 -0400 Subject: [Linux-cluster] vm.sh with and without virsh In-Reply-To: <2fc5f090910050510wd514a13m8739c0920938c3c0@mail.gmail.com> References: <4AC88EDF.8040408@io-consulting.net> <4AC904FF.5040507@fi.upm.es> <4AC924C0.10108@io-consulting.net> <20091005102911.06881491@nb-jsosic> <2fc5f090910050510wd514a13m8739c0920938c3c0@mail.gmail.com> Message-ID: <1254750696.14760.4151.camel@localhost.localdomain> On Mon, 2009-10-05 at 09:10 -0300, Edson Marquezani Filho wrote: > > > > What distribution are you using, and what is the version of rgmanager & > > cman? > > > > I'm interested in testing virsh too, because I also noticed those > > problems with xm (stalling while migrating). I'm on RHEL 5.3. > > > > I would like to hear from those who have some experience with vm.sh: > can it be used as a resource on a service configuration for Rgmanager? > I say this because all examples I have seen use it as a separated > statement on cluster.conf, not within a service declaration. But, in > order to put VMs up, I need to satisfy some requisites first, like LVs > activations on the node. You can use as a child of , but you can't migrate them if you do. -- Lon From brem.belguebli at gmail.com Mon Oct 5 14:10:12 2009 From: brem.belguebli at gmail.com (brem belguebli) Date: Mon, 5 Oct 2009 16:10:12 +0200 Subject: [Linux-cluster] openais issue In-Reply-To: <4AC9E445.3050702@roma1.infn.it> References: <8b711df40909161052i41f9c007na85cf2ef2919ae0f@mail.gmail.com> <29ae894c0909290951v11a958e2k3a1aadce7f3b88e7@mail.gmail.com> <8b711df40909291337n2f26908dt363944c6238eb9f5@mail.gmail.com> <29ae894c0909291344l49a2a810t33582eb6c3932810@mail.gmail.com> <8b711df40909291354w55f92097wcdef691d0b239dee@mail.gmail.com> <4AC2986F.8050100@io-consulting.net> <8b711df40909300749i2af6a711v2f866c55a046a388@mail.gmail.com> <29ae894c0909300802oa8d72c5k892115a3e2f67db9@mail.gmail.com> <8b711df40909300811q1724aa68hfea589bbb32b4ce5@mail.gmail.com> <4AC9E445.3050702@roma1.infn.it> Message-ID: <29ae894c0910050710m3a8c85cyeb6215267b8d0c39@mail.gmail.com> Good news, the use_virsh=0 parameter is something that I have read somewhere. I don't know if it was due to a bug, or anything else, and if it is corrected. As I said to Paras, I have no expertise on Xen setups. Brem 2009/10/5, Daniela Anzellotti : > Hi all, > > I had a problem similar to Paras's one today: yum updated the following rpms > last week and today (I had to restart the cluster) the cluster was not able > to start vm: services. > > Oct 02 05:31:05 Updated: openais-0.80.6-8.el5.x86_64 > Oct 02 05:31:07 Updated: cman-2.0.115-1.el5.x86_64 > Oct 02 05:31:10 Updated: rgmanager-2.0.52-1.el5.x86_64 > > Oct 03 04:03:12 Updated: xen-libs-3.0.3-94.el5_4.1.x86_64 > Oct 03 04:03:12 Updated: xen-libs-3.0.3-94.el5_4.1.i386 > Oct 03 04:03:16 Updated: xen-3.0.3-94.el5_4.1.x86_64 > > > So, after checked the vm.sh script, I add the declaration use_virsh="0" in > the VM definition in the cluster.conf (as suggested by Brem, thanks!) and > everything is now working again. 
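(To make the fix concrete, since the list archive strips the markup: the change goes on the vm resource line inside the rm section of cluster.conf, roughly as below. The name, path and other attribute values here are only placeholders, not anyone's actual configuration:

    <vm name="guest1" path="/etc/xen" exclusive="0" recovery="restart" use_virsh="0"/>

With use_virsh="0", rgmanager's vm.sh drives the guest with xm commands instead of virsh, as discussed earlier in the thread.)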
> > > BTW I didn't understand if the problem was caused by the new XEN version or > the new openais one, thus I disabled automatic updates for both. > > I hope I'll not have any other bad surprise... > > Thank you, > cheers, > Daniela > > > > Paras pradhan wrote: > > Yes this is very strange. I don't know what to do now. May be re > > create the cluster? But not a good solution actually. > > > > Packages : > > > > Kernel: kernel-xen-2.6.18-164.el5 > > OS: Full updated of CentOS 5.3 except CMAN downgraded to cman-2.0.98-1.el5 > > > > Other packages related to cluster suite: > > > > rgmanager-2.0.52-1.el5.centos > > cman-2.0.98-1.el5 > > xen-3.0.3-80.el5_3.3 > > xen-libs-3.0.3-80.el5_3.3 > > kmod-gfs-xen-0.1.31-3.el5_3.1 > > kmod-gfs-xen-0.1.31-3.el5_3.1 > > kmod-gfs-0.1.31-3.el5_3.1 > > gfs-utils-0.1.18-1.el5 > > gfs2-utils-0.1.62-1.el5 > > lvm2-2.02.40-6.el5 > > lvm2-cluster-2.02.40-7.el5 > > openais-0.80.3-22.el5_3.9 > > > > Thanks! > > Paras. > > > > > > > > > > On Wed, Sep 30, 2009 at 10:02 AM, brem belguebli > > wrote: > > > > > Hi Paras, > > > > > > Your cluster.conf file seems correct. If it is not a ntp issue, I > > > don't see anything except a bug that causes this, or some prerequisite > > > that is not respected. > > > > > > May be you could post the versions (os, kernel, packages etc...) you > > > are using, someone may have hit the same issue with your versions. > > > > > > Brem > > > > > > 2009/9/30, Paras pradhan : > > > > > > > All of the nodes are synced with ntp server. So this is not the case > with me. > > > > > > > > Thanks > > > > Paras. > > > > > > > > On Tue, Sep 29, 2009 at 6:29 PM, Johannes Ru?ek > > > > wrote: > > > > > > > > > make sure the time on the nodes is in sync, apparently when a node > has too > > > > > much offset, you won't see rgmanager (even though the process is > running). > > > > > this happened today and setting the time fixed it for me. afaicr > there was > > > > > no sign of this in the logs though. > > > > > johannes > > > > > > > > > > Paras pradhan schrieb: > > > > > > > > > > > I don't see rgmanager . > > > > > > > > > > > > Here is the o/p from clustat > > > > > > > > > > > > [root at cvtst1 cluster]# clustat > > > > > > Cluster Status for test @ Tue Sep 29 15:53:33 2009 > > > > > > Member Status: Quorate > > > > > > > > > > > > Member Name > ID > > > > > > Status > > > > > > ------ ---- > ---- > > > > > > ------ > > > > > > cvtst2 1 > Online > > > > > > cvtst1 2 > Online, > > > > > > Local > > > > > > cvtst3 3 > Online > > > > > > > > > > > > > > > > > > Thanks > > > > > > Paras. > > > > > > > > > > > > On Tue, Sep 29, 2009 at 3:44 PM, brem belguebli > > > > > > wrote: > > > > > > > > > > > > > > > > > > > It looks correct, rgmanager seems to start on all nodes > > > > > > > > > > > > > > what gives you clustat ? > > > > > > > > > > > > > > If rgmanager doesn't show, check out the logs something may have > gone > > > > > > > wrong. > > > > > > > > > > > > > > > > > > > > > 2009/9/29 Paras pradhan : > > > > > > > > > > > > > > > > > > > > > > Change to 7 and i got this log > > > > > > > > > > > > > > > > Sep 29 15:33:50 cvtst1 rgmanager: [23295]: Shutting > down > > > > > > > > Cluster Service Manager... 
> > > > > > > > Sep 29 15:33:50 cvtst1 clurgmgrd[22869]: Shutting > down > > > > > > > > Sep 29 15:33:50 cvtst1 clurgmgrd[22869]: Shutting > down > > > > > > > > Sep 29 15:33:50 cvtst1 clurgmgrd[22869]: Shutdown > complete, > > > > > > > > exiting > > > > > > > > Sep 29 15:33:50 cvtst1 rgmanager: [23295]: Cluster > Service > > > > > > > > Manager is stopped. > > > > > > > > Sep 29 15:33:51 cvtst1 clurgmgrd[23324]: Resource > Group > > > > > > > > Manager Starting > > > > > > > > Sep 29 15:33:51 cvtst1 clurgmgrd[23324]: Loading > Service Data > > > > > > > > Sep 29 15:33:51 cvtst1 clurgmgrd[23324]: Loading > Resource Rules > > > > > > > > Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: 21 rules > loaded > > > > > > > > Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: Building > Resource Trees > > > > > > > > Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: 0 resources > defined > > > > > > > > Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: Loading > Failover > > > > > > > > Domains > > > > > > > > Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: 1 domains > defined > > > > > > > > Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: 1 events > defined > > > > > > > > Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: Initializing > Services > > > > > > > > Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: Services > Initialized > > > > > > > > Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: Event: Port > Opened > > > > > > > > Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: State change: > Local UP > > > > > > > > Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: State change: > cvtst2 UP > > > > > > > > Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: State change: > cvtst3 UP > > > > > > > > Sep 29 15:33:57 cvtst1 clurgmgrd[23324]: Event (1:2:1) > Processed > > > > > > > > Sep 29 15:33:57 cvtst1 clurgmgrd[23324]: Event (0:1:1) > Processed > > > > > > > > Sep 29 15:33:57 cvtst1 clurgmgrd[23324]: Event (0:3:1) > Processed > > > > > > > > Sep 29 15:34:02 cvtst1 clurgmgrd[23324]: 3 events > processed > > > > > > > > > > > > > > > > > > > > > > > > Anything unusual here? > > > > > > > > > > > > > > > > Paras. > > > > > > > > > > > > > > > > On Tue, Sep 29, 2009 at 11:51 AM, brem belguebli > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > I use log_level=7 to have more debugging info. > > > > > > > > > > > > > > > > > > It seems 4 is not enough. > > > > > > > > > > > > > > > > > > Brem > > > > > > > > > > > > > > > > > > > > > > > > > > > 2009/9/29, Paras pradhan : > > > > > > > > > > > > > > > > > > > > > > > > > > > > Withe log_level of 3 I got only this > > > > > > > > > > > > > > > > > > > > Sep 29 10:31:31 cvtst1 rgmanager: [7170]: > Shutting down > > > > > > > > > > Cluster Service Manager... > > > > > > > > > > Sep 29 10:31:31 cvtst1 clurgmgrd[6673]: Shutting > down > > > > > > > > > > Sep 29 10:31:41 cvtst1 clurgmgrd[6673]: Shutdown > complete, > > > > > > > > > > exiting > > > > > > > > > > Sep 29 10:31:41 cvtst1 rgmanager: [7170]: Cluster > Service > > > > > > > > > > Manager is stopped. > > > > > > > > > > Sep 29 10:31:42 cvtst1 clurgmgrd[7224]: Resource > Group > > > > > > > > > > Manager Starting > > > > > > > > > > Sep 29 10:39:06 cvtst1 rgmanager: [10327]: > Shutting down > > > > > > > > > > Cluster Service Manager... > > > > > > > > > > Sep 29 10:39:16 cvtst1 rgmanager: [10327]: > Cluster Service > > > > > > > > > > Manager is stopped. 
> > > > > > > > > > Sep 29 10:39:16 cvtst1 clurgmgrd[10380]: Resource > Group > > > > > > > > > > Manager Starting > > > > > > > > > > Sep 29 10:39:52 cvtst1 clurgmgrd[10380]: Member 1 > shutting > > > > > > > > > > down > > > > > > > > > > > > > > > > > > > > I do not know what the last line means. > > > > > > > > > > > > > > > > > > > > rgmanager version I am running is: > > > > > > > > > > rgmanager-2.0.52-1.el5.centos > > > > > > > > > > > > > > > > > > > > I don't what has gone wrong. > > > > > > > > > > > > > > > > > > > > Thanks > > > > > > > > > > Paras. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Mon, Sep 28, 2009 at 6:41 PM, brem belguebli > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > you mean it stopped successfully on all the nodes but it > is failing > > > > > > > > > > > to > > > > > > > > > > > start only on node cvtst1 ? > > > > > > > > > > > > > > > > > > > > > > look at the following page to make rgmanager more > verbose. It 'll > > > > > > > > > > > help debug.... > > > > > > > > > > > > > > > > > > > > > > > http://sources.redhat.com/cluster/wiki/RGManager > > > > > > > > > > > > > > > > > > > > > > at Logging Configuration section > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 2009/9/29 Paras pradhan : > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Brem, > > > > > > > > > > > > > > > > > > > > > > > > When I try to restart rgmanager on all the nodes, this > time i do not > > > > > > > > > > > > see rgmanager running on the first node. But I do see > on other 2 > > > > > > > > > > > > nodes. > > > > > > > > > > > > > > > > > > > > > > > > Log on the first node: > > > > > > > > > > > > > > > > > > > > > > > > Sep 28 18:13:58 cvtst1 clurgmgrd[24099]: > Resource Group > > > > > > > > > > > > Manager Starting > > > > > > > > > > > > Sep 28 18:17:29 cvtst1 rgmanager: [24627]: > Shutting down > > > > > > > > > > > > Cluster Service Manager... > > > > > > > > > > > > Sep 28 18:17:29 cvtst1 clurgmgrd[24099]: > Shutting down > > > > > > > > > > > > Sep 28 18:17:39 cvtst1 clurgmgrd[24099]: > Shutdown complete, > > > > > > > > > > > > exiting > > > > > > > > > > > > Sep 28 18:17:39 cvtst1 rgmanager: [24627]: > Cluster Service > > > > > > > > > > > > Manager is stopped. > > > > > > > > > > > > Sep 28 18:17:40 cvtst1 clurgmgrd[24679]: > Resource Group > > > > > > > > > > > > Manager Starting > > > > > > > > > > > > > > > > > > > > > > > > - > > > > > > > > > > > > It seems service is running , but I do not see > rgmanger running > > > > > > > > > > > > using clustat > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Don't know what is going on. > > > > > > > > > > > > > > > > > > > > > > > > Thanks > > > > > > > > > > > > Paras. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Mon, Sep 28, 2009 at 5:46 PM, brem belguebli > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Paras, > > > > > > > > > > > > > > > > > > > > > > > > > > Another thing, it would have been more interesting > to have a start > > > > > > > > > > > > > DEBUG not a stop. 
> > > > > > > > > > > > > > > > > > > > > > > > > > That's why I was asking you to first stop the vm > manually on all > > > > > > > > > > > > > your > > > > > > > > > > > > > nodes, stop eventually rgmanager on all the nodes to > reset the > > > > > > > > > > > > > potential wrong states you may have, restart > rgmanager. > > > > > > > > > > > > > > > > > > > > > > > > > > If your VM is configured to autostart, this will > make it start. > > > > > > > > > > > > > > > > > > > > > > > > > > It should normally fail (as it does now). Send out > your newly > > > > > > > > > > > > > created > > > > > > > > > > > > > DEBUG file. > > > > > > > > > > > > > > > > > > > > > > > > > > 2009/9/29 brem belguebli : > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Hi Paras, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I don't know the xen/cluster combination well, but > if I do > > > > > > > > > > > > > > remember > > > > > > > > > > > > > > well, I think I've read somewhere that when using > xen you have to > > > > > > > > > > > > > > declare the use_virsh=0 key in the VM definition > in the > > > > > > > > > > > > > > cluster.conf. > > > > > > > > > > > > > > > > > > > > > > > > > > > > This would make rgmanager use xm commands instead > of virsh > > > > > > > > > > > > > > The DEBUG output shows clearly that you are using > virsh to manage > > > > > > > > > > > > > > your > > > > > > > > > > > > > > VM instead of xm commands. > > > > > > > > > > > > > > Check out the RH docs about virtualization > > > > > > > > > > > > > > > > > > > > > > > > > > > > I'm not a 100% sure about that, I may be > completely wrong. > > > > > > > > > > > > > > > > > > > > > > > > > > > > Brem > > > > > > > > > > > > > > > > > > > > > > > > > > > > 2009/9/28 Paras pradhan : > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > The only thing I noticed is the message after > stopping the vm > > > > > > > > > > > > > > > using xm > > > > > > > > > > > > > > > in all nodes and starting using clusvcadm is > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > "Virtual machine guest1 is blocked" > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > The whole DEBUG file is attached. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thanks > > > > > > > > > > > > > > > Paras. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Fri, Sep 25, 2009 at 5:53 PM, brem belguebli > > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > There's a problem with the script that is > called by rgmanager to > > > > > > > > > > > > > > > > start > > > > > > > > > > > > > > > > the VM, I don't know what causes it > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > May be you should try something like : > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 1) stop the VM on all nodes with xm commands > > > > > > > > > > > > > > > > 2) edit the /usr/share/cluster/vm.sh script > and add the > > > > > > > > > > > > > > > > following > > > > > > > > > > > > > > > > lines (after the #!/bin/bash ): > > > > > > > > > > > > > > > > exec >/tmp/DEBUG 2>&1 > > > > > > > > > > > > > > > > set -x > > > > > > > > > > > > > > > > 3) start the VM with clusvcadm -e vm:guest1 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > It should fail as it did before. 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > edit the the /tmp/DEBUG file and you will be > able to see where > > > > > > > > > > > > > > > > it > > > > > > > > > > > > > > > > fails (it may generate a lot of debug) > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 4) remove the debug lines from > /usr/share/cluster/vm.sh > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Post the DEBUG file if you're not able to see > where it fails. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Brem > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 2009/9/26 Paras pradhan > : > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > No I am not manually starting not using > automatic init scripts. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I started the vm using: clusvcadm -e > vm:guest1 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I have just stopped using clusvcadm -s > vm:guest1. For few > > > > > > > > > > > > > > > > > seconds it > > > > > > > > > > > > > > > > > says guest1 started . But after a while I > can see the guest1 on > > > > > > > > > > > > > > > > > all > > > > > > > > > > > > > > > > > three nodes. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > clustat says: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Service Name > Owner > > > > > > > > > > > > > > > > > (Last) > > > > > > > > > > > > > > > > > State > > > > > > > > > > > > > > > > > ------- ---- > ----- > > > > > > > > > > > > > > > > > ------ > > > > > > > > > > > > > > > > > ----- > > > > > > > > > > > > > > > > > vm:guest1 > (none) > > > > > > > > > > > > > > > > > > stopped > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > But I can see the vm from xm li. 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > This is what I can see from the log: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Sep 25 17:19:01 cvtst1 clurgmgrd[4298]: > start on vm > > > > > > > > > > > > > > > > > "guest1" > > > > > > > > > > > > > > > > > returned 1 (generic error) > > > > > > > > > > > > > > > > > Sep 25 17:19:01 cvtst1 clurgmgrd[4298]: > #68: Failed > > > > > > > > > > > > > > > > > to start > > > > > > > > > > > > > > > > > vm:guest1; return value: 1 > > > > > > > > > > > > > > > > > Sep 25 17:19:01 cvtst1 clurgmgrd[4298]: > Stopping > > > > > > > > > > > > > > > > > service vm:guest1 > > > > > > > > > > > > > > > > > Sep 25 17:19:02 cvtst1 clurgmgrd[4298]: > Service > > > > > > > > > > > > > > > > > vm:guest1 is > > > > > > > > > > > > > > > > > recovering > > > > > > > > > > > > > > > > > Sep 25 17:19:15 cvtst1 clurgmgrd[4298]: > Recovering > > > > > > > > > > > > > > > > > failed > > > > > > > > > > > > > > > > > service vm:guest1 > > > > > > > > > > > > > > > > > Sep 25 17:19:16 cvtst1 clurgmgrd[4298]: > start on vm > > > > > > > > > > > > > > > > > "guest1" > > > > > > > > > > > > > > > > > returned 1 (generic error) > > > > > > > > > > > > > > > > > Sep 25 17:19:16 cvtst1 clurgmgrd[4298]: > #68: Failed > > > > > > > > > > > > > > > > > to start > > > > > > > > > > > > > > > > > vm:guest1; return value: 1 > > > > > > > > > > > > > > > > > Sep 25 17:19:16 cvtst1 clurgmgrd[4298]: > Stopping > > > > > > > > > > > > > > > > > service vm:guest1 > > > > > > > > > > > > > > > > > Sep 25 17:19:17 cvtst1 clurgmgrd[4298]: > Service > > > > > > > > > > > > > > > > > vm:guest1 is > > > > > > > > > > > > > > > > > recovering > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Paras. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Fri, Sep 25, 2009 at 5:07 PM, brem > belguebli > > > > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Have you started your VM via rgmanager > (clusvcadm -e > > > > > > > > > > > > > > > > > > vm:guest1) or > > > > > > > > > > > > > > > > > > using xm commands out of cluster control > (or maybe a thru an > > > > > > > > > > > > > > > > > > automatic init script ?) > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > When clustered, you should never be > starting services > > > > > > > > > > > > > > > > > > (manually or > > > > > > > > > > > > > > > > > > thru automatic init script) out of cluster > control > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > The thing would be to stop your vm on all > the nodes with the > > > > > > > > > > > > > > > > > > adequate > > > > > > > > > > > > > > > > > > xm command (not using xen myself) and try > to start it with > > > > > > > > > > > > > > > > > > clusvcadm. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Then see if it is started on all nodes > (send clustat output) > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 2009/9/25 Paras pradhan > : > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Ok. Please see below. 
my vm is running > on all nodes though > > > > > > > > > > > > > > > > > > > clustat > > > > > > > > > > > > > > > > > > > says it is stopped. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > > > > > > > > > > [root at cvtst1 ~]# clustat > > > > > > > > > > > > > > > > > > > Cluster Status for test @ Fri Sep 25 > 16:52:34 2009 > > > > > > > > > > > > > > > > > > > Member Status: Quorate > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Member Name > > > > > > > > > > > > > > > > > > > ID Status > > > > > > > > > > > > > > > > > > > ------ ---- > > > > > > > > > > > > > > > > > > > ---- ------ > > > > > > > > > > > > > > > > > > > cvtst2 > 1 > > > > > > > > > > > > > > > > > > > Online, rgmanager > > > > > > > > > > > > > > > > > > > cvtst1 > 2 > > > > > > > > > > > > > > > > > > > Online, > > > > > > > > > > > > > > > > > > > Local, rgmanager > > > > > > > > > > > > > > > > > > > cvtst3 > 3 > > > > > > > > > > > > > > > > > > > Online, rgmanager > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Service Name > > > > > > > > > > > > > > > > > > > Owner (Last) > > > > > > > > > > > > > > > > > > > > State > > > > > > > > > > > > > > > > > > > ------- ---- > > > > > > > > > > > > > > > > > > > ----- ------ > > > > > > > > > > > > > > > > > > > > ----- > > > > > > > > > > > > > > > > > > > vm:guest1 > > > > > > > > > > > > > > > > > > > (none) > > > > > > > > > > > > > > > > > > > > stopped > > > > > > > > > > > > > > > > > > > [root at cvtst1 ~]# > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > --- > > > > > > > > > > > > > > > > > > > o/p of xm li on cvtst1 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > > > > > > > > > > [root at cvtst1 ~]# xm li > > > > > > > > > > > > > > > > > > > Name > ID Mem(MiB) VCPUs > > > > > > > > > > > > > > > > > > > State Time(s) > > > > > > > > > > > > > > > > > > > Domain-0 > 0 3470 2 > > > > > > > > > > > > > > > > > > > r----- 28939.4 > > > > > > > > > > > > > > > > > > > guest1 > 7 511 1 > > > > > > > > > > > > > > > > > > > -b---- 7727.8 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > o/p of xm li on cvtst2 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > > > > > > > > > > [root at cvtst2 ~]# xm li > > > > > > > > > > > > > > > > > > > Name > ID Mem(MiB) VCPUs > > > > > > > > > > > > > > > > > > > State Time(s) > > > > > > > > > > > > > > > > > > > Domain-0 > 0 3470 2 > > > > > > > > > > > > > > > > > > > r----- 31558.9 > > > > > > > > > > > > > > > > > > > guest1 > 21 511 1 > > > > > > > > > > > > > > > > > > > -b---- 7558.2 > > > > > > > > > > > > > > > > > > > --- > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thanks > > > > > > > > > > > > > > > > > > > Paras. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Fri, Sep 25, 2009 at 4:22 PM, brem > belguebli > > > > > > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > It looks like no. 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > can you send an output of clustat of > when the VM is running > > > > > > > > > > > > > > > > > > > > on > > > > > > > > > > > > > > > > > > > > multiple nodes at the same time? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > And by the way, another one after > having stopped (clusvcadm > > > > > > > > > > > > > > > > > > > > -s vm:guest1) ? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 2009/9/25 Paras pradhan > : > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Anyone having issue as mine? Virtual > machine service is not > > > > > > > > > > > > > > > > > > > > > being > > > > > > > > > > > > > > > > > > > > > properly handled by the cluster. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thanks > > > > > > > > > > > > > > > > > > > > > Paras. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Mon, Sep 21, 2009 at 9:55 AM, > Paras pradhan > > > > > > > > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Ok.. here is my cluster.conf file > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > > > > > > > > > > > > > [root at cvtst1 cluster]# more > cluster.conf > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > config_version="9" name="test"> > > > > > > > > > > > > > > > > > > > > > > post_fail_delay="0" > > > > > > > > > > > > > > > > > > > > > > post_join_delay="3"/> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > name="cvtst2" nodeid="1" > > > > > > > > > > > > > > > > > > > > > > votes="1"> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > name="cvtst1" nodeid="2" > > > > > > > > > > > > > > > > > > > > > > votes="1"> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > name="cvtst3" nodeid="3" > > > > > > > > > > > > > > > > > > > > > > votes="1"> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > nofailback="0" ordered="1" > restricted="0"> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > name="cvtst2" priority="3"/> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > name="cvtst1" priority="1"/> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > name="cvtst3" priority="2"/> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > domain="myfd1" > > > > > > > > > > > > > > > > > > > > > > exclusive="0" 
max_restarts="0" > > > > > > > > > > > > > > > > > > > > > > name="guest1" path="/vms" > recovery="r > > > > > > > > > > > > > > > > > > > > > > estart" restart_expire_time="0"/> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > [root at cvtst1 cluster]# > > > > > > > > > > > > > > > > > > > > > > ------ > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thanks! > > > > > > > > > > > > > > > > > > > > > > Paras. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Sun, Sep 20, 2009 at 9:44 AM, > Volker Dormeyer > > > > > > > > > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Fri, Sep 18, 2009 at > 05:08:57PM -0500, > > > > > > > > > > > > > > > > > > > > > > > Paras pradhan > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I am using cluster suite for > HA of xen virtual machines. > > > > > > > > > > > > > > > > > > > > > > > > Now I am > > > > > > > > > > > > > > > > > > > > > > > > having another problem. When I > start the my xen vm in > > > > > > > > > > > > > > > > > > > > > > > > one node, it > > > > > > > > > > > > > > > > > > > > > > > > also starts on other nodes. > Which daemon controls this? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > This is usually done bei > clurgmgrd (which is part of the > > > > > > > > > > > > > > > > > > > > > > > rgmanager > > > > > > > > > > > > > > > > > > > > > > > package). To me, this sounds > like a configuration > > > > > > > > > > > > > > > > > > > > > > > problem. Maybe, > > > > > > > > > > > > > > > > > > > > > > > you can post your cluster.conf? 
> > > > > > > > > > > > > > > > > > -- > > > > > > > > > > > > Linux-cluster mailing list > > > > > > > > > > > > Linux-cluster at redhat.com > > > > > > > > > > > > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > > Linux-cluster mailing list > > > > > > > > > > > Linux-cluster at redhat.com > > > > > > > > > > > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > Linux-cluster mailing list > > > > > > > > > > Linux-cluster at redhat.com > > > > > > > > > > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > Linux-cluster mailing list > > > > > > > > > Linux-cluster at redhat.com > > > > > > > > > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > Linux-cluster mailing list > > > > > > > > Linux-cluster at redhat.com > > > > > > > > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > Linux-cluster mailing list > > > > > > > Linux-cluster at redhat.com > > > > > > > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > Linux-cluster mailing list > > > > > > Linux-cluster at redhat.com > > > > > > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > > > > > > > > > > > > > > -- > > > > > Linux-cluster mailing list > > > > > Linux-cluster at redhat.com > > > > > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > > > > > > > > > > > -- > > > > Linux-cluster mailing list > > > > Linux-cluster at redhat.com > > > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > > > > > > > > -- > > > Linux-cluster mailing list > > > Linux-cluster at redhat.com > > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > > > > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > > > -- > - Daniela Anzellotti ------------------------------------ > INFN Roma - tel.: +39.06.49914282 - fax: +39.06.490354 > e-mail: daniela.anzellotti at roma1.infn.it > --------------------------------------------------------- > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From brem.belguebli at gmail.com Mon Oct 5 14:43:21 2009 From: brem.belguebli at gmail.com (brem belguebli) Date: Mon, 5 Oct 2009 16:43:21 +0200 Subject: [Linux-cluster] vm.sh with and without virsh In-Reply-To: <1254750696.14760.4151.camel@localhost.localdomain> References: <4AC88EDF.8040408@io-consulting.net> <4AC904FF.5040507@fi.upm.es> <4AC924C0.10108@io-consulting.net> <20091005102911.06881491@nb-jsosic> <2fc5f090910050510wd514a13m8739c0920938c3c0@mail.gmail.com> <1254750696.14760.4151.camel@localhost.localdomain> Message-ID: <29ae894c0910050743xbf25d1brb0b2f2ce74180ebd@mail.gmail.com> Hi, To give an example of setup that did "surprisingly" work like a charm out of the box (RHEL 5.4 KVM) - 3 nodes cluster (RHEL 5.4 x86_64) - 2 50 GB SAN LUN's (partitionned p1= 100 MB, p2=49.9 GB) /dev/mpath/mpath4 (mpath4p1, mpath4p2) /dev/mpath/mpath5 (mpath5p1, mpath5p2) - 3 mirrored LV's 
lvolVM1, lvolVM2 and lvolVM3 on mpath4p2/mpath5p2 and mpath4p1 as mirrorlog - cmirror to maintain mirror log across the cluster LV's are activated "shared", ie active on all nodes, no exclusive activation being used. Each VM using a LV as virtual disk device (VM XML conf file): <-- for VM1 Each VM being defined in the cluster.conf with no hierarchical dependency on anything: Failover and live migration work fine VM's must be defined on all nodes (after creation on one node, copy the VM xml conf file to the other nodes and issue a virsh define /Path/to/the/xml file) The only thing that may look unsecure is the fact that the LV's are active on all the nodes, a problem could happen if someone manually started the VM's on some nodes while already active on another one. I'll try the setup with exclusive activation and check if live migration still works (I doubt that). Brem 2009/10/5, Lon Hohberger : > On Mon, 2009-10-05 at 09:10 -0300, Edson Marquezani Filho wrote: > > > > > > What distribution are you using, and what is the version of rgmanager & > > > cman? > > > > > > I'm interested in testing virsh too, because I also noticed those > > > problems with xm (stalling while migrating). I'm on RHEL 5.3. > > > > > > > I would like to hear from those who have some experience with vm.sh: > > can it be used as a resource on a service configuration for Rgmanager? > > I say this because all examples I have seen use it as a separated > > statement on cluster.conf, not within a service declaration. But, in > > order to put VMs up, I need to satisfy some requisites first, like LVs > > activations on the node. > > You can use as a child of , but you can't migrate them if > you do. > > -- Lon > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From edsonmarquezani at gmail.com Mon Oct 5 19:43:01 2009 From: edsonmarquezani at gmail.com (Edson Marquezani Filho) Date: Mon, 5 Oct 2009 16:43:01 -0300 Subject: [Linux-cluster] vm.sh with and without virsh In-Reply-To: <29ae894c0910050743xbf25d1brb0b2f2ce74180ebd@mail.gmail.com> References: <4AC88EDF.8040408@io-consulting.net> <4AC904FF.5040507@fi.upm.es> <4AC924C0.10108@io-consulting.net> <20091005102911.06881491@nb-jsosic> <2fc5f090910050510wd514a13m8739c0920938c3c0@mail.gmail.com> <1254750696.14760.4151.camel@localhost.localdomain> <29ae894c0910050743xbf25d1brb0b2f2ce74180ebd@mail.gmail.com> Message-ID: <2fc5f090910051243s3ab6e4b5k8bc64b4a7c4fbea9@mail.gmail.com> On Mon, Oct 5, 2009 at 11:43, brem belguebli wrote: > Hi, > > To give an example of setup that did "surprisingly" work like a charm > out of the box ?(RHEL 5.4 KVM) > > - ?3 nodes cluster (RHEL 5.4 x86_64) > - ?2 ?50 GB SAN LUN's (partitionned p1= 100 MB, p2=49.9 GB) > ? ?/dev/mpath/mpath4 (mpath4p1, mpath4p2) > ? ?/dev/mpath/mpath5 (mpath5p1, mpath5p2) > - ?3 mirrored LV's ?lvolVM1, lvolVM2 and lvolVM3 on mpath4p2/mpath5p2 > and mpath4p1 as mirrorlog I don't know about this mirroring feature. How does it work and why do you use it ? > - cmirror to maintain mirror log across the cluster > LV's are activated "shared", ie active on all nodes, ?no exclusive > activation being used. > > Each VM using a LV as virtual disk device (VM XML conf file): > > ? > ? ? ? <-- for VM1 > > Each VM being defined in the cluster.conf with no hierarchical > dependency on anything: > > > ? ? ? ? ? ? ? use_virsh="1"/> > ? ? ? ? ? ? ? use_virsh="1"/> > ? ? ? ? ? ? ? 
use_virsh="1"/> > > > Failover and live migration work fine I tought that live migration without any access control on LVs would cause some corruption on file systems. But, I guess that even without exclusive activation, I should use CLVM, should I ? > VM's must be defined on all nodes (after creation on one node, copy > the VM xml conf file to the other nodes and issue a virsh define > /Path/to/the/xml file) I'm not using virsh because I have just learned the old-school way to control VMs with xm. When I knew about that virsh tool, I had already modified config files manually. Would be better that I recreate all of them using libvirt infrastructure? > The only thing that may look unsecure is the fact that the LV's are > active on all the nodes, a problem could happen if someone manually > started the VM's on some nodes while already active on another one. That's the point who made me ask for help here sometime ago, and what more concerns me. Rafael Miranda told me about his lvm-cluster resource script. So, I developed a simple script, that performs start, stop, and status operations. For stop, it saves the VM to a stat file. For start, it either restores the VM if there is a stat file for ir, or creates it if there is not. Status just return sucess if the VM appears on xm list, or failure if not. Stats files should be saved on a GFS directory, mounted on both nodes. Then, I configure each VM as a service, with its lvm-cluster and script resources. So, relocating a "vm service" will look like a semi-live migration, if I can call like this. =) Actually, it saves the VM in shared directory and restores it on the other node in a little time, without reseting it. It will look just like the VM had stopped for a little time and came back. But now I'm thinking if I have tried to reinvent the whell. =) > I'll try the setup with exclusive activation and check if live > migration still works (I doubt that). > > Brem > What do you think about this? Thank you. From brem.belguebli at gmail.com Mon Oct 5 20:28:08 2009 From: brem.belguebli at gmail.com (brem belguebli) Date: Mon, 5 Oct 2009 22:28:08 +0200 Subject: [Linux-cluster] vm.sh with and without virsh In-Reply-To: <2fc5f090910051243s3ab6e4b5k8bc64b4a7c4fbea9@mail.gmail.com> References: <4AC88EDF.8040408@io-consulting.net> <4AC904FF.5040507@fi.upm.es> <4AC924C0.10108@io-consulting.net> <20091005102911.06881491@nb-jsosic> <2fc5f090910050510wd514a13m8739c0920938c3c0@mail.gmail.com> <1254750696.14760.4151.camel@localhost.localdomain> <29ae894c0910050743xbf25d1brb0b2f2ce74180ebd@mail.gmail.com> <2fc5f090910051243s3ab6e4b5k8bc64b4a7c4fbea9@mail.gmail.com> Message-ID: <29ae894c0910051328k96b4eddt1f6ff88a6dc18d85@mail.gmail.com> 2009/10/5 Edson Marquezani Filho : > On Mon, Oct 5, 2009 at 11:43, brem belguebli wrote: >> Hi, >> >> To give an example of setup that did "surprisingly" work like a charm >> out of the box ?(RHEL 5.4 KVM) >> >> - ?3 nodes cluster (RHEL 5.4 x86_64) >> - ?2 ?50 GB SAN LUN's (partitionned p1= 100 MB, p2=49.9 GB) >> ? ?/dev/mpath/mpath4 (mpath4p1, mpath4p2) >> ? ?/dev/mpath/mpath5 (mpath5p1, mpath5p2) >> - ?3 mirrored LV's ?lvolVM1, lvolVM2 and lvolVM3 on mpath4p2/mpath5p2 >> and mpath4p1 as mirrorlog > > I don't know about this mirroring feature. How does it work and why do > you use it ? 
> I'm trying to build a setup across 2 sites which doesn't bring nothing very important to the current topic, it's just my setup ;-) >> - cmirror to maintain mirror log across the cluster >> LV's are activated "shared", ie active on all nodes, ?no exclusive >> activation being used. >> >> Each VM using a LV as virtual disk device (VM XML conf file): >> >> ? >> ? ? ? <-- for VM1 >> >> Each VM being defined in the cluster.conf with no hierarchical >> dependency on anything: >> >> >> ? ? ? ? ? ? ? > use_virsh="1"/> >> ? ? ? ? ? ? ? > use_virsh="1"/> >> ? ? ? ? ? ? ? > use_virsh="1"/> >> >> >> Failover and live migration work fine > > I tought that live migration without any access control on LVs would > cause some corruption on file systems. But, I guess that even without > exclusive activation, I should use CLVM, should I ? > there is no filesystem involved here, just raw devices with a boot sector and so on... >> VM's must be defined on all nodes (after creation on one node, copy >> the VM xml conf file to the other nodes and issue a virsh define >> /Path/to/the/xml file) > > I'm not using virsh because I have just learned the old-school way to > control VMs with xm. When I knew about that virsh tool, I had already > modified config files manually. > Would be better that I recreate all of them using libvirt infrastructure? > As mentioned above, my setup is KVM based, not xen, I just cannot use xm things... >> The only thing that may look unsecure is the fact that the LV's are >> active on all the nodes, a problem could happen if someone manually >> started the VM's on some nodes while already active on another one. > > That's the point who made me ask for help here sometime ago, and what > more concerns me. cf above, my point about the fact that there is no filesystem involved .. > Rafael Miranda told me about his lvm-cluster resource script. So, I > developed a simple script, that performs start, stop, and status > operations. For stop, it saves the VM to a stat file. For start, it > either restores the VM if there is a stat file for ir, or creates it > if there is not. Status just return sucess if the VM appears on xm > list, or failure if not. Stats files should be saved on a GFS > directory, mounted on both nodes. > Rafael's resource works fine, but the thing with VM's is that one wants to still benefit from live migration capabilities, etc... As stated Lon, if the VM is part of a given service, live migration won't be possible. > Then, I configure each VM as a service, with its lvm-cluster and > script resources. > > So, relocating a "vm service" will look like a semi-live migration, if > I can call like this. =) Actually, it saves the VM in shared directory > and restores it on the other node in a little time, without reseting > it. It will look just like the VM had stopped for a little time and > came back. > > But now I'm thinking if I have tried to reinvent the whell. =) > Well, saving on disk can take time, I'm not sure it's the way to take for migration purposes. The faster it is the better it will be. Your approach would require a VM freeze, than dump on shared disk (time proportionnal to VM size) than the VM wakup on the other node. Imagine the consequences of a clock jump on a server VM... >> I'll try the setup with exclusive activation and check if live >> migration still works (I doubt that). >> I was a bit optimistic, thinking that libvirt/virsh was doing a vgchange -a y if the "disk source dev" is a LV, in fact it doesn't. 
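For reference, the by-hand equivalent of the two activation modes in question (the volume group name below is a placeholder, it isn't given in the thread):

    lvchange -ay  vgvm/lvolVM1     # shared: on a clustered VG with clvmd the LV comes up on every node
    lvchange -aey vgvm/lvolVM1     # exclusive: only the requesting node activates it, other nodes are refused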
It doesn't seem to be possible to benefit from live migration when exclusive activation is "active". Or if anyone has the thing to instruct libvirt/virsh to execute a script prior to accessing the storage... >> Brem >> > > What do you think about this? > > Thank you. > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From nicolas.ferre at univ-provence.fr Tue Oct 6 07:26:32 2009 From: nicolas.ferre at univ-provence.fr (=?ISO-8859-1?Q?Nicolas_Ferr=E9?=) Date: Tue, 06 Oct 2009 09:26:32 +0200 Subject: [Linux-cluster] Large load due to Message-ID: <4ACAF128.7010205@univ-provence.fr> Hi, Few days ago, I reported a problem with a heavy load on our cluster login node. I can reproduce the problem, ps axl | grep " D" gives: 0 2001 2368 1 16 0 66088 1572 just_s Ds ? 0:00 -bash 0 0 9854 23584 18 0 63260 800 pipe_w S+ pts/6 0:00 grep D 4 0 11972 1 15 0 66216 1632 just_s Ds ? 0:00 -bash 0 0 21576 1 17 0 73916 824 just_s D ? 0:00 ls --color=tty /scratch/lctmm 0 2001 26178 1 15 0 66092 1600 just_s Ds ? 0:00 -bash Actually the problem is related to various attempts to access the directory /scratch/lctmm which simply freezes the login bash and turns it to a zombie status. Unfortunately, I can't kill these zombies and I have to reboot the login node. The /scratch volume is gfs2. I can access to other /scratch subdirectories. Has someone already experienced such a problem? -- Nicolas Ferre' Laboratoire Chimie Provence Universite' de Provence - France Tel: +33 491282733 http://sites.univ-provence.fr/lcp-ct From jakov.sosic at srce.hr Tue Oct 6 08:49:54 2009 From: jakov.sosic at srce.hr (Jakov Sosic) Date: Tue, 6 Oct 2009 10:49:54 +0200 Subject: [Linux-cluster] vm.sh with and without virsh In-Reply-To: <1254750696.14760.4151.camel@localhost.localdomain> References: <4AC88EDF.8040408@io-consulting.net> <4AC904FF.5040507@fi.upm.es> <4AC924C0.10108@io-consulting.net> <20091005102911.06881491@nb-jsosic> <2fc5f090910050510wd514a13m8739c0920938c3c0@mail.gmail.com> <1254750696.14760.4151.camel@localhost.localdomain> Message-ID: <20091006104954.2ecfa596@nb-jsosic> On Mon, 05 Oct 2009 09:51:36 -0400 Lon Hohberger wrote: > You can use as a child of , but you can't migrate them > if you do. I'm still on RHEL 5.3. Can I take vm.sh from 5.4, and use it in 5.3, because I would like to use virsh because sometimes VM in cluster hangs in "migrating" state, and I would like to try the virsh based vm.sh to try and avoid that bug. 
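One cautious way to try that is to unpack the newer rgmanager package to one side and diff the agent before copying anything over. The package file name below is just the 5.4 build mentioned earlier in the thread, and whether the 5.4 agent really behaves under a 5.3 rgmanager has not been verified here:

    rpm2cpio rgmanager-2.0.52-1.el5.x86_64.rpm | cpio -idmv './usr/share/cluster/vm.sh'
    diff -u /usr/share/cluster/vm.sh ./usr/share/cluster/vm.sh
    cp -a /usr/share/cluster/vm.sh /usr/share/cluster/vm.sh.5.3.orig    # keep the stock agent
    cp ./usr/share/cluster/vm.sh /usr/share/cluster/vm.sh

rgmanager loads the resource rules when it starts (as the "Loading Resource Rules" log lines earlier in this thread show), so a restart of rgmanager on that node, ideally a test node first, would likely be needed to pick the new script up.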
-- | Jakov Sosic | ICQ: 28410271 | PGP: 0x965CAE2D | ================================================================= | start fighting cancer -> http://www.worldcommunitygrid.org/ | From pradhanparas at gmail.com Tue Oct 6 15:31:40 2009 From: pradhanparas at gmail.com (Paras pradhan) Date: Tue, 6 Oct 2009 10:31:40 -0500 Subject: [Linux-cluster] openais issue In-Reply-To: <4AC9E445.3050702@roma1.infn.it> References: <8b711df40909161052i41f9c007na85cf2ef2919ae0f@mail.gmail.com> <29ae894c0909290951v11a958e2k3a1aadce7f3b88e7@mail.gmail.com> <8b711df40909291337n2f26908dt363944c6238eb9f5@mail.gmail.com> <29ae894c0909291344l49a2a810t33582eb6c3932810@mail.gmail.com> <8b711df40909291354w55f92097wcdef691d0b239dee@mail.gmail.com> <4AC2986F.8050100@io-consulting.net> <8b711df40909300749i2af6a711v2f866c55a046a388@mail.gmail.com> <29ae894c0909300802oa8d72c5k892115a3e2f67db9@mail.gmail.com> <8b711df40909300811q1724aa68hfea589bbb32b4ce5@mail.gmail.com> <4AC9E445.3050702@roma1.infn.it> Message-ID: <8b711df40910060831n6b6c7a6aw27761c89403c51e1@mail.gmail.com> So you mean your cluster is running fine with the CMAN cman-2.0.115-1.el5.x86_64 ? Which version of openais are you running? Thanks Paras. On Mon, Oct 5, 2009 at 7:19 AM, Daniela Anzellotti wrote: > Hi all, > > I had a problem similar to Paras's one today: yum updated the following rpms > last week and today (I had to restart the cluster) the cluster was not able > to start vm: services. > > Oct 02 05:31:05 Updated: openais-0.80.6-8.el5.x86_64 > Oct 02 05:31:07 Updated: cman-2.0.115-1.el5.x86_64 > Oct 02 05:31:10 Updated: rgmanager-2.0.52-1.el5.x86_64 > > Oct 03 04:03:12 Updated: xen-libs-3.0.3-94.el5_4.1.x86_64 > Oct 03 04:03:12 Updated: xen-libs-3.0.3-94.el5_4.1.i386 > Oct 03 04:03:16 Updated: xen-3.0.3-94.el5_4.1.x86_64 > > > So, after checked the vm.sh script, I add the declaration use_virsh="0" in > the VM definition in the cluster.conf (as suggested by Brem, thanks!) and > everything is now working again. > > > BTW I didn't understand if the problem was caused by the new XEN version or > the new openais one, thus I disabled automatic updates for both. > > I hope I'll not have any other bad surprise... > > Thank you, > cheers, > Daniela > > > Paras pradhan wrote: >> >> Yes this is very strange. I don't know what to do now. May be re >> create the cluster? But not a good solution actually. >> >> Packages : >> >> Kernel: kernel-xen-2.6.18-164.el5 >> OS: Full updated of CentOS 5.3 except CMAN downgraded to cman-2.0.98-1.el5 >> >> Other packages related to cluster suite: >> >> rgmanager-2.0.52-1.el5.centos >> cman-2.0.98-1.el5 >> xen-3.0.3-80.el5_3.3 >> xen-libs-3.0.3-80.el5_3.3 >> kmod-gfs-xen-0.1.31-3.el5_3.1 >> kmod-gfs-xen-0.1.31-3.el5_3.1 >> kmod-gfs-0.1.31-3.el5_3.1 >> gfs-utils-0.1.18-1.el5 >> gfs2-utils-0.1.62-1.el5 >> lvm2-2.02.40-6.el5 >> lvm2-cluster-2.02.40-7.el5 >> openais-0.80.3-22.el5_3.9 >> >> Thanks! >> Paras. >> >> >> >> >> On Wed, Sep 30, 2009 at 10:02 AM, brem belguebli >> wrote: >>> >>> Hi Paras, >>> >>> Your cluster.conf file seems correct. If it is not a ntp issue, I >>> don't see anything except a bug that causes this, or some prerequisite >>> that is not respected. >>> >>> May be you could post the versions (os, kernel, packages etc...) you >>> are using, someone may have hit the same issue with your versions. >>> >>> Brem >>> >>> 2009/9/30, Paras pradhan : >>>> >>>> All of the nodes are synced with ntp server. So this is not the case >>>> with me. >>>> >>>> Thanks >>>> Paras. 
>>>> >>>> On Tue, Sep 29, 2009 at 6:29 PM, Johannes Ru?ek >>>> wrote: >>>>> >>>>> make sure the time on the nodes is in sync, apparently when a node has >>>>> too >>>>> much offset, you won't see rgmanager (even though the process is >>>>> running). >>>>> this happened today and setting the time fixed it for me. afaicr there >>>>> was >>>>> no sign of this in the logs though. >>>>> johannes >>>>> >>>>> Paras pradhan schrieb: >>>>>> >>>>>> I don't see rgmanager . >>>>>> >>>>>> Here is the o/p from clustat >>>>>> >>>>>> [root at cvtst1 cluster]# clustat >>>>>> Cluster Status for test @ Tue Sep 29 15:53:33 2009 >>>>>> Member Status: Quorate >>>>>> >>>>>> ?Member Name ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ID >>>>>> Status >>>>>> ?------ ---- ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ---- >>>>>> ------ >>>>>> ?cvtst2 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?1 Online >>>>>> ?cvtst1 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? 2 Online, >>>>>> Local >>>>>> ?cvtst3 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? 3 Online >>>>>> >>>>>> >>>>>> Thanks >>>>>> Paras. >>>>>> >>>>>> On Tue, Sep 29, 2009 at 3:44 PM, brem belguebli >>>>>> wrote: >>>>>> >>>>>>> It looks correct, rgmanager seems to start on all nodes >>>>>>> >>>>>>> what gives you clustat ? >>>>>>> >>>>>>> If rgmanager doesn't show, check out the logs something may have gone >>>>>>> wrong. >>>>>>> >>>>>>> >>>>>>> 2009/9/29 Paras pradhan : >>>>>>> >>>>>>>> Change to 7 and i got this log >>>>>>>> >>>>>>>> Sep 29 15:33:50 cvtst1 rgmanager: [23295]: Shutting down >>>>>>>> Cluster Service Manager... >>>>>>>> Sep 29 15:33:50 cvtst1 clurgmgrd[22869]: Shutting down >>>>>>>> Sep 29 15:33:50 cvtst1 clurgmgrd[22869]: Shutting down >>>>>>>> Sep 29 15:33:50 cvtst1 clurgmgrd[22869]: Shutdown complete, >>>>>>>> exiting >>>>>>>> Sep 29 15:33:50 cvtst1 rgmanager: [23295]: Cluster Service >>>>>>>> Manager is stopped. >>>>>>>> Sep 29 15:33:51 cvtst1 clurgmgrd[23324]: Resource Group >>>>>>>> Manager Starting >>>>>>>> Sep 29 15:33:51 cvtst1 clurgmgrd[23324]: Loading Service Data >>>>>>>> Sep 29 15:33:51 cvtst1 clurgmgrd[23324]: Loading Resource >>>>>>>> Rules >>>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: 21 rules loaded >>>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: Building Resource >>>>>>>> Trees >>>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: 0 resources defined >>>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: Loading Failover >>>>>>>> Domains >>>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: 1 domains defined >>>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: 1 events defined >>>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: Initializing >>>>>>>> Services >>>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: Services Initialized >>>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: Event: Port Opened >>>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: State change: Local >>>>>>>> UP >>>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: State change: cvtst2 >>>>>>>> UP >>>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: State change: cvtst3 >>>>>>>> UP >>>>>>>> Sep 29 15:33:57 cvtst1 clurgmgrd[23324]: Event (1:2:1) >>>>>>>> Processed >>>>>>>> Sep 29 15:33:57 cvtst1 clurgmgrd[23324]: Event (0:1:1) >>>>>>>> Processed >>>>>>>> Sep 29 15:33:57 cvtst1 clurgmgrd[23324]: Event (0:3:1) >>>>>>>> Processed >>>>>>>> Sep 29 15:34:02 cvtst1 clurgmgrd[23324]: 3 events processed >>>>>>>> >>>>>>>> >>>>>>>> Anything unusual here? >>>>>>>> >>>>>>>> Paras. 
>>>>>>>> >>>>>>>> On Tue, Sep 29, 2009 at 11:51 AM, brem belguebli >>>>>>>> wrote: >>>>>>>> >>>>>>>>> I use log_level=7 to have more debugging info. >>>>>>>>> >>>>>>>>> It seems 4 is not enough. >>>>>>>>> >>>>>>>>> Brem >>>>>>>>> >>>>>>>>> >>>>>>>>> 2009/9/29, Paras pradhan : >>>>>>>>> >>>>>>>>>> Withe log_level of 3 I got only this >>>>>>>>>> >>>>>>>>>> Sep 29 10:31:31 cvtst1 rgmanager: [7170]: Shutting down >>>>>>>>>> Cluster Service Manager... >>>>>>>>>> Sep 29 10:31:31 cvtst1 clurgmgrd[6673]: Shutting down >>>>>>>>>> Sep 29 10:31:41 cvtst1 clurgmgrd[6673]: Shutdown >>>>>>>>>> complete, >>>>>>>>>> exiting >>>>>>>>>> Sep 29 10:31:41 cvtst1 rgmanager: [7170]: Cluster Service >>>>>>>>>> Manager is stopped. >>>>>>>>>> Sep 29 10:31:42 cvtst1 clurgmgrd[7224]: Resource Group >>>>>>>>>> Manager Starting >>>>>>>>>> Sep 29 10:39:06 cvtst1 rgmanager: [10327]: Shutting down >>>>>>>>>> Cluster Service Manager... >>>>>>>>>> Sep 29 10:39:16 cvtst1 rgmanager: [10327]: Cluster >>>>>>>>>> Service >>>>>>>>>> Manager is stopped. >>>>>>>>>> Sep 29 10:39:16 cvtst1 clurgmgrd[10380]: Resource Group >>>>>>>>>> Manager Starting >>>>>>>>>> Sep 29 10:39:52 cvtst1 clurgmgrd[10380]: Member 1 >>>>>>>>>> shutting >>>>>>>>>> down >>>>>>>>>> >>>>>>>>>> I do not know what the last line means. >>>>>>>>>> >>>>>>>>>> rgmanager version I am running is: >>>>>>>>>> rgmanager-2.0.52-1.el5.centos >>>>>>>>>> >>>>>>>>>> I don't what has gone wrong. >>>>>>>>>> >>>>>>>>>> Thanks >>>>>>>>>> Paras. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Mon, Sep 28, 2009 at 6:41 PM, brem belguebli >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>>> you mean it stopped successfully on all the nodes but it is >>>>>>>>>>> failing >>>>>>>>>>> to >>>>>>>>>>> start only on node cvtst1 ? >>>>>>>>>>> >>>>>>>>>>> look at the following page ?to make rgmanager more verbose. It >>>>>>>>>>> 'll >>>>>>>>>>> help debug.... >>>>>>>>>>> >>>>>>>>>>> http://sources.redhat.com/cluster/wiki/RGManager >>>>>>>>>>> >>>>>>>>>>> at Logging Configuration section >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> 2009/9/29 Paras pradhan : >>>>>>>>>>> >>>>>>>>>>>> Brem, >>>>>>>>>>>> >>>>>>>>>>>> When I try to restart rgmanager on all the nodes, this time i do >>>>>>>>>>>> not >>>>>>>>>>>> see rgmanager running on the first node. But I do see on other 2 >>>>>>>>>>>> nodes. >>>>>>>>>>>> >>>>>>>>>>>> Log on the first node: >>>>>>>>>>>> >>>>>>>>>>>> Sep 28 18:13:58 cvtst1 clurgmgrd[24099]: Resource Group >>>>>>>>>>>> Manager Starting >>>>>>>>>>>> Sep 28 18:17:29 cvtst1 rgmanager: [24627]: Shutting >>>>>>>>>>>> down >>>>>>>>>>>> Cluster Service Manager... >>>>>>>>>>>> Sep 28 18:17:29 cvtst1 clurgmgrd[24099]: Shutting down >>>>>>>>>>>> Sep 28 18:17:39 cvtst1 clurgmgrd[24099]: Shutdown >>>>>>>>>>>> complete, >>>>>>>>>>>> exiting >>>>>>>>>>>> Sep 28 18:17:39 cvtst1 rgmanager: [24627]: Cluster >>>>>>>>>>>> Service >>>>>>>>>>>> Manager is stopped. >>>>>>>>>>>> Sep 28 18:17:40 cvtst1 clurgmgrd[24679]: Resource Group >>>>>>>>>>>> Manager Starting >>>>>>>>>>>> >>>>>>>>>>>> - >>>>>>>>>>>> It seems service is running , ?but I do not see rgmanger running >>>>>>>>>>>> using clustat >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> Don't know what is going on. >>>>>>>>>>>> >>>>>>>>>>>> Thanks >>>>>>>>>>>> Paras. >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> On Mon, Sep 28, 2009 at 5:46 PM, brem belguebli >>>>>>>>>>>> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> Paras, >>>>>>>>>>>>> >>>>>>>>>>>>> Another thing, it would have been more interesting to have a >>>>>>>>>>>>> start >>>>>>>>>>>>> DEBUG not a stop. 
>>>>>>>>>>>>> >>>>>>>>>>>>> That's why I was asking you to first stop the vm manually on >>>>>>>>>>>>> all >>>>>>>>>>>>> your >>>>>>>>>>>>> nodes, stop eventually rgmanager on all the nodes to reset the >>>>>>>>>>>>> potential wrong states you may have, restart rgmanager. >>>>>>>>>>>>> >>>>>>>>>>>>> If your VM is configured to autostart, this will make it start. >>>>>>>>>>>>> >>>>>>>>>>>>> It should normally fail (as it does now). Send out your newly >>>>>>>>>>>>> created >>>>>>>>>>>>> DEBUG file. >>>>>>>>>>>>> >>>>>>>>>>>>> 2009/9/29 brem belguebli : >>>>>>>>>>>>> >>>>>>>>>>>>>> Hi Paras, >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> I don't know the xen/cluster combination well, but if I do >>>>>>>>>>>>>> remember >>>>>>>>>>>>>> well, I think I've read somewhere that when using xen you have >>>>>>>>>>>>>> to >>>>>>>>>>>>>> declare the use_virsh=0 key in the VM definition in the >>>>>>>>>>>>>> cluster.conf. >>>>>>>>>>>>>> >>>>>>>>>>>>>> This would make rgmanager use xm commands instead of virsh >>>>>>>>>>>>>> The DEBUG output shows clearly that you are using virsh to >>>>>>>>>>>>>> manage >>>>>>>>>>>>>> your >>>>>>>>>>>>>> VM instead of xm commands. >>>>>>>>>>>>>> Check out the RH docs about virtualization >>>>>>>>>>>>>> >>>>>>>>>>>>>> I'm not a 100% sure about that, I may be completely wrong. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Brem >>>>>>>>>>>>>> >>>>>>>>>>>>>> 2009/9/28 Paras pradhan : >>>>>>>>>>>>>> >>>>>>>>>>>>>>> The only thing I noticed is the message after stopping the vm >>>>>>>>>>>>>>> using xm >>>>>>>>>>>>>>> in all nodes and starting using clusvcadm is >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> "Virtual machine guest1 is blocked" >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> The whole DEBUG file is attached. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Thanks >>>>>>>>>>>>>>> Paras. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Fri, Sep 25, 2009 at 5:53 PM, brem belguebli >>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> There's a problem with the script that is called by >>>>>>>>>>>>>>>> rgmanager to >>>>>>>>>>>>>>>> start >>>>>>>>>>>>>>>> the VM, I don't know what causes it >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> May be you should try something like : >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> 1) stop the VM on all nodes with xm commands >>>>>>>>>>>>>>>> 2) edit the /usr/share/cluster/vm.sh script and add the >>>>>>>>>>>>>>>> following >>>>>>>>>>>>>>>> lines (after the #!/bin/bash ): >>>>>>>>>>>>>>>> ?exec >/tmp/DEBUG 2>&1 >>>>>>>>>>>>>>>> ?set -x >>>>>>>>>>>>>>>> 3) start the VM with clusvcadm -e vm:guest1 >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> It should fail as it did before. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> edit the the /tmp/DEBUG file and you will be able to see >>>>>>>>>>>>>>>> where >>>>>>>>>>>>>>>> it >>>>>>>>>>>>>>>> fails (it may generate a lot of debug) >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> 4) remove the debug lines from /usr/share/cluster/vm.sh >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Post the DEBUG file if you're not able to see where it >>>>>>>>>>>>>>>> fails. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Brem >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> 2009/9/26 Paras pradhan : >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> No I am not manually starting not using automatic init >>>>>>>>>>>>>>>>> scripts. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> I started the vm using: clusvcadm -e vm:guest1 >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> I have just stopped using clusvcadm -s vm:guest1. For few >>>>>>>>>>>>>>>>> seconds it >>>>>>>>>>>>>>>>> says guest1 started . But after a while I can see the >>>>>>>>>>>>>>>>> guest1 on >>>>>>>>>>>>>>>>> all >>>>>>>>>>>>>>>>> three nodes. 
>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> clustat says: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> ?Service Name >>>>>>>>>>>>>>>>> ?Owner >>>>>>>>>>>>>>>>> (Last) >>>>>>>>>>>>>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?State >>>>>>>>>>>>>>>>> ?------- ---- >>>>>>>>>>>>>>>>> ?----- >>>>>>>>>>>>>>>>> ------ >>>>>>>>>>>>>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?----- >>>>>>>>>>>>>>>>> ?vm:guest1 >>>>>>>>>>>>>>>>> (none) >>>>>>>>>>>>>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?stopped >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> But I can see the vm from xm li. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> This is what I can see from the log: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Sep 25 17:19:01 cvtst1 clurgmgrd[4298]: start on >>>>>>>>>>>>>>>>> vm >>>>>>>>>>>>>>>>> "guest1" >>>>>>>>>>>>>>>>> returned 1 (generic error) >>>>>>>>>>>>>>>>> Sep 25 17:19:01 cvtst1 clurgmgrd[4298]: #68: >>>>>>>>>>>>>>>>> Failed >>>>>>>>>>>>>>>>> to start >>>>>>>>>>>>>>>>> vm:guest1; return value: 1 >>>>>>>>>>>>>>>>> Sep 25 17:19:01 cvtst1 clurgmgrd[4298]: Stopping >>>>>>>>>>>>>>>>> service vm:guest1 >>>>>>>>>>>>>>>>> Sep 25 17:19:02 cvtst1 clurgmgrd[4298]: Service >>>>>>>>>>>>>>>>> vm:guest1 is >>>>>>>>>>>>>>>>> recovering >>>>>>>>>>>>>>>>> Sep 25 17:19:15 cvtst1 clurgmgrd[4298]: Recovering >>>>>>>>>>>>>>>>> failed >>>>>>>>>>>>>>>>> service vm:guest1 >>>>>>>>>>>>>>>>> Sep 25 17:19:16 cvtst1 clurgmgrd[4298]: start on >>>>>>>>>>>>>>>>> vm >>>>>>>>>>>>>>>>> "guest1" >>>>>>>>>>>>>>>>> returned 1 (generic error) >>>>>>>>>>>>>>>>> Sep 25 17:19:16 cvtst1 clurgmgrd[4298]: #68: >>>>>>>>>>>>>>>>> Failed >>>>>>>>>>>>>>>>> to start >>>>>>>>>>>>>>>>> vm:guest1; return value: 1 >>>>>>>>>>>>>>>>> Sep 25 17:19:16 cvtst1 clurgmgrd[4298]: Stopping >>>>>>>>>>>>>>>>> service vm:guest1 >>>>>>>>>>>>>>>>> Sep 25 17:19:17 cvtst1 clurgmgrd[4298]: Service >>>>>>>>>>>>>>>>> vm:guest1 is >>>>>>>>>>>>>>>>> recovering >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Paras. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On Fri, Sep 25, 2009 at 5:07 PM, brem belguebli >>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Have you started ?your VM via rgmanager (clusvcadm -e >>>>>>>>>>>>>>>>>> vm:guest1) or >>>>>>>>>>>>>>>>>> using xm commands out of cluster control ?(or maybe a thru >>>>>>>>>>>>>>>>>> an >>>>>>>>>>>>>>>>>> automatic init script ?) >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> When clustered, you should never be starting services >>>>>>>>>>>>>>>>>> (manually or >>>>>>>>>>>>>>>>>> thru automatic init script) out of cluster control >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> The thing would be to stop your vm on all the nodes with >>>>>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>>>> adequate >>>>>>>>>>>>>>>>>> xm command (not using xen myself) and try to start it with >>>>>>>>>>>>>>>>>> clusvcadm. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Then see if it is started on all nodes (send clustat >>>>>>>>>>>>>>>>>> output) >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> 2009/9/25 Paras pradhan : >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Ok. Please see below. my vm is running on all nodes >>>>>>>>>>>>>>>>>>> though >>>>>>>>>>>>>>>>>>> clustat >>>>>>>>>>>>>>>>>>> says it is stopped. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>> [root at cvtst1 ~]# clustat >>>>>>>>>>>>>>>>>>> Cluster Status for test @ Fri Sep 25 16:52:34 2009 >>>>>>>>>>>>>>>>>>> Member Status: Quorate >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> ?Member Name >>>>>>>>>>>>>>>>>>> ? ?ID ? Status >>>>>>>>>>>>>>>>>>> ?------ ---- >>>>>>>>>>>>>>>>>>> ? 
?---- ------ >>>>>>>>>>>>>>>>>>> ?cvtst2 >>>>>>>>>>>>>>>>>>> ?1 >>>>>>>>>>>>>>>>>>> Online, rgmanager >>>>>>>>>>>>>>>>>>> ?cvtst1 >>>>>>>>>>>>>>>>>>> ? 2 >>>>>>>>>>>>>>>>>>> Online, >>>>>>>>>>>>>>>>>>> Local, rgmanager >>>>>>>>>>>>>>>>>>> ?cvtst3 >>>>>>>>>>>>>>>>>>> ? 3 >>>>>>>>>>>>>>>>>>> Online, rgmanager >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> ?Service Name >>>>>>>>>>>>>>>>>>> ?Owner (Last) >>>>>>>>>>>>>>>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?State >>>>>>>>>>>>>>>>>>> ?------- ---- >>>>>>>>>>>>>>>>>>> ?----- ------ >>>>>>>>>>>>>>>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?----- >>>>>>>>>>>>>>>>>>> ?vm:guest1 >>>>>>>>>>>>>>>>>>> (none) >>>>>>>>>>>>>>>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?stopped >>>>>>>>>>>>>>>>>>> [root at cvtst1 ~]# >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> --- >>>>>>>>>>>>>>>>>>> o/p of xm li on cvtst1 >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>> [root at cvtst1 ~]# xm li >>>>>>>>>>>>>>>>>>> Name ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?ID Mem(MiB) >>>>>>>>>>>>>>>>>>> VCPUs >>>>>>>>>>>>>>>>>>> State ? Time(s) >>>>>>>>>>>>>>>>>>> Domain-0 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? 0 ? ? 3470 >>>>>>>>>>>>>>>>>>> 2 >>>>>>>>>>>>>>>>>>> r----- ?28939.4 >>>>>>>>>>>>>>>>>>> guest1 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? 7 ? ? ?511 >>>>>>>>>>>>>>>>>>> 1 >>>>>>>>>>>>>>>>>>> -b---- ? 7727.8 >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> o/p of xm li on cvtst2 >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>> [root at cvtst2 ~]# xm li >>>>>>>>>>>>>>>>>>> Name ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?ID Mem(MiB) >>>>>>>>>>>>>>>>>>> VCPUs >>>>>>>>>>>>>>>>>>> State ? Time(s) >>>>>>>>>>>>>>>>>>> Domain-0 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? 0 ? ? 3470 >>>>>>>>>>>>>>>>>>> 2 >>>>>>>>>>>>>>>>>>> r----- ?31558.9 >>>>>>>>>>>>>>>>>>> guest1 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?21 ? ? ?511 >>>>>>>>>>>>>>>>>>> 1 >>>>>>>>>>>>>>>>>>> -b---- ? 7558.2 >>>>>>>>>>>>>>>>>>> --- >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Thanks >>>>>>>>>>>>>>>>>>> Paras. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> On Fri, Sep 25, 2009 at 4:22 PM, brem belguebli >>>>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> It looks like no. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> can you send an output of clustat ?of when the VM is >>>>>>>>>>>>>>>>>>>> running >>>>>>>>>>>>>>>>>>>> on >>>>>>>>>>>>>>>>>>>> multiple nodes at the same time? >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> And by the way, another one after having stopped >>>>>>>>>>>>>>>>>>>> (clusvcadm >>>>>>>>>>>>>>>>>>>> -s vm:guest1) ? >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> 2009/9/25 Paras pradhan : >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> Anyone having issue as mine? Virtual machine service is >>>>>>>>>>>>>>>>>>>>> not >>>>>>>>>>>>>>>>>>>>> being >>>>>>>>>>>>>>>>>>>>> properly handled by the cluster. >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> Thanks >>>>>>>>>>>>>>>>>>>>> Paras. >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> On Mon, Sep 21, 2009 at 9:55 AM, Paras pradhan >>>>>>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> Ok.. here is my cluster.conf file >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>>>>> [root at cvtst1 cluster]# more cluster.conf >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> ? ? ?>>>>>>>>>>>>>>>>>>>>> post_join_delay="3"/> >>>>>>>>>>>>>>>>>>>>>> ? ? ? >>>>>>>>>>>>>>>>>>>>>> ? ? ? ? ? ? 
?>>>>>>>>>>>>>>>>>>>>> votes="1"> >>>>>>>>>>>>>>>>>>>>>> ? ? ? ? ? ? ? ? ? ? ? >>>>>>>>>>>>>>>>>>>>>> ? ? ? ? ? ? ? >>>>>>>>>>>>>>>>>>>>>> ? ? ? ? ? ? ?>>>>>>>>>>>>>>>>>>>>> votes="1"> >>>>>>>>>>>>>>>>>>>>>> ? ? ? ? ? ? ? ? ? ? ? >>>>>>>>>>>>>>>>>>>>>> ? ? ? ? ? ? ? >>>>>>>>>>>>>>>>>>>>>> ? ? ? ? ? ? ?>>>>>>>>>>>>>>>>>>>>> votes="1"> >>>>>>>>>>>>>>>>>>>>>> ? ? ? ? ? ? ? ? ? ? ? >>>>>>>>>>>>>>>>>>>>>> ? ? ? ? ? ? ? >>>>>>>>>>>>>>>>>>>>>> ? ? ? >>>>>>>>>>>>>>>>>>>>>> ? ? ? >>>>>>>>>>>>>>>>>>>>>> ? ? ? >>>>>>>>>>>>>>>>>>>>>> ? ? ? >>>>>>>>>>>>>>>>>>>>>> ? ? ? ? ? ? ? >>>>>>>>>>>>>>>>>>>>>> ? ? ? ? ? ? ? ? ? ? ?>>>>>>>>>>>>>>>>>>>>> nofailback="0" ordered="1" restricted="0"> >>>>>>>>>>>>>>>>>>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?>>>>>>>>>>>>>>>>>>>>> name="cvtst2" priority="3"/> >>>>>>>>>>>>>>>>>>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?>>>>>>>>>>>>>>>>>>>>> name="cvtst1" priority="1"/> >>>>>>>>>>>>>>>>>>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?>>>>>>>>>>>>>>>>>>>>> name="cvtst3" priority="2"/> >>>>>>>>>>>>>>>>>>>>>> ? ? ? ? ? ? ? ? ? ? ? >>>>>>>>>>>>>>>>>>>>>> ? ? ? ? ? ? ? >>>>>>>>>>>>>>>>>>>>>> ? ? ? ? ? ? ? >>>>>>>>>>>>>>>>>>>>>> ? ? ? ? ? ? ?>>>>>>>>>>>>>>>>>>>>> exclusive="0" max_restarts="0" >>>>>>>>>>>>>>>>>>>>>> name="guest1" path="/vms" recovery="r >>>>>>>>>>>>>>>>>>>>>> estart" restart_expire_time="0"/> >>>>>>>>>>>>>>>>>>>>>> ? ? ? >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> [root at cvtst1 cluster]# >>>>>>>>>>>>>>>>>>>>>> ------ >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> Thanks! >>>>>>>>>>>>>>>>>>>>>> Paras. >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> On Sun, Sep 20, 2009 at 9:44 AM, Volker Dormeyer >>>>>>>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> On Fri, Sep 18, 2009 at 05:08:57PM -0500, >>>>>>>>>>>>>>>>>>>>>>> Paras pradhan wrote: >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> I am using cluster suite for HA of xen virtual >>>>>>>>>>>>>>>>>>>>>>>> machines. >>>>>>>>>>>>>>>>>>>>>>>> Now I am >>>>>>>>>>>>>>>>>>>>>>>> having another problem. When I start the my xen vm >>>>>>>>>>>>>>>>>>>>>>>> in >>>>>>>>>>>>>>>>>>>>>>>> one node, it >>>>>>>>>>>>>>>>>>>>>>>> also starts on other nodes. Which daemon controls >>>>>>>>>>>>>>>>>>>>>>>> ?this? >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> This is usually done bei clurgmgrd (which is part of >>>>>>>>>>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>>>>>>>>> rgmanager >>>>>>>>>>>>>>>>>>>>>>> package). To me, this sounds like a configuration >>>>>>>>>>>>>>>>>>>>>>> problem. Maybe, >>>>>>>>>>>>>>>>>>>>>>> you can post your cluster.conf? 
>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> Regards, >>>>>>>>>>>>>>>>>>>>>>> Volker >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>>>>>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>>>>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>>>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>> -- >>>>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>> -- >>>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>> -- >>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> -- >>>>>>>>>> Linux-cluster mailing list >>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>> >>>>>>>>>> >>>>>>>>> -- >>>>>>>>> Linux-cluster mailing list >>>>>>>>> Linux-cluster at redhat.com >>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>> >>>>>>>>> >>>>>>>> -- >>>>>>>> Linux-cluster mailing list >>>>>>>> Linux-cluster at redhat.com >>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>> >>>>>>>> >>>>>>> -- >>>>>>> Linux-cluster mailing list >>>>>>> Linux-cluster at redhat.com >>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>> >>>>>>> >>>>>> -- >>>>>> Linux-cluster mailing list >>>>>> Linux-cluster at redhat.com >>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>> >>>>> -- >>>>> Linux-cluster mailing list >>>>> Linux-cluster at redhat.com >>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>> >>>> -- >>>> Linux-cluster mailing list >>>> Linux-cluster at 
redhat.com
>>>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>>>
>>> --
>>> Linux-cluster mailing list
>>> Linux-cluster at redhat.com
>>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>>
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>
>
> --
> - Daniela Anzellotti ------------------------------------
>  INFN Roma - tel.: +39.06.49914282 - fax: +39.06.490354
>  e-mail: daniela.anzellotti at roma1.infn.it
> ---------------------------------------------------------
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>

From daniela.anzellotti at roma1.infn.it Tue Oct 6 15:48:10 2009
From: daniela.anzellotti at roma1.infn.it (Daniela Anzellotti)
Date: Tue, 06 Oct 2009 17:48:10 +0200
Subject: [Linux-cluster] openais issue
In-Reply-To: <8b711df40910060831n6b6c7a6aw27761c89403c51e1@mail.gmail.com>
References: <8b711df40909161052i41f9c007na85cf2ef2919ae0f@mail.gmail.com> <29ae894c0909290951v11a958e2k3a1aadce7f3b88e7@mail.gmail.com> <8b711df40909291337n2f26908dt363944c6238eb9f5@mail.gmail.com> <29ae894c0909291344l49a2a810t33582eb6c3932810@mail.gmail.com> <8b711df40909291354w55f92097wcdef691d0b239dee@mail.gmail.com> <4AC2986F.8050100@io-consulting.net> <8b711df40909300749i2af6a711v2f866c55a046a388@mail.gmail.com> <29ae894c0909300802oa8d72c5k892115a3e2f67db9@mail.gmail.com> <8b711df40909300811q1724aa68hfea589bbb32b4ce5@mail.gmail.com> <4AC9E445.3050702@roma1.infn.it> <8b711df40910060831n6b6c7a6aw27761c89403c51e1@mail.gmail.com>
Message-ID: <4ACB66BA.8040002@roma1.infn.it>

Hi Paras,

yes. At least it looks so...

We have a cluster of two nodes + a quorum disk (it's not configured as a "two-node cluster").

They are running Scientific Linux 5.x, kernel 2.6.18-128.7.1.el5xen and

 openais-0.80.6-8.el5.x86_64
 cman-2.0.115-1.el5.x86_64
 rgmanager-2.0.52-1.el5.x86_64

The XEN VMs access the disk as simple block devices. Disks are on a SAN, configured with Clustered LVM.

 xen-3.0.3-94.el5_4.1.x86_64
 xen-libs-3.0.3-94.el5_4.1.x86_64

VM configuration files are as follows:

 name = "www1"
 uuid = "3bd3e910-23c0-97ee-55ab-086260ef1e53"
 memory = 1024
 maxmem = 1024
 vcpus = 1
 bootloader = "/usr/bin/pygrub"
 vfb = [ "type=vnc,vncunused=1,keymap=en-us" ]
 disk = [ "phy:/dev/vg_cluster/www1.disk,xvda,w", \
 "phy:/dev/vg_cluster/www1.swap,xvdb,w" ]
 vif = [ "mac=00:16:3e:da:00:07,bridge=xenbr1" ]
 on_poweroff = "destroy"
 on_reboot = "restart"
 on_crash = "restart"
 extra = "xencons=tty0 console=tty0"

I changed in /etc/cluster/cluster.conf all the VM directives from

 <vm ... migrate="live" name="www1" path="/etc/xen" recovery="restart"/>

to

 <vm ... use_virsh="0" migrate="live" name="www1" path="/etc/xen" recovery="restart"/>

Rebooted the cluster nodes and it started working again...

As I said, I hope I'll not have any other bad surprise (I tested a VM migration and it is working too), but at least the cluster is working now (it was not able to start a VM before)!

Ciao
Daniela

Paras pradhan wrote:
> So you mean your cluster is running fine with the CMAN
> cman-2.0.115-1.el5.x86_64 ?
>
> Which version of openais are you running?
>
> Thanks
> Paras.
>
>
> On Mon, Oct 5, 2009 at 7:19 AM, Daniela Anzellotti
> wrote:
>> Hi all,
>>
>> I had a problem similar to Paras's one today: yum updated the following rpms
>> last week and today (I had to restart the cluster) the cluster was not able
>> to start vm: services.
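A side note on rolling out this kind of cluster.conf change: instead of rebooting the nodes, the edited configuration can usually be pushed to a running cluster -- a sketch of the usual RHEL 5 procedure, assuming ccsd/cman are healthy on all nodes and that config_version has been incremented in the file first:

# on the node where cluster.conf was edited, after bumping config_version:
ccs_tool update /etc/cluster/cluster.conf

# or tell cman about the new version number directly (N = new config_version):
cman_tool version -r N

rgmanager normally picks up the new resource definitions on reconfiguration, so a full reboot is only really needed when the cluster is already wedged, as it was here.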
>> >> Oct 02 05:31:05 Updated: openais-0.80.6-8.el5.x86_64 >> Oct 02 05:31:07 Updated: cman-2.0.115-1.el5.x86_64 >> Oct 02 05:31:10 Updated: rgmanager-2.0.52-1.el5.x86_64 >> >> Oct 03 04:03:12 Updated: xen-libs-3.0.3-94.el5_4.1.x86_64 >> Oct 03 04:03:12 Updated: xen-libs-3.0.3-94.el5_4.1.i386 >> Oct 03 04:03:16 Updated: xen-3.0.3-94.el5_4.1.x86_64 >> >> >> So, after checked the vm.sh script, I add the declaration use_virsh="0" in >> the VM definition in the cluster.conf (as suggested by Brem, thanks!) and >> everything is now working again. >> >> >> BTW I didn't understand if the problem was caused by the new XEN version or >> the new openais one, thus I disabled automatic updates for both. >> >> I hope I'll not have any other bad surprise... >> >> Thank you, >> cheers, >> Daniela >> >> >> Paras pradhan wrote: >>> Yes this is very strange. I don't know what to do now. May be re >>> create the cluster? But not a good solution actually. >>> >>> Packages : >>> >>> Kernel: kernel-xen-2.6.18-164.el5 >>> OS: Full updated of CentOS 5.3 except CMAN downgraded to cman-2.0.98-1.el5 >>> >>> Other packages related to cluster suite: >>> >>> rgmanager-2.0.52-1.el5.centos >>> cman-2.0.98-1.el5 >>> xen-3.0.3-80.el5_3.3 >>> xen-libs-3.0.3-80.el5_3.3 >>> kmod-gfs-xen-0.1.31-3.el5_3.1 >>> kmod-gfs-xen-0.1.31-3.el5_3.1 >>> kmod-gfs-0.1.31-3.el5_3.1 >>> gfs-utils-0.1.18-1.el5 >>> gfs2-utils-0.1.62-1.el5 >>> lvm2-2.02.40-6.el5 >>> lvm2-cluster-2.02.40-7.el5 >>> openais-0.80.3-22.el5_3.9 >>> >>> Thanks! >>> Paras. >>> >>> >>> >>> >>> On Wed, Sep 30, 2009 at 10:02 AM, brem belguebli >>> wrote: >>>> Hi Paras, >>>> >>>> Your cluster.conf file seems correct. If it is not a ntp issue, I >>>> don't see anything except a bug that causes this, or some prerequisite >>>> that is not respected. >>>> >>>> May be you could post the versions (os, kernel, packages etc...) you >>>> are using, someone may have hit the same issue with your versions. >>>> >>>> Brem >>>> >>>> 2009/9/30, Paras pradhan : >>>>> All of the nodes are synced with ntp server. So this is not the case >>>>> with me. >>>>> >>>>> Thanks >>>>> Paras. >>>>> >>>>> On Tue, Sep 29, 2009 at 6:29 PM, Johannes Ru?ek >>>>> wrote: >>>>>> make sure the time on the nodes is in sync, apparently when a node has >>>>>> too >>>>>> much offset, you won't see rgmanager (even though the process is >>>>>> running). >>>>>> this happened today and setting the time fixed it for me. afaicr there >>>>>> was >>>>>> no sign of this in the logs though. >>>>>> johannes >>>>>> >>>>>> Paras pradhan schrieb: >>>>>>> I don't see rgmanager . >>>>>>> >>>>>>> Here is the o/p from clustat >>>>>>> >>>>>>> [root at cvtst1 cluster]# clustat >>>>>>> Cluster Status for test @ Tue Sep 29 15:53:33 2009 >>>>>>> Member Status: Quorate >>>>>>> >>>>>>> Member Name ID >>>>>>> Status >>>>>>> ------ ---- ---- >>>>>>> ------ >>>>>>> cvtst2 1 Online >>>>>>> cvtst1 2 Online, >>>>>>> Local >>>>>>> cvtst3 3 Online >>>>>>> >>>>>>> >>>>>>> Thanks >>>>>>> Paras. >>>>>>> >>>>>>> On Tue, Sep 29, 2009 at 3:44 PM, brem belguebli >>>>>>> wrote: >>>>>>> >>>>>>>> It looks correct, rgmanager seems to start on all nodes >>>>>>>> >>>>>>>> what gives you clustat ? >>>>>>>> >>>>>>>> If rgmanager doesn't show, check out the logs something may have gone >>>>>>>> wrong. >>>>>>>> >>>>>>>> >>>>>>>> 2009/9/29 Paras pradhan : >>>>>>>> >>>>>>>>> Change to 7 and i got this log >>>>>>>>> >>>>>>>>> Sep 29 15:33:50 cvtst1 rgmanager: [23295]: Shutting down >>>>>>>>> Cluster Service Manager... 
>>>>>>>>> Sep 29 15:33:50 cvtst1 clurgmgrd[22869]: Shutting down >>>>>>>>> Sep 29 15:33:50 cvtst1 clurgmgrd[22869]: Shutting down >>>>>>>>> Sep 29 15:33:50 cvtst1 clurgmgrd[22869]: Shutdown complete, >>>>>>>>> exiting >>>>>>>>> Sep 29 15:33:50 cvtst1 rgmanager: [23295]: Cluster Service >>>>>>>>> Manager is stopped. >>>>>>>>> Sep 29 15:33:51 cvtst1 clurgmgrd[23324]: Resource Group >>>>>>>>> Manager Starting >>>>>>>>> Sep 29 15:33:51 cvtst1 clurgmgrd[23324]: Loading Service Data >>>>>>>>> Sep 29 15:33:51 cvtst1 clurgmgrd[23324]: Loading Resource >>>>>>>>> Rules >>>>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: 21 rules loaded >>>>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: Building Resource >>>>>>>>> Trees >>>>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: 0 resources defined >>>>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: Loading Failover >>>>>>>>> Domains >>>>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: 1 domains defined >>>>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: 1 events defined >>>>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: Initializing >>>>>>>>> Services >>>>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: Services Initialized >>>>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: Event: Port Opened >>>>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: State change: Local >>>>>>>>> UP >>>>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: State change: cvtst2 >>>>>>>>> UP >>>>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: State change: cvtst3 >>>>>>>>> UP >>>>>>>>> Sep 29 15:33:57 cvtst1 clurgmgrd[23324]: Event (1:2:1) >>>>>>>>> Processed >>>>>>>>> Sep 29 15:33:57 cvtst1 clurgmgrd[23324]: Event (0:1:1) >>>>>>>>> Processed >>>>>>>>> Sep 29 15:33:57 cvtst1 clurgmgrd[23324]: Event (0:3:1) >>>>>>>>> Processed >>>>>>>>> Sep 29 15:34:02 cvtst1 clurgmgrd[23324]: 3 events processed >>>>>>>>> >>>>>>>>> >>>>>>>>> Anything unusual here? >>>>>>>>> >>>>>>>>> Paras. >>>>>>>>> >>>>>>>>> On Tue, Sep 29, 2009 at 11:51 AM, brem belguebli >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> I use log_level=7 to have more debugging info. >>>>>>>>>> >>>>>>>>>> It seems 4 is not enough. >>>>>>>>>> >>>>>>>>>> Brem >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> 2009/9/29, Paras pradhan : >>>>>>>>>> >>>>>>>>>>> Withe log_level of 3 I got only this >>>>>>>>>>> >>>>>>>>>>> Sep 29 10:31:31 cvtst1 rgmanager: [7170]: Shutting down >>>>>>>>>>> Cluster Service Manager... >>>>>>>>>>> Sep 29 10:31:31 cvtst1 clurgmgrd[6673]: Shutting down >>>>>>>>>>> Sep 29 10:31:41 cvtst1 clurgmgrd[6673]: Shutdown >>>>>>>>>>> complete, >>>>>>>>>>> exiting >>>>>>>>>>> Sep 29 10:31:41 cvtst1 rgmanager: [7170]: Cluster Service >>>>>>>>>>> Manager is stopped. >>>>>>>>>>> Sep 29 10:31:42 cvtst1 clurgmgrd[7224]: Resource Group >>>>>>>>>>> Manager Starting >>>>>>>>>>> Sep 29 10:39:06 cvtst1 rgmanager: [10327]: Shutting down >>>>>>>>>>> Cluster Service Manager... >>>>>>>>>>> Sep 29 10:39:16 cvtst1 rgmanager: [10327]: Cluster >>>>>>>>>>> Service >>>>>>>>>>> Manager is stopped. >>>>>>>>>>> Sep 29 10:39:16 cvtst1 clurgmgrd[10380]: Resource Group >>>>>>>>>>> Manager Starting >>>>>>>>>>> Sep 29 10:39:52 cvtst1 clurgmgrd[10380]: Member 1 >>>>>>>>>>> shutting >>>>>>>>>>> down >>>>>>>>>>> >>>>>>>>>>> I do not know what the last line means. >>>>>>>>>>> >>>>>>>>>>> rgmanager version I am running is: >>>>>>>>>>> rgmanager-2.0.52-1.el5.centos >>>>>>>>>>> >>>>>>>>>>> I don't what has gone wrong. >>>>>>>>>>> >>>>>>>>>>> Thanks >>>>>>>>>>> Paras. 
>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On Mon, Sep 28, 2009 at 6:41 PM, brem belguebli >>>>>>>>>>> wrote: >>>>>>>>>>> >>>>>>>>>>>> you mean it stopped successfully on all the nodes but it is >>>>>>>>>>>> failing >>>>>>>>>>>> to >>>>>>>>>>>> start only on node cvtst1 ? >>>>>>>>>>>> >>>>>>>>>>>> look at the following page to make rgmanager more verbose. It >>>>>>>>>>>> 'll >>>>>>>>>>>> help debug.... >>>>>>>>>>>> >>>>>>>>>>>> http://sources.redhat.com/cluster/wiki/RGManager >>>>>>>>>>>> >>>>>>>>>>>> at Logging Configuration section >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> 2009/9/29 Paras pradhan : >>>>>>>>>>>> >>>>>>>>>>>>> Brem, >>>>>>>>>>>>> >>>>>>>>>>>>> When I try to restart rgmanager on all the nodes, this time i do >>>>>>>>>>>>> not >>>>>>>>>>>>> see rgmanager running on the first node. But I do see on other 2 >>>>>>>>>>>>> nodes. >>>>>>>>>>>>> >>>>>>>>>>>>> Log on the first node: >>>>>>>>>>>>> >>>>>>>>>>>>> Sep 28 18:13:58 cvtst1 clurgmgrd[24099]: Resource Group >>>>>>>>>>>>> Manager Starting >>>>>>>>>>>>> Sep 28 18:17:29 cvtst1 rgmanager: [24627]: Shutting >>>>>>>>>>>>> down >>>>>>>>>>>>> Cluster Service Manager... >>>>>>>>>>>>> Sep 28 18:17:29 cvtst1 clurgmgrd[24099]: Shutting down >>>>>>>>>>>>> Sep 28 18:17:39 cvtst1 clurgmgrd[24099]: Shutdown >>>>>>>>>>>>> complete, >>>>>>>>>>>>> exiting >>>>>>>>>>>>> Sep 28 18:17:39 cvtst1 rgmanager: [24627]: Cluster >>>>>>>>>>>>> Service >>>>>>>>>>>>> Manager is stopped. >>>>>>>>>>>>> Sep 28 18:17:40 cvtst1 clurgmgrd[24679]: Resource Group >>>>>>>>>>>>> Manager Starting >>>>>>>>>>>>> >>>>>>>>>>>>> - >>>>>>>>>>>>> It seems service is running , but I do not see rgmanger running >>>>>>>>>>>>> using clustat >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> Don't know what is going on. >>>>>>>>>>>>> >>>>>>>>>>>>> Thanks >>>>>>>>>>>>> Paras. >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> On Mon, Sep 28, 2009 at 5:46 PM, brem belguebli >>>>>>>>>>>>> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> Paras, >>>>>>>>>>>>>> >>>>>>>>>>>>>> Another thing, it would have been more interesting to have a >>>>>>>>>>>>>> start >>>>>>>>>>>>>> DEBUG not a stop. >>>>>>>>>>>>>> >>>>>>>>>>>>>> That's why I was asking you to first stop the vm manually on >>>>>>>>>>>>>> all >>>>>>>>>>>>>> your >>>>>>>>>>>>>> nodes, stop eventually rgmanager on all the nodes to reset the >>>>>>>>>>>>>> potential wrong states you may have, restart rgmanager. >>>>>>>>>>>>>> >>>>>>>>>>>>>> If your VM is configured to autostart, this will make it start. >>>>>>>>>>>>>> >>>>>>>>>>>>>> It should normally fail (as it does now). Send out your newly >>>>>>>>>>>>>> created >>>>>>>>>>>>>> DEBUG file. >>>>>>>>>>>>>> >>>>>>>>>>>>>> 2009/9/29 brem belguebli : >>>>>>>>>>>>>> >>>>>>>>>>>>>>> Hi Paras, >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> I don't know the xen/cluster combination well, but if I do >>>>>>>>>>>>>>> remember >>>>>>>>>>>>>>> well, I think I've read somewhere that when using xen you have >>>>>>>>>>>>>>> to >>>>>>>>>>>>>>> declare the use_virsh=0 key in the VM definition in the >>>>>>>>>>>>>>> cluster.conf. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> This would make rgmanager use xm commands instead of virsh >>>>>>>>>>>>>>> The DEBUG output shows clearly that you are using virsh to >>>>>>>>>>>>>>> manage >>>>>>>>>>>>>>> your >>>>>>>>>>>>>>> VM instead of xm commands. >>>>>>>>>>>>>>> Check out the RH docs about virtualization >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> I'm not a 100% sure about that, I may be completely wrong. 
>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Brem >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 2009/9/28 Paras pradhan : >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> The only thing I noticed is the message after stopping the vm >>>>>>>>>>>>>>>> using xm >>>>>>>>>>>>>>>> in all nodes and starting using clusvcadm is >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> "Virtual machine guest1 is blocked" >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> The whole DEBUG file is attached. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Thanks >>>>>>>>>>>>>>>> Paras. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On Fri, Sep 25, 2009 at 5:53 PM, brem belguebli >>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> There's a problem with the script that is called by >>>>>>>>>>>>>>>>> rgmanager to >>>>>>>>>>>>>>>>> start >>>>>>>>>>>>>>>>> the VM, I don't know what causes it >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> May be you should try something like : >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> 1) stop the VM on all nodes with xm commands >>>>>>>>>>>>>>>>> 2) edit the /usr/share/cluster/vm.sh script and add the >>>>>>>>>>>>>>>>> following >>>>>>>>>>>>>>>>> lines (after the #!/bin/bash ): >>>>>>>>>>>>>>>>> exec >/tmp/DEBUG 2>&1 >>>>>>>>>>>>>>>>> set -x >>>>>>>>>>>>>>>>> 3) start the VM with clusvcadm -e vm:guest1 >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> It should fail as it did before. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> edit the the /tmp/DEBUG file and you will be able to see >>>>>>>>>>>>>>>>> where >>>>>>>>>>>>>>>>> it >>>>>>>>>>>>>>>>> fails (it may generate a lot of debug) >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> 4) remove the debug lines from /usr/share/cluster/vm.sh >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Post the DEBUG file if you're not able to see where it >>>>>>>>>>>>>>>>> fails. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Brem >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> 2009/9/26 Paras pradhan : >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> No I am not manually starting not using automatic init >>>>>>>>>>>>>>>>>> scripts. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> I started the vm using: clusvcadm -e vm:guest1 >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> I have just stopped using clusvcadm -s vm:guest1. For few >>>>>>>>>>>>>>>>>> seconds it >>>>>>>>>>>>>>>>>> says guest1 started . But after a while I can see the >>>>>>>>>>>>>>>>>> guest1 on >>>>>>>>>>>>>>>>>> all >>>>>>>>>>>>>>>>>> three nodes. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> clustat says: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Service Name >>>>>>>>>>>>>>>>>> Owner >>>>>>>>>>>>>>>>>> (Last) >>>>>>>>>>>>>>>>>> State >>>>>>>>>>>>>>>>>> ------- ---- >>>>>>>>>>>>>>>>>> ----- >>>>>>>>>>>>>>>>>> ------ >>>>>>>>>>>>>>>>>> ----- >>>>>>>>>>>>>>>>>> vm:guest1 >>>>>>>>>>>>>>>>>> (none) >>>>>>>>>>>>>>>>>> stopped >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> But I can see the vm from xm li. 
>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> This is what I can see from the log: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Sep 25 17:19:01 cvtst1 clurgmgrd[4298]: start on >>>>>>>>>>>>>>>>>> vm >>>>>>>>>>>>>>>>>> "guest1" >>>>>>>>>>>>>>>>>> returned 1 (generic error) >>>>>>>>>>>>>>>>>> Sep 25 17:19:01 cvtst1 clurgmgrd[4298]: #68: >>>>>>>>>>>>>>>>>> Failed >>>>>>>>>>>>>>>>>> to start >>>>>>>>>>>>>>>>>> vm:guest1; return value: 1 >>>>>>>>>>>>>>>>>> Sep 25 17:19:01 cvtst1 clurgmgrd[4298]: Stopping >>>>>>>>>>>>>>>>>> service vm:guest1 >>>>>>>>>>>>>>>>>> Sep 25 17:19:02 cvtst1 clurgmgrd[4298]: Service >>>>>>>>>>>>>>>>>> vm:guest1 is >>>>>>>>>>>>>>>>>> recovering >>>>>>>>>>>>>>>>>> Sep 25 17:19:15 cvtst1 clurgmgrd[4298]: Recovering >>>>>>>>>>>>>>>>>> failed >>>>>>>>>>>>>>>>>> service vm:guest1 >>>>>>>>>>>>>>>>>> Sep 25 17:19:16 cvtst1 clurgmgrd[4298]: start on >>>>>>>>>>>>>>>>>> vm >>>>>>>>>>>>>>>>>> "guest1" >>>>>>>>>>>>>>>>>> returned 1 (generic error) >>>>>>>>>>>>>>>>>> Sep 25 17:19:16 cvtst1 clurgmgrd[4298]: #68: >>>>>>>>>>>>>>>>>> Failed >>>>>>>>>>>>>>>>>> to start >>>>>>>>>>>>>>>>>> vm:guest1; return value: 1 >>>>>>>>>>>>>>>>>> Sep 25 17:19:16 cvtst1 clurgmgrd[4298]: Stopping >>>>>>>>>>>>>>>>>> service vm:guest1 >>>>>>>>>>>>>>>>>> Sep 25 17:19:17 cvtst1 clurgmgrd[4298]: Service >>>>>>>>>>>>>>>>>> vm:guest1 is >>>>>>>>>>>>>>>>>> recovering >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Paras. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> On Fri, Sep 25, 2009 at 5:07 PM, brem belguebli >>>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Have you started your VM via rgmanager (clusvcadm -e >>>>>>>>>>>>>>>>>>> vm:guest1) or >>>>>>>>>>>>>>>>>>> using xm commands out of cluster control (or maybe a thru >>>>>>>>>>>>>>>>>>> an >>>>>>>>>>>>>>>>>>> automatic init script ?) >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> When clustered, you should never be starting services >>>>>>>>>>>>>>>>>>> (manually or >>>>>>>>>>>>>>>>>>> thru automatic init script) out of cluster control >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> The thing would be to stop your vm on all the nodes with >>>>>>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>>>>> adequate >>>>>>>>>>>>>>>>>>> xm command (not using xen myself) and try to start it with >>>>>>>>>>>>>>>>>>> clusvcadm. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Then see if it is started on all nodes (send clustat >>>>>>>>>>>>>>>>>>> output) >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> 2009/9/25 Paras pradhan : >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Ok. Please see below. my vm is running on all nodes >>>>>>>>>>>>>>>>>>>> though >>>>>>>>>>>>>>>>>>>> clustat >>>>>>>>>>>>>>>>>>>> says it is stopped. 
>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>>> [root at cvtst1 ~]# clustat >>>>>>>>>>>>>>>>>>>> Cluster Status for test @ Fri Sep 25 16:52:34 2009 >>>>>>>>>>>>>>>>>>>> Member Status: Quorate >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Member Name >>>>>>>>>>>>>>>>>>>> ID Status >>>>>>>>>>>>>>>>>>>> ------ ---- >>>>>>>>>>>>>>>>>>>> ---- ------ >>>>>>>>>>>>>>>>>>>> cvtst2 >>>>>>>>>>>>>>>>>>>> 1 >>>>>>>>>>>>>>>>>>>> Online, rgmanager >>>>>>>>>>>>>>>>>>>> cvtst1 >>>>>>>>>>>>>>>>>>>> 2 >>>>>>>>>>>>>>>>>>>> Online, >>>>>>>>>>>>>>>>>>>> Local, rgmanager >>>>>>>>>>>>>>>>>>>> cvtst3 >>>>>>>>>>>>>>>>>>>> 3 >>>>>>>>>>>>>>>>>>>> Online, rgmanager >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Service Name >>>>>>>>>>>>>>>>>>>> Owner (Last) >>>>>>>>>>>>>>>>>>>> State >>>>>>>>>>>>>>>>>>>> ------- ---- >>>>>>>>>>>>>>>>>>>> ----- ------ >>>>>>>>>>>>>>>>>>>> ----- >>>>>>>>>>>>>>>>>>>> vm:guest1 >>>>>>>>>>>>>>>>>>>> (none) >>>>>>>>>>>>>>>>>>>> stopped >>>>>>>>>>>>>>>>>>>> [root at cvtst1 ~]# >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> --- >>>>>>>>>>>>>>>>>>>> o/p of xm li on cvtst1 >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>>> [root at cvtst1 ~]# xm li >>>>>>>>>>>>>>>>>>>> Name ID Mem(MiB) >>>>>>>>>>>>>>>>>>>> VCPUs >>>>>>>>>>>>>>>>>>>> State Time(s) >>>>>>>>>>>>>>>>>>>> Domain-0 0 3470 >>>>>>>>>>>>>>>>>>>> 2 >>>>>>>>>>>>>>>>>>>> r----- 28939.4 >>>>>>>>>>>>>>>>>>>> guest1 7 511 >>>>>>>>>>>>>>>>>>>> 1 >>>>>>>>>>>>>>>>>>>> -b---- 7727.8 >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> o/p of xm li on cvtst2 >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>>> [root at cvtst2 ~]# xm li >>>>>>>>>>>>>>>>>>>> Name ID Mem(MiB) >>>>>>>>>>>>>>>>>>>> VCPUs >>>>>>>>>>>>>>>>>>>> State Time(s) >>>>>>>>>>>>>>>>>>>> Domain-0 0 3470 >>>>>>>>>>>>>>>>>>>> 2 >>>>>>>>>>>>>>>>>>>> r----- 31558.9 >>>>>>>>>>>>>>>>>>>> guest1 21 511 >>>>>>>>>>>>>>>>>>>> 1 >>>>>>>>>>>>>>>>>>>> -b---- 7558.2 >>>>>>>>>>>>>>>>>>>> --- >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Thanks >>>>>>>>>>>>>>>>>>>> Paras. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> On Fri, Sep 25, 2009 at 4:22 PM, brem belguebli >>>>>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> It looks like no. >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> can you send an output of clustat of when the VM is >>>>>>>>>>>>>>>>>>>>> running >>>>>>>>>>>>>>>>>>>>> on >>>>>>>>>>>>>>>>>>>>> multiple nodes at the same time? >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> And by the way, another one after having stopped >>>>>>>>>>>>>>>>>>>>> (clusvcadm >>>>>>>>>>>>>>>>>>>>> -s vm:guest1) ? >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> 2009/9/25 Paras pradhan : >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> Anyone having issue as mine? Virtual machine service is >>>>>>>>>>>>>>>>>>>>>> not >>>>>>>>>>>>>>>>>>>>>> being >>>>>>>>>>>>>>>>>>>>>> properly handled by the cluster. >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> Thanks >>>>>>>>>>>>>>>>>>>>>> Paras. >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> On Mon, Sep 21, 2009 at 9:55 AM, Paras pradhan >>>>>>>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> Ok.. 
here is my cluster.conf file >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>>>>>> [root at cvtst1 cluster]# more cluster.conf >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> post_join_delay="3"/> >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> votes="1"> >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> votes="1"> >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> votes="1"> >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> nofailback="0" ordered="1" restricted="0"> >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> name="cvtst2" priority="3"/> >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> name="cvtst1" priority="1"/> >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> name="cvtst3" priority="2"/> >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> exclusive="0" max_restarts="0" >>>>>>>>>>>>>>>>>>>>>>> name="guest1" path="/vms" recovery="r >>>>>>>>>>>>>>>>>>>>>>> estart" restart_expire_time="0"/> >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> [root at cvtst1 cluster]# >>>>>>>>>>>>>>>>>>>>>>> ------ >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> Thanks! >>>>>>>>>>>>>>>>>>>>>>> Paras. >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> On Sun, Sep 20, 2009 at 9:44 AM, Volker Dormeyer >>>>>>>>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> On Fri, Sep 18, 2009 at 05:08:57PM -0500, >>>>>>>>>>>>>>>>>>>>>>>> Paras pradhan wrote: >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> I am using cluster suite for HA of xen virtual >>>>>>>>>>>>>>>>>>>>>>>>> machines. >>>>>>>>>>>>>>>>>>>>>>>>> Now I am >>>>>>>>>>>>>>>>>>>>>>>>> having another problem. When I start the my xen vm >>>>>>>>>>>>>>>>>>>>>>>>> in >>>>>>>>>>>>>>>>>>>>>>>>> one node, it >>>>>>>>>>>>>>>>>>>>>>>>> also starts on other nodes. Which daemon controls >>>>>>>>>>>>>>>>>>>>>>>>> this? >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> This is usually done bei clurgmgrd (which is part of >>>>>>>>>>>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>>>>>>>>>> rgmanager >>>>>>>>>>>>>>>>>>>>>>>> package). To me, this sounds like a configuration >>>>>>>>>>>>>>>>>>>>>>>> problem. Maybe, >>>>>>>>>>>>>>>>>>>>>>>> you can post your cluster.conf? 
>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> Regards, >>>>>>>>>>>>>>>>>>>>>>>> Volker >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>>>>>>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>>>>>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>>>>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>>>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>>>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>> -- >>>>>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>> -- >>>>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>> -- >>>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>> -- >>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> -- >>>>>>>>>> Linux-cluster mailing list >>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>> >>>>>>>>>> >>>>>>>>> -- >>>>>>>>> Linux-cluster mailing list >>>>>>>>> Linux-cluster at redhat.com >>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>> >>>>>>>>> >>>>>>>> -- >>>>>>>> Linux-cluster mailing list >>>>>>>> Linux-cluster at redhat.com >>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>> >>>>>>>> >>>>>>> -- >>>>>>> Linux-cluster mailing list >>>>>>> Linux-cluster at redhat.com >>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>> >>>>>> -- >>>>>> Linux-cluster mailing list >>>>>> Linux-cluster at redhat.com >>>>>> 
https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>> >>>>> -- >>>>> Linux-cluster mailing list >>>>> Linux-cluster at redhat.com >>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>> >>>> -- >>>> Linux-cluster mailing list >>>> Linux-cluster at redhat.com >>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>> >>> -- >>> Linux-cluster mailing list >>> Linux-cluster at redhat.com >>> https://www.redhat.com/mailman/listinfo/linux-cluster >>> >> -- >> - Daniela Anzellotti ------------------------------------ >> INFN Roma - tel.: +39.06.49914282 - fax: +39.06.490354 >> e-mail: daniela.anzellotti at roma1.infn.it >> --------------------------------------------------------- >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- - Daniela Anzellotti ------------------------------------ INFN Roma - tel.: +39.06.49914282 - fax: +39.06.490354 e-mail: daniela.anzellotti at roma1.infn.it --------------------------------------------------------- From pradhanparas at gmail.com Tue Oct 6 16:20:03 2009 From: pradhanparas at gmail.com (Paras pradhan) Date: Tue, 6 Oct 2009 11:20:03 -0500 Subject: [Linux-cluster] openais issue In-Reply-To: <4ACB66BA.8040002@roma1.infn.it> References: <8b711df40909161052i41f9c007na85cf2ef2919ae0f@mail.gmail.com> <29ae894c0909291344l49a2a810t33582eb6c3932810@mail.gmail.com> <8b711df40909291354w55f92097wcdef691d0b239dee@mail.gmail.com> <4AC2986F.8050100@io-consulting.net> <8b711df40909300749i2af6a711v2f866c55a046a388@mail.gmail.com> <29ae894c0909300802oa8d72c5k892115a3e2f67db9@mail.gmail.com> <8b711df40909300811q1724aa68hfea589bbb32b4ce5@mail.gmail.com> <4AC9E445.3050702@roma1.infn.it> <8b711df40910060831n6b6c7a6aw27761c89403c51e1@mail.gmail.com> <4ACB66BA.8040002@roma1.infn.it> Message-ID: <8b711df40910060920q44d502a6j34e3ca6cfeb63128@mail.gmail.com> Adding use_virsh=0 works great. Now I do not have my vm starting at all the nodes. This is a good fix. Thanks .. The only problem left is I do not see rgmanager running on my node1 and the clustat of node2 and node3 is reporting vm as migrating o/p Service Name Owner (Last) State ------- ---- ----- ------ ----- vm:guest1 cvtst1 migrating [root at cvtst3 vms]# Thanks Paras. On Tue, Oct 6, 2009 at 10:48 AM, Daniela Anzellotti wrote: > Hi Paras, > > yes. At least it looks so... > > We have a cluster of two nodes + a quorum disk (it's not configured as a > "two-node cluster") > > They are running Scientific Linux 5.x, kernel 2.6.18-128.7.1.el5xen and > > ?openais-0.80.6-8.el5.x86_64 > ?cman-2.0.115-1.el5.x86_64 > ?rgmanager-2.0.52-1.el5.x86_64 > > The XEN VMs access the disk as simple block devices. > Disks are on a SAN, configured with Clustered LVM. > > ?xen-3.0.3-94.el5_4.1.x86_64 > ?xen-libs-3.0.3-94.el5_4.1.x86_64 > > VM configuration files are as the following > > ?name = "www1" > ?uuid = "3bd3e910-23c0-97ee-55ab-086260ef1e53" > ?memory = 1024 > ?maxmem = 1024 > ?vcpus = 1 > ?bootloader = "/usr/bin/pygrub" > ?vfb = [ "type=vnc,vncunused=1,keymap=en-us" ] > ?disk = [ "phy:/dev/vg_cluster/www1.disk,xvda,w", \ > ?"phy:/dev/vg_cluster/www1.swap,xvdb,w" ] > ?vif = [ "mac=00:16:3e:da:00:07,bridge=xenbr1" ] > ?on_poweroff = "destroy" > ?on_reboot = "restart" > ?on_crash = "restart" > ?extra = "xencons=tty0 console=tty0" > > > I changed in /etc/cluster/cluster.conf all the VM directive from > > ? 
?migrate="live" name="www1" path="/etc/xen" recovery="restart"/> > > to > > ? ?migrate="live" name="www1" path="/etc/xen" recovery="restart"/> > > > Rebooted the cluster nodes and it started working again... > > As i said I hope I'll not have any other bad surprise (I tested a VM > migration and it is working too), but at least cluster it's working now (it > was not able to start a VM, before)! > > Ciao > Daniela > > > Paras pradhan wrote: >> >> So you mean your cluster is running fine with the CMAN >> cman-2.0.115-1.el5.x86_64 ? >> >> Which version of openais are you running? >> >> Thanks >> Paras. >> >> >> On Mon, Oct 5, 2009 at 7:19 AM, Daniela Anzellotti >> wrote: >>> >>> Hi all, >>> >>> I had a problem similar to Paras's one today: yum updated the following >>> rpms >>> last week and today (I had to restart the cluster) the cluster was not >>> able >>> to start vm: services. >>> >>> Oct 02 05:31:05 Updated: openais-0.80.6-8.el5.x86_64 >>> Oct 02 05:31:07 Updated: cman-2.0.115-1.el5.x86_64 >>> Oct 02 05:31:10 Updated: rgmanager-2.0.52-1.el5.x86_64 >>> >>> Oct 03 04:03:12 Updated: xen-libs-3.0.3-94.el5_4.1.x86_64 >>> Oct 03 04:03:12 Updated: xen-libs-3.0.3-94.el5_4.1.i386 >>> Oct 03 04:03:16 Updated: xen-3.0.3-94.el5_4.1.x86_64 >>> >>> >>> So, after checked the vm.sh script, I add the declaration use_virsh="0" >>> in >>> the VM definition in the cluster.conf (as suggested by Brem, thanks!) and >>> everything is now working again. >>> >>> >>> BTW I didn't understand if the problem was caused by the new XEN version >>> or >>> the new openais one, thus I disabled automatic updates for both. >>> >>> I hope I'll not have any other bad surprise... >>> >>> Thank you, >>> cheers, >>> Daniela >>> >>> >>> Paras pradhan wrote: >>>> >>>> Yes this is very strange. I don't know what to do now. May be re >>>> create the cluster? But not a good solution actually. >>>> >>>> Packages : >>>> >>>> Kernel: kernel-xen-2.6.18-164.el5 >>>> OS: Full updated of CentOS 5.3 except CMAN downgraded to >>>> cman-2.0.98-1.el5 >>>> >>>> Other packages related to cluster suite: >>>> >>>> rgmanager-2.0.52-1.el5.centos >>>> cman-2.0.98-1.el5 >>>> xen-3.0.3-80.el5_3.3 >>>> xen-libs-3.0.3-80.el5_3.3 >>>> kmod-gfs-xen-0.1.31-3.el5_3.1 >>>> kmod-gfs-xen-0.1.31-3.el5_3.1 >>>> kmod-gfs-0.1.31-3.el5_3.1 >>>> gfs-utils-0.1.18-1.el5 >>>> gfs2-utils-0.1.62-1.el5 >>>> lvm2-2.02.40-6.el5 >>>> lvm2-cluster-2.02.40-7.el5 >>>> openais-0.80.3-22.el5_3.9 >>>> >>>> Thanks! >>>> Paras. >>>> >>>> >>>> >>>> >>>> On Wed, Sep 30, 2009 at 10:02 AM, brem belguebli >>>> wrote: >>>>> >>>>> Hi Paras, >>>>> >>>>> Your cluster.conf file seems correct. If it is not a ntp issue, I >>>>> don't see anything except a bug that causes this, or some prerequisite >>>>> that is not respected. >>>>> >>>>> May be you could post the versions (os, kernel, packages etc...) you >>>>> are using, someone may have hit the same issue with your versions. >>>>> >>>>> Brem >>>>> >>>>> 2009/9/30, Paras pradhan : >>>>>> >>>>>> All of the nodes are synced with ntp server. So this is not the case >>>>>> with me. >>>>>> >>>>>> Thanks >>>>>> Paras. >>>>>> >>>>>> On Tue, Sep 29, 2009 at 6:29 PM, Johannes Ru?ek >>>>>> wrote: >>>>>>> >>>>>>> make sure the time on the nodes is in sync, apparently when a node >>>>>>> has >>>>>>> too >>>>>>> much offset, you won't see rgmanager (even though the process is >>>>>>> running). >>>>>>> this happened today and setting the time fixed it for me. afaicr >>>>>>> there >>>>>>> was >>>>>>> no sign of this in the logs though. 
>>>>>>> johannes >>>>>>> >>>>>>> Paras pradhan schrieb: >>>>>>>> >>>>>>>> I don't see rgmanager . >>>>>>>> >>>>>>>> Here is the o/p from clustat >>>>>>>> >>>>>>>> [root at cvtst1 cluster]# clustat >>>>>>>> Cluster Status for test @ Tue Sep 29 15:53:33 2009 >>>>>>>> Member Status: Quorate >>>>>>>> >>>>>>>> ?Member Name ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ID >>>>>>>> Status >>>>>>>> ?------ ---- >>>>>>>> ---- >>>>>>>> ------ >>>>>>>> ?cvtst2 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?1 Online >>>>>>>> ?cvtst1 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? 2 >>>>>>>> Online, >>>>>>>> Local >>>>>>>> ?cvtst3 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? 3 Online >>>>>>>> >>>>>>>> >>>>>>>> Thanks >>>>>>>> Paras. >>>>>>>> >>>>>>>> On Tue, Sep 29, 2009 at 3:44 PM, brem belguebli >>>>>>>> wrote: >>>>>>>> >>>>>>>>> It looks correct, rgmanager seems to start on all nodes >>>>>>>>> >>>>>>>>> what gives you clustat ? >>>>>>>>> >>>>>>>>> If rgmanager doesn't show, check out the logs something may have >>>>>>>>> gone >>>>>>>>> wrong. >>>>>>>>> >>>>>>>>> >>>>>>>>> 2009/9/29 Paras pradhan : >>>>>>>>> >>>>>>>>>> Change to 7 and i got this log >>>>>>>>>> >>>>>>>>>> Sep 29 15:33:50 cvtst1 rgmanager: [23295]: Shutting down >>>>>>>>>> Cluster Service Manager... >>>>>>>>>> Sep 29 15:33:50 cvtst1 clurgmgrd[22869]: Shutting down >>>>>>>>>> Sep 29 15:33:50 cvtst1 clurgmgrd[22869]: Shutting down >>>>>>>>>> Sep 29 15:33:50 cvtst1 clurgmgrd[22869]: Shutdown >>>>>>>>>> complete, >>>>>>>>>> exiting >>>>>>>>>> Sep 29 15:33:50 cvtst1 rgmanager: [23295]: Cluster >>>>>>>>>> Service >>>>>>>>>> Manager is stopped. >>>>>>>>>> Sep 29 15:33:51 cvtst1 clurgmgrd[23324]: Resource Group >>>>>>>>>> Manager Starting >>>>>>>>>> Sep 29 15:33:51 cvtst1 clurgmgrd[23324]: Loading Service >>>>>>>>>> Data >>>>>>>>>> Sep 29 15:33:51 cvtst1 clurgmgrd[23324]: Loading Resource >>>>>>>>>> Rules >>>>>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: 21 rules loaded >>>>>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: Building Resource >>>>>>>>>> Trees >>>>>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: 0 resources >>>>>>>>>> defined >>>>>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: Loading Failover >>>>>>>>>> Domains >>>>>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: 1 domains defined >>>>>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: 1 events defined >>>>>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: Initializing >>>>>>>>>> Services >>>>>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: Services >>>>>>>>>> Initialized >>>>>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: Event: Port >>>>>>>>>> Opened >>>>>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: State change: >>>>>>>>>> Local >>>>>>>>>> UP >>>>>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: State change: >>>>>>>>>> cvtst2 >>>>>>>>>> UP >>>>>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: State change: >>>>>>>>>> cvtst3 >>>>>>>>>> UP >>>>>>>>>> Sep 29 15:33:57 cvtst1 clurgmgrd[23324]: Event (1:2:1) >>>>>>>>>> Processed >>>>>>>>>> Sep 29 15:33:57 cvtst1 clurgmgrd[23324]: Event (0:1:1) >>>>>>>>>> Processed >>>>>>>>>> Sep 29 15:33:57 cvtst1 clurgmgrd[23324]: Event (0:3:1) >>>>>>>>>> Processed >>>>>>>>>> Sep 29 15:34:02 cvtst1 clurgmgrd[23324]: 3 events >>>>>>>>>> processed >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Anything unusual here? >>>>>>>>>> >>>>>>>>>> Paras. >>>>>>>>>> >>>>>>>>>> On Tue, Sep 29, 2009 at 11:51 AM, brem belguebli >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>>> I use log_level=7 to have more debugging info. >>>>>>>>>>> >>>>>>>>>>> It seems 4 is not enough. 
>>>>>>>>>>> >>>>>>>>>>> Brem >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> 2009/9/29, Paras pradhan : >>>>>>>>>>> >>>>>>>>>>>> Withe log_level of 3 I got only this >>>>>>>>>>>> >>>>>>>>>>>> Sep 29 10:31:31 cvtst1 rgmanager: [7170]: Shutting down >>>>>>>>>>>> Cluster Service Manager... >>>>>>>>>>>> Sep 29 10:31:31 cvtst1 clurgmgrd[6673]: Shutting down >>>>>>>>>>>> Sep 29 10:31:41 cvtst1 clurgmgrd[6673]: Shutdown >>>>>>>>>>>> complete, >>>>>>>>>>>> exiting >>>>>>>>>>>> Sep 29 10:31:41 cvtst1 rgmanager: [7170]: Cluster >>>>>>>>>>>> Service >>>>>>>>>>>> Manager is stopped. >>>>>>>>>>>> Sep 29 10:31:42 cvtst1 clurgmgrd[7224]: Resource Group >>>>>>>>>>>> Manager Starting >>>>>>>>>>>> Sep 29 10:39:06 cvtst1 rgmanager: [10327]: Shutting >>>>>>>>>>>> down >>>>>>>>>>>> Cluster Service Manager... >>>>>>>>>>>> Sep 29 10:39:16 cvtst1 rgmanager: [10327]: Cluster >>>>>>>>>>>> Service >>>>>>>>>>>> Manager is stopped. >>>>>>>>>>>> Sep 29 10:39:16 cvtst1 clurgmgrd[10380]: Resource Group >>>>>>>>>>>> Manager Starting >>>>>>>>>>>> Sep 29 10:39:52 cvtst1 clurgmgrd[10380]: Member 1 >>>>>>>>>>>> shutting >>>>>>>>>>>> down >>>>>>>>>>>> >>>>>>>>>>>> I do not know what the last line means. >>>>>>>>>>>> >>>>>>>>>>>> rgmanager version I am running is: >>>>>>>>>>>> rgmanager-2.0.52-1.el5.centos >>>>>>>>>>>> >>>>>>>>>>>> I don't what has gone wrong. >>>>>>>>>>>> >>>>>>>>>>>> Thanks >>>>>>>>>>>> Paras. >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> On Mon, Sep 28, 2009 at 6:41 PM, brem belguebli >>>>>>>>>>>> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> you mean it stopped successfully on all the nodes but it is >>>>>>>>>>>>> failing >>>>>>>>>>>>> to >>>>>>>>>>>>> start only on node cvtst1 ? >>>>>>>>>>>>> >>>>>>>>>>>>> look at the following page ?to make rgmanager more verbose. It >>>>>>>>>>>>> 'll >>>>>>>>>>>>> help debug.... >>>>>>>>>>>>> >>>>>>>>>>>>> http://sources.redhat.com/cluster/wiki/RGManager >>>>>>>>>>>>> >>>>>>>>>>>>> at Logging Configuration section >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> 2009/9/29 Paras pradhan : >>>>>>>>>>>>> >>>>>>>>>>>>>> Brem, >>>>>>>>>>>>>> >>>>>>>>>>>>>> When I try to restart rgmanager on all the nodes, this time i >>>>>>>>>>>>>> do >>>>>>>>>>>>>> not >>>>>>>>>>>>>> see rgmanager running on the first node. But I do see on other >>>>>>>>>>>>>> 2 >>>>>>>>>>>>>> nodes. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Log on the first node: >>>>>>>>>>>>>> >>>>>>>>>>>>>> Sep 28 18:13:58 cvtst1 clurgmgrd[24099]: Resource >>>>>>>>>>>>>> Group >>>>>>>>>>>>>> Manager Starting >>>>>>>>>>>>>> Sep 28 18:17:29 cvtst1 rgmanager: [24627]: Shutting >>>>>>>>>>>>>> down >>>>>>>>>>>>>> Cluster Service Manager... >>>>>>>>>>>>>> Sep 28 18:17:29 cvtst1 clurgmgrd[24099]: Shutting >>>>>>>>>>>>>> down >>>>>>>>>>>>>> Sep 28 18:17:39 cvtst1 clurgmgrd[24099]: Shutdown >>>>>>>>>>>>>> complete, >>>>>>>>>>>>>> exiting >>>>>>>>>>>>>> Sep 28 18:17:39 cvtst1 rgmanager: [24627]: Cluster >>>>>>>>>>>>>> Service >>>>>>>>>>>>>> Manager is stopped. >>>>>>>>>>>>>> Sep 28 18:17:40 cvtst1 clurgmgrd[24679]: Resource >>>>>>>>>>>>>> Group >>>>>>>>>>>>>> Manager Starting >>>>>>>>>>>>>> >>>>>>>>>>>>>> - >>>>>>>>>>>>>> It seems service is running , ?but I do not see rgmanger >>>>>>>>>>>>>> running >>>>>>>>>>>>>> using clustat >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> Don't know what is going on. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Thanks >>>>>>>>>>>>>> Paras. 
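For reference, the log level being discussed here is an attribute on the <rm> tag in
/etc/cluster/cluster.conf. A minimal sketch, assuming the rest of the <rm> section is
left exactly as Paras posted it (log_facility is optional; local4 is just an example):

  <rm log_level="7" log_facility="local4">
      ...existing failoverdomains and vm resources...
  </rm>

After editing, bump config_version on the <cluster> tag and push the file to the other
members (on RHEL/CentOS 5 something like ccs_tool update /etc/cluster/cluster.conf),
then watch /var/log/messages while retrying clusvcadm -e vm:guest1.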
>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Mon, Sep 28, 2009 at 5:46 PM, brem belguebli >>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> Paras, >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Another thing, it would have been more interesting to have a >>>>>>>>>>>>>>> start >>>>>>>>>>>>>>> DEBUG not a stop. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> That's why I was asking you to first stop the vm manually on >>>>>>>>>>>>>>> all >>>>>>>>>>>>>>> your >>>>>>>>>>>>>>> nodes, stop eventually rgmanager on all the nodes to reset >>>>>>>>>>>>>>> the >>>>>>>>>>>>>>> potential wrong states you may have, restart rgmanager. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> If your VM is configured to autostart, this will make it >>>>>>>>>>>>>>> start. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> It should normally fail (as it does now). Send out your newly >>>>>>>>>>>>>>> created >>>>>>>>>>>>>>> DEBUG file. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 2009/9/29 brem belguebli : >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Hi Paras, >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I don't know the xen/cluster combination well, but if I do >>>>>>>>>>>>>>>> remember >>>>>>>>>>>>>>>> well, I think I've read somewhere that when using xen you >>>>>>>>>>>>>>>> have >>>>>>>>>>>>>>>> to >>>>>>>>>>>>>>>> declare the use_virsh=0 key in the VM definition in the >>>>>>>>>>>>>>>> cluster.conf. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> This would make rgmanager use xm commands instead of virsh >>>>>>>>>>>>>>>> The DEBUG output shows clearly that you are using virsh to >>>>>>>>>>>>>>>> manage >>>>>>>>>>>>>>>> your >>>>>>>>>>>>>>>> VM instead of xm commands. >>>>>>>>>>>>>>>> Check out the RH docs about virtualization >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I'm not a 100% sure about that, I may be completely wrong. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Brem >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> 2009/9/28 Paras pradhan : >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> The only thing I noticed is the message after stopping the >>>>>>>>>>>>>>>>> vm >>>>>>>>>>>>>>>>> using xm >>>>>>>>>>>>>>>>> in all nodes and starting using clusvcadm is >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> "Virtual machine guest1 is blocked" >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> The whole DEBUG file is attached. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Thanks >>>>>>>>>>>>>>>>> Paras. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On Fri, Sep 25, 2009 at 5:53 PM, brem belguebli >>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> There's a problem with the script that is called by >>>>>>>>>>>>>>>>>> rgmanager to >>>>>>>>>>>>>>>>>> start >>>>>>>>>>>>>>>>>> the VM, I don't know what causes it >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> May be you should try something like : >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> 1) stop the VM on all nodes with xm commands >>>>>>>>>>>>>>>>>> 2) edit the /usr/share/cluster/vm.sh script and add the >>>>>>>>>>>>>>>>>> following >>>>>>>>>>>>>>>>>> lines (after the #!/bin/bash ): >>>>>>>>>>>>>>>>>> ?exec >/tmp/DEBUG 2>&1 >>>>>>>>>>>>>>>>>> ?set -x >>>>>>>>>>>>>>>>>> 3) start the VM with clusvcadm -e vm:guest1 >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> It should fail as it did before. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> edit the the /tmp/DEBUG file and you will be able to see >>>>>>>>>>>>>>>>>> where >>>>>>>>>>>>>>>>>> it >>>>>>>>>>>>>>>>>> fails (it may generate a lot of debug) >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> 4) remove the debug lines from /usr/share/cluster/vm.sh >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Post the DEBUG file if you're not able to see where it >>>>>>>>>>>>>>>>>> fails. 
>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Brem >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> 2009/9/26 Paras pradhan : >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> No I am not manually starting not using automatic init >>>>>>>>>>>>>>>>>>> scripts. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> I started the vm using: clusvcadm -e vm:guest1 >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> I have just stopped using clusvcadm -s vm:guest1. For few >>>>>>>>>>>>>>>>>>> seconds it >>>>>>>>>>>>>>>>>>> says guest1 started . But after a while I can see the >>>>>>>>>>>>>>>>>>> guest1 on >>>>>>>>>>>>>>>>>>> all >>>>>>>>>>>>>>>>>>> three nodes. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> clustat says: >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> ?Service Name >>>>>>>>>>>>>>>>>>> ?Owner >>>>>>>>>>>>>>>>>>> (Last) >>>>>>>>>>>>>>>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? State >>>>>>>>>>>>>>>>>>> ?------- ---- >>>>>>>>>>>>>>>>>>> ?----- >>>>>>>>>>>>>>>>>>> ------ >>>>>>>>>>>>>>>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ----- >>>>>>>>>>>>>>>>>>> ?vm:guest1 >>>>>>>>>>>>>>>>>>> (none) >>>>>>>>>>>>>>>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? stopped >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> But I can see the vm from xm li. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> This is what I can see from the log: >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Sep 25 17:19:01 cvtst1 clurgmgrd[4298]: start on >>>>>>>>>>>>>>>>>>> vm >>>>>>>>>>>>>>>>>>> "guest1" >>>>>>>>>>>>>>>>>>> returned 1 (generic error) >>>>>>>>>>>>>>>>>>> Sep 25 17:19:01 cvtst1 clurgmgrd[4298]: #68: >>>>>>>>>>>>>>>>>>> Failed >>>>>>>>>>>>>>>>>>> to start >>>>>>>>>>>>>>>>>>> vm:guest1; return value: 1 >>>>>>>>>>>>>>>>>>> Sep 25 17:19:01 cvtst1 clurgmgrd[4298]: Stopping >>>>>>>>>>>>>>>>>>> service vm:guest1 >>>>>>>>>>>>>>>>>>> Sep 25 17:19:02 cvtst1 clurgmgrd[4298]: Service >>>>>>>>>>>>>>>>>>> vm:guest1 is >>>>>>>>>>>>>>>>>>> recovering >>>>>>>>>>>>>>>>>>> Sep 25 17:19:15 cvtst1 clurgmgrd[4298]: >>>>>>>>>>>>>>>>>>> Recovering >>>>>>>>>>>>>>>>>>> failed >>>>>>>>>>>>>>>>>>> service vm:guest1 >>>>>>>>>>>>>>>>>>> Sep 25 17:19:16 cvtst1 clurgmgrd[4298]: start on >>>>>>>>>>>>>>>>>>> vm >>>>>>>>>>>>>>>>>>> "guest1" >>>>>>>>>>>>>>>>>>> returned 1 (generic error) >>>>>>>>>>>>>>>>>>> Sep 25 17:19:16 cvtst1 clurgmgrd[4298]: #68: >>>>>>>>>>>>>>>>>>> Failed >>>>>>>>>>>>>>>>>>> to start >>>>>>>>>>>>>>>>>>> vm:guest1; return value: 1 >>>>>>>>>>>>>>>>>>> Sep 25 17:19:16 cvtst1 clurgmgrd[4298]: Stopping >>>>>>>>>>>>>>>>>>> service vm:guest1 >>>>>>>>>>>>>>>>>>> Sep 25 17:19:17 cvtst1 clurgmgrd[4298]: Service >>>>>>>>>>>>>>>>>>> vm:guest1 is >>>>>>>>>>>>>>>>>>> recovering >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Paras. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> On Fri, Sep 25, 2009 at 5:07 PM, brem belguebli >>>>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Have you started ?your VM via rgmanager (clusvcadm -e >>>>>>>>>>>>>>>>>>>> vm:guest1) or >>>>>>>>>>>>>>>>>>>> using xm commands out of cluster control ?(or maybe a >>>>>>>>>>>>>>>>>>>> thru >>>>>>>>>>>>>>>>>>>> an >>>>>>>>>>>>>>>>>>>> automatic init script ?) 
>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> When clustered, you should never be starting services >>>>>>>>>>>>>>>>>>>> (manually or >>>>>>>>>>>>>>>>>>>> thru automatic init script) out of cluster control >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> The thing would be to stop your vm on all the nodes with >>>>>>>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>>>>>> adequate >>>>>>>>>>>>>>>>>>>> xm command (not using xen myself) and try to start it >>>>>>>>>>>>>>>>>>>> with >>>>>>>>>>>>>>>>>>>> clusvcadm. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Then see if it is started on all nodes (send clustat >>>>>>>>>>>>>>>>>>>> output) >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> 2009/9/25 Paras pradhan : >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> Ok. Please see below. my vm is running on all nodes >>>>>>>>>>>>>>>>>>>>> though >>>>>>>>>>>>>>>>>>>>> clustat >>>>>>>>>>>>>>>>>>>>> says it is stopped. >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>>>> [root at cvtst1 ~]# clustat >>>>>>>>>>>>>>>>>>>>> Cluster Status for test @ Fri Sep 25 16:52:34 2009 >>>>>>>>>>>>>>>>>>>>> Member Status: Quorate >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> ?Member Name >>>>>>>>>>>>>>>>>>>>> ? ID ? Status >>>>>>>>>>>>>>>>>>>>> ?------ ---- >>>>>>>>>>>>>>>>>>>>> ? ---- ------ >>>>>>>>>>>>>>>>>>>>> ?cvtst2 >>>>>>>>>>>>>>>>>>>>> ?1 >>>>>>>>>>>>>>>>>>>>> Online, rgmanager >>>>>>>>>>>>>>>>>>>>> ?cvtst1 >>>>>>>>>>>>>>>>>>>>> ?2 >>>>>>>>>>>>>>>>>>>>> Online, >>>>>>>>>>>>>>>>>>>>> Local, rgmanager >>>>>>>>>>>>>>>>>>>>> ?cvtst3 >>>>>>>>>>>>>>>>>>>>> ?3 >>>>>>>>>>>>>>>>>>>>> Online, rgmanager >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> ?Service Name >>>>>>>>>>>>>>>>>>>>> ?Owner (Last) >>>>>>>>>>>>>>>>>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? State >>>>>>>>>>>>>>>>>>>>> ?------- ---- >>>>>>>>>>>>>>>>>>>>> ?----- ------ >>>>>>>>>>>>>>>>>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ----- >>>>>>>>>>>>>>>>>>>>> ?vm:guest1 >>>>>>>>>>>>>>>>>>>>> (none) >>>>>>>>>>>>>>>>>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? stopped >>>>>>>>>>>>>>>>>>>>> [root at cvtst1 ~]# >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> --- >>>>>>>>>>>>>>>>>>>>> o/p of xm li on cvtst1 >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>>>> [root at cvtst1 ~]# xm li >>>>>>>>>>>>>>>>>>>>> Name ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?ID Mem(MiB) >>>>>>>>>>>>>>>>>>>>> VCPUs >>>>>>>>>>>>>>>>>>>>> State ? Time(s) >>>>>>>>>>>>>>>>>>>>> Domain-0 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? 0 ? ? 3470 >>>>>>>>>>>>>>>>>>>>> 2 >>>>>>>>>>>>>>>>>>>>> r----- ?28939.4 >>>>>>>>>>>>>>>>>>>>> guest1 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? 7 ? ? ?511 >>>>>>>>>>>>>>>>>>>>> 1 >>>>>>>>>>>>>>>>>>>>> -b---- ? 7727.8 >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> o/p of xm li on cvtst2 >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>>>> [root at cvtst2 ~]# xm li >>>>>>>>>>>>>>>>>>>>> Name ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?ID Mem(MiB) >>>>>>>>>>>>>>>>>>>>> VCPUs >>>>>>>>>>>>>>>>>>>>> State ? Time(s) >>>>>>>>>>>>>>>>>>>>> Domain-0 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? 0 ? ? 3470 >>>>>>>>>>>>>>>>>>>>> 2 >>>>>>>>>>>>>>>>>>>>> r----- ?31558.9 >>>>>>>>>>>>>>>>>>>>> guest1 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?21 ? ? ?511 >>>>>>>>>>>>>>>>>>>>> 1 >>>>>>>>>>>>>>>>>>>>> -b---- ? 7558.2 >>>>>>>>>>>>>>>>>>>>> --- >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> Thanks >>>>>>>>>>>>>>>>>>>>> Paras. 
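When clustat and the hypervisors disagree like this, the quickest sanity check is to
ask every dom0 directly which of them really has the domain. A rough sketch, assuming
root ssh works between the nodes and using the host/guest names from this thread:

  for n in cvtst1 cvtst2 cvtst3; do
      echo "== $n =="
      ssh $n "xm list guest1" 2>/dev/null   # prints nothing on nodes without the domain
  done

Any node that still lists guest1 while clustat shows the service as stopped is running
the domain outside rgmanager's control, and it needs an xm shutdown (or xm destroy)
there before clusvcadm -e vm:guest1 is tried again.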
>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> On Fri, Sep 25, 2009 at 4:22 PM, brem belguebli >>>>>>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> It looks like no. >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> can you send an output of clustat ?of when the VM is >>>>>>>>>>>>>>>>>>>>>> running >>>>>>>>>>>>>>>>>>>>>> on >>>>>>>>>>>>>>>>>>>>>> multiple nodes at the same time? >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> And by the way, another one after having stopped >>>>>>>>>>>>>>>>>>>>>> (clusvcadm >>>>>>>>>>>>>>>>>>>>>> -s vm:guest1) ? >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> 2009/9/25 Paras pradhan : >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> Anyone having issue as mine? Virtual machine service >>>>>>>>>>>>>>>>>>>>>>> is >>>>>>>>>>>>>>>>>>>>>>> not >>>>>>>>>>>>>>>>>>>>>>> being >>>>>>>>>>>>>>>>>>>>>>> properly handled by the cluster. >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> Thanks >>>>>>>>>>>>>>>>>>>>>>> Paras. >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> On Mon, Sep 21, 2009 at 9:55 AM, Paras pradhan >>>>>>>>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> Ok.. here is my cluster.conf file >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>>>>>>> [root at cvtst1 cluster]# more cluster.conf >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> name="test"> >>>>>>>>>>>>>>>>>>>>>>>> ? ? >>>>>>>>>>>>>>>>>>>>>>> post_fail_delay="0" >>>>>>>>>>>>>>>>>>>>>>>> post_join_delay="3"/> >>>>>>>>>>>>>>>>>>>>>>>> ? ? >>>>>>>>>>>>>>>>>>>>>>>> ? ? ? ? ? ? >>>>>>>>>>>>>>>>>>>>>>> votes="1"> >>>>>>>>>>>>>>>>>>>>>>>> ? ? ? ? ? ? ? ? ? ? >>>>>>>>>>>>>>>>>>>>>>>> ? ? ? ? ? ? >>>>>>>>>>>>>>>>>>>>>>>> ? ? ? ? ? ? >>>>>>>>>>>>>>>>>>>>>>> votes="1"> >>>>>>>>>>>>>>>>>>>>>>>> ? ? ? ? ? ? ? ? ? ? >>>>>>>>>>>>>>>>>>>>>>>> ? ? ? ? ? ? >>>>>>>>>>>>>>>>>>>>>>>> ? ? ? ? ? ? >>>>>>>>>>>>>>>>>>>>>>> votes="1"> >>>>>>>>>>>>>>>>>>>>>>>> ? ? ? ? ? ? ? ? ? ? >>>>>>>>>>>>>>>>>>>>>>>> ? ? ? ? ? ? >>>>>>>>>>>>>>>>>>>>>>>> ? ? >>>>>>>>>>>>>>>>>>>>>>>> ? ? >>>>>>>>>>>>>>>>>>>>>>>> ? ? >>>>>>>>>>>>>>>>>>>>>>>> ? ? >>>>>>>>>>>>>>>>>>>>>>>> ? ? ? ? ? ? >>>>>>>>>>>>>>>>>>>>>>>> ? ? ? ? ? ? ? ? ? ? >>>>>>>>>>>>>>>>>>>>>>> nofailback="0" ordered="1" restricted="0"> >>>>>>>>>>>>>>>>>>>>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? >>>>>>>>>>>>>>>>>>>>>>> name="cvtst2" priority="3"/> >>>>>>>>>>>>>>>>>>>>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? >>>>>>>>>>>>>>>>>>>>>>> name="cvtst1" priority="1"/> >>>>>>>>>>>>>>>>>>>>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? >>>>>>>>>>>>>>>>>>>>>>> name="cvtst3" priority="2"/> >>>>>>>>>>>>>>>>>>>>>>>> ? ? ? ? ? ? ? ? ? ? >>>>>>>>>>>>>>>>>>>>>>>> ? ? ? ? ? ? >>>>>>>>>>>>>>>>>>>>>>>> ? ? ? ? ? ? >>>>>>>>>>>>>>>>>>>>>>>> ? ? ? ? ? ? >>>>>>>>>>>>>>>>>>>>>>> exclusive="0" max_restarts="0" >>>>>>>>>>>>>>>>>>>>>>>> name="guest1" path="/vms" recovery="r >>>>>>>>>>>>>>>>>>>>>>>> estart" restart_expire_time="0"/> >>>>>>>>>>>>>>>>>>>>>>>> ? ? >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> [root at cvtst1 cluster]# >>>>>>>>>>>>>>>>>>>>>>>> ------ >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> Thanks! >>>>>>>>>>>>>>>>>>>>>>>> Paras. 
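The cluster.conf listing above lost its XML tags to the mail archiver, so only the
attribute values survive. Pieced back together from those attributes, the vm resource
line is probably close to the sketch below; the use_virsh="0" attribute that fixes the
virsh/xm confusion later in the thread goes on this same tag. This is a hedged
reconstruction, not Paras's verbatim file:

  <vm use_virsh="0" exclusive="0" max_restarts="0" name="guest1"
      path="/vms" recovery="restart" restart_expire_time="0"/>

As far as I can tell, path="/vms" means rgmanager looks for an xm config file named
/vms/guest1, which is what vm.sh hands to xm create when use_virsh is 0.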
>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> On Sun, Sep 20, 2009 at 9:44 AM, Volker Dormeyer >>>>>>>>>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> On Fri, Sep 18, 2009 at 05:08:57PM -0500, >>>>>>>>>>>>>>>>>>>>>>>>> Paras pradhan wrote: >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> I am using cluster suite for HA of xen virtual >>>>>>>>>>>>>>>>>>>>>>>>>> machines. >>>>>>>>>>>>>>>>>>>>>>>>>> Now I am >>>>>>>>>>>>>>>>>>>>>>>>>> having another problem. When I start the my xen vm >>>>>>>>>>>>>>>>>>>>>>>>>> in >>>>>>>>>>>>>>>>>>>>>>>>>> one node, it >>>>>>>>>>>>>>>>>>>>>>>>>> also starts on other nodes. Which daemon controls >>>>>>>>>>>>>>>>>>>>>>>>>> ?this? >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> This is usually done bei clurgmgrd (which is part >>>>>>>>>>>>>>>>>>>>>>>>> of >>>>>>>>>>>>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>>>>>>>>>>> rgmanager >>>>>>>>>>>>>>>>>>>>>>>>> package). To me, this sounds like a configuration >>>>>>>>>>>>>>>>>>>>>>>>> problem. Maybe, >>>>>>>>>>>>>>>>>>>>>>>>> you can post your cluster.conf? >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> Regards, >>>>>>>>>>>>>>>>>>>>>>>>> Volker >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>>>>>>>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>>>>>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>>>>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>>>>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>>>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>>>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>> -- >>>>>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>>>>> 
https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>> -- >>>>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>> -- >>>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>> -- >>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> -- >>>>>>>>>> Linux-cluster mailing list >>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>> >>>>>>>>>> >>>>>>>>> -- >>>>>>>>> Linux-cluster mailing list >>>>>>>>> Linux-cluster at redhat.com >>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>> >>>>>>>>> >>>>>>>> -- >>>>>>>> Linux-cluster mailing list >>>>>>>> Linux-cluster at redhat.com >>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>> >>>>>>> -- >>>>>>> Linux-cluster mailing list >>>>>>> Linux-cluster at redhat.com >>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>> >>>>>> -- >>>>>> Linux-cluster mailing list >>>>>> Linux-cluster at redhat.com >>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>> >>>>> -- >>>>> Linux-cluster mailing list >>>>> Linux-cluster at redhat.com >>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>> >>>> -- >>>> Linux-cluster mailing list >>>> Linux-cluster at redhat.com >>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>> >>> -- >>> - Daniela Anzellotti ------------------------------------ >>> ?INFN Roma - tel.: +39.06.49914282 - fax: +39.06.490354 >>> ?e-mail: daniela.anzellotti at roma1.infn.it >>> --------------------------------------------------------- >>> >>> -- >>> Linux-cluster mailing list >>> Linux-cluster at redhat.com >>> https://www.redhat.com/mailman/listinfo/linux-cluster >>> >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> > > -- > - Daniela Anzellotti ------------------------------------ > ?INFN Roma - tel.: +39.06.49914282 - fax: +39.06.490354 > ?e-mail: daniela.anzellotti at roma1.infn.it > --------------------------------------------------------- > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From daniela.anzellotti at roma1.infn.it Tue Oct 6 18:01:35 2009 From: daniela.anzellotti at roma1.infn.it (Daniela Anzellotti) Date: Tue, 06 Oct 2009 20:01:35 +0200 Subject: [Linux-cluster] openais issue In-Reply-To: <8b711df40910060920q44d502a6j34e3ca6cfeb63128@mail.gmail.com> References: <8b711df40909161052i41f9c007na85cf2ef2919ae0f@mail.gmail.com> <29ae894c0909291344l49a2a810t33582eb6c3932810@mail.gmail.com> <8b711df40909291354w55f92097wcdef691d0b239dee@mail.gmail.com> <4AC2986F.8050100@io-consulting.net> <8b711df40909300749i2af6a711v2f866c55a046a388@mail.gmail.com> <29ae894c0909300802oa8d72c5k892115a3e2f67db9@mail.gmail.com> <8b711df40909300811q1724aa68hfea589bbb32b4ce5@mail.gmail.com> <4AC9E445.3050702@roma1.infn.it> <8b711df40910060831n6b6c7a6aw27761c89403c51e1@mail.gmail.com> <4ACB66BA.8040002@roma1.infn.it> <8b711df40910060920q44d502a6j34e3ca6cfeb63128@mail.gmail.com> Message-ID: <4ACB85FF.2010601@roma1.infn.it> Hi Paras, did you 
reboot all the cluster nodes? I needed a complete reboot (actually I was angry enough to switch everything off and on again): a restart of the cluster suite was not enough. As far as I understood, since the cluster was not able to bring a VM in a good running state, it decided that the VM has to migrate to another node... and at the end all the VMs was trying to migrate from one node to another with the result that I had all VMs starting on all the cluster nodes. Restarting the cluster suite didn't kill a lot of processes that was stubbornly trying to migrate virtual machines... Daniela Paras pradhan wrote: > Adding use_virsh=0 works great. Now I do not have my vm starting at > all the nodes. This is a good fix. Thanks .. > > The only problem left is I do not see rgmanager running on my node1 > and the clustat of node2 and node3 is reporting vm as migrating > > o/p > > Service Name Owner (Last) > State > ------- ---- ----- ------ > ----- > vm:guest1 cvtst1 > migrating > [root at cvtst3 vms]# > > > Thanks > Paras. > > On Tue, Oct 6, 2009 at 10:48 AM, Daniela Anzellotti > wrote: >> Hi Paras, >> >> yes. At least it looks so... >> >> We have a cluster of two nodes + a quorum disk (it's not configured as a >> "two-node cluster") >> >> They are running Scientific Linux 5.x, kernel 2.6.18-128.7.1.el5xen and >> >> openais-0.80.6-8.el5.x86_64 >> cman-2.0.115-1.el5.x86_64 >> rgmanager-2.0.52-1.el5.x86_64 >> >> The XEN VMs access the disk as simple block devices. >> Disks are on a SAN, configured with Clustered LVM. >> >> xen-3.0.3-94.el5_4.1.x86_64 >> xen-libs-3.0.3-94.el5_4.1.x86_64 >> >> VM configuration files are as the following >> >> name = "www1" >> uuid = "3bd3e910-23c0-97ee-55ab-086260ef1e53" >> memory = 1024 >> maxmem = 1024 >> vcpus = 1 >> bootloader = "/usr/bin/pygrub" >> vfb = [ "type=vnc,vncunused=1,keymap=en-us" ] >> disk = [ "phy:/dev/vg_cluster/www1.disk,xvda,w", \ >> "phy:/dev/vg_cluster/www1.swap,xvdb,w" ] >> vif = [ "mac=00:16:3e:da:00:07,bridge=xenbr1" ] >> on_poweroff = "destroy" >> on_reboot = "restart" >> on_crash = "restart" >> extra = "xencons=tty0 console=tty0" >> >> >> I changed in /etc/cluster/cluster.conf all the VM directive from >> >> > migrate="live" name="www1" path="/etc/xen" recovery="restart"/> >> >> to >> >> > migrate="live" name="www1" path="/etc/xen" recovery="restart"/> >> >> >> Rebooted the cluster nodes and it started working again... >> >> As i said I hope I'll not have any other bad surprise (I tested a VM >> migration and it is working too), but at least cluster it's working now (it >> was not able to start a VM, before)! >> >> Ciao >> Daniela >> >> >> Paras pradhan wrote: >>> So you mean your cluster is running fine with the CMAN >>> cman-2.0.115-1.el5.x86_64 ? >>> >>> Which version of openais are you running? >>> >>> Thanks >>> Paras. >>> >>> >>> On Mon, Oct 5, 2009 at 7:19 AM, Daniela Anzellotti >>> wrote: >>>> Hi all, >>>> >>>> I had a problem similar to Paras's one today: yum updated the following >>>> rpms >>>> last week and today (I had to restart the cluster) the cluster was not >>>> able >>>> to start vm: services. 
>>>> >>>> Oct 02 05:31:05 Updated: openais-0.80.6-8.el5.x86_64 >>>> Oct 02 05:31:07 Updated: cman-2.0.115-1.el5.x86_64 >>>> Oct 02 05:31:10 Updated: rgmanager-2.0.52-1.el5.x86_64 >>>> >>>> Oct 03 04:03:12 Updated: xen-libs-3.0.3-94.el5_4.1.x86_64 >>>> Oct 03 04:03:12 Updated: xen-libs-3.0.3-94.el5_4.1.i386 >>>> Oct 03 04:03:16 Updated: xen-3.0.3-94.el5_4.1.x86_64 >>>> >>>> >>>> So, after checked the vm.sh script, I add the declaration use_virsh="0" >>>> in >>>> the VM definition in the cluster.conf (as suggested by Brem, thanks!) and >>>> everything is now working again. >>>> >>>> >>>> BTW I didn't understand if the problem was caused by the new XEN version >>>> or >>>> the new openais one, thus I disabled automatic updates for both. >>>> >>>> I hope I'll not have any other bad surprise... >>>> >>>> Thank you, >>>> cheers, >>>> Daniela >>>> >>>> >>>> Paras pradhan wrote: >>>>> Yes this is very strange. I don't know what to do now. May be re >>>>> create the cluster? But not a good solution actually. >>>>> >>>>> Packages : >>>>> >>>>> Kernel: kernel-xen-2.6.18-164.el5 >>>>> OS: Full updated of CentOS 5.3 except CMAN downgraded to >>>>> cman-2.0.98-1.el5 >>>>> >>>>> Other packages related to cluster suite: >>>>> >>>>> rgmanager-2.0.52-1.el5.centos >>>>> cman-2.0.98-1.el5 >>>>> xen-3.0.3-80.el5_3.3 >>>>> xen-libs-3.0.3-80.el5_3.3 >>>>> kmod-gfs-xen-0.1.31-3.el5_3.1 >>>>> kmod-gfs-xen-0.1.31-3.el5_3.1 >>>>> kmod-gfs-0.1.31-3.el5_3.1 >>>>> gfs-utils-0.1.18-1.el5 >>>>> gfs2-utils-0.1.62-1.el5 >>>>> lvm2-2.02.40-6.el5 >>>>> lvm2-cluster-2.02.40-7.el5 >>>>> openais-0.80.3-22.el5_3.9 >>>>> >>>>> Thanks! >>>>> Paras. >>>>> >>>>> >>>>> >>>>> >>>>> On Wed, Sep 30, 2009 at 10:02 AM, brem belguebli >>>>> wrote: >>>>>> Hi Paras, >>>>>> >>>>>> Your cluster.conf file seems correct. If it is not a ntp issue, I >>>>>> don't see anything except a bug that causes this, or some prerequisite >>>>>> that is not respected. >>>>>> >>>>>> May be you could post the versions (os, kernel, packages etc...) you >>>>>> are using, someone may have hit the same issue with your versions. >>>>>> >>>>>> Brem >>>>>> >>>>>> 2009/9/30, Paras pradhan : >>>>>>> All of the nodes are synced with ntp server. So this is not the case >>>>>>> with me. >>>>>>> >>>>>>> Thanks >>>>>>> Paras. >>>>>>> >>>>>>> On Tue, Sep 29, 2009 at 6:29 PM, Johannes Ru?ek >>>>>>> wrote: >>>>>>>> make sure the time on the nodes is in sync, apparently when a node >>>>>>>> has >>>>>>>> too >>>>>>>> much offset, you won't see rgmanager (even though the process is >>>>>>>> running). >>>>>>>> this happened today and setting the time fixed it for me. afaicr >>>>>>>> there >>>>>>>> was >>>>>>>> no sign of this in the logs though. >>>>>>>> johannes >>>>>>>> >>>>>>>> Paras pradhan schrieb: >>>>>>>>> I don't see rgmanager . >>>>>>>>> >>>>>>>>> Here is the o/p from clustat >>>>>>>>> >>>>>>>>> [root at cvtst1 cluster]# clustat >>>>>>>>> Cluster Status for test @ Tue Sep 29 15:53:33 2009 >>>>>>>>> Member Status: Quorate >>>>>>>>> >>>>>>>>> Member Name ID >>>>>>>>> Status >>>>>>>>> ------ ---- >>>>>>>>> ---- >>>>>>>>> ------ >>>>>>>>> cvtst2 1 Online >>>>>>>>> cvtst1 2 >>>>>>>>> Online, >>>>>>>>> Local >>>>>>>>> cvtst3 3 Online >>>>>>>>> >>>>>>>>> >>>>>>>>> Thanks >>>>>>>>> Paras. >>>>>>>>> >>>>>>>>> On Tue, Sep 29, 2009 at 3:44 PM, brem belguebli >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> It looks correct, rgmanager seems to start on all nodes >>>>>>>>>> >>>>>>>>>> what gives you clustat ? 
>>>>>>>>>> >>>>>>>>>> If rgmanager doesn't show, check out the logs something may have >>>>>>>>>> gone >>>>>>>>>> wrong. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> 2009/9/29 Paras pradhan : >>>>>>>>>> >>>>>>>>>>> Change to 7 and i got this log >>>>>>>>>>> >>>>>>>>>>> Sep 29 15:33:50 cvtst1 rgmanager: [23295]: Shutting down >>>>>>>>>>> Cluster Service Manager... >>>>>>>>>>> Sep 29 15:33:50 cvtst1 clurgmgrd[22869]: Shutting down >>>>>>>>>>> Sep 29 15:33:50 cvtst1 clurgmgrd[22869]: Shutting down >>>>>>>>>>> Sep 29 15:33:50 cvtst1 clurgmgrd[22869]: Shutdown >>>>>>>>>>> complete, >>>>>>>>>>> exiting >>>>>>>>>>> Sep 29 15:33:50 cvtst1 rgmanager: [23295]: Cluster >>>>>>>>>>> Service >>>>>>>>>>> Manager is stopped. >>>>>>>>>>> Sep 29 15:33:51 cvtst1 clurgmgrd[23324]: Resource Group >>>>>>>>>>> Manager Starting >>>>>>>>>>> Sep 29 15:33:51 cvtst1 clurgmgrd[23324]: Loading Service >>>>>>>>>>> Data >>>>>>>>>>> Sep 29 15:33:51 cvtst1 clurgmgrd[23324]: Loading Resource >>>>>>>>>>> Rules >>>>>>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: 21 rules loaded >>>>>>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: Building Resource >>>>>>>>>>> Trees >>>>>>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: 0 resources >>>>>>>>>>> defined >>>>>>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: Loading Failover >>>>>>>>>>> Domains >>>>>>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: 1 domains defined >>>>>>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: 1 events defined >>>>>>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: Initializing >>>>>>>>>>> Services >>>>>>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: Services >>>>>>>>>>> Initialized >>>>>>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: Event: Port >>>>>>>>>>> Opened >>>>>>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: State change: >>>>>>>>>>> Local >>>>>>>>>>> UP >>>>>>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: State change: >>>>>>>>>>> cvtst2 >>>>>>>>>>> UP >>>>>>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: State change: >>>>>>>>>>> cvtst3 >>>>>>>>>>> UP >>>>>>>>>>> Sep 29 15:33:57 cvtst1 clurgmgrd[23324]: Event (1:2:1) >>>>>>>>>>> Processed >>>>>>>>>>> Sep 29 15:33:57 cvtst1 clurgmgrd[23324]: Event (0:1:1) >>>>>>>>>>> Processed >>>>>>>>>>> Sep 29 15:33:57 cvtst1 clurgmgrd[23324]: Event (0:3:1) >>>>>>>>>>> Processed >>>>>>>>>>> Sep 29 15:34:02 cvtst1 clurgmgrd[23324]: 3 events >>>>>>>>>>> processed >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> Anything unusual here? >>>>>>>>>>> >>>>>>>>>>> Paras. >>>>>>>>>>> >>>>>>>>>>> On Tue, Sep 29, 2009 at 11:51 AM, brem belguebli >>>>>>>>>>> wrote: >>>>>>>>>>> >>>>>>>>>>>> I use log_level=7 to have more debugging info. >>>>>>>>>>>> >>>>>>>>>>>> It seems 4 is not enough. >>>>>>>>>>>> >>>>>>>>>>>> Brem >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> 2009/9/29, Paras pradhan : >>>>>>>>>>>> >>>>>>>>>>>>> Withe log_level of 3 I got only this >>>>>>>>>>>>> >>>>>>>>>>>>> Sep 29 10:31:31 cvtst1 rgmanager: [7170]: Shutting down >>>>>>>>>>>>> Cluster Service Manager... >>>>>>>>>>>>> Sep 29 10:31:31 cvtst1 clurgmgrd[6673]: Shutting down >>>>>>>>>>>>> Sep 29 10:31:41 cvtst1 clurgmgrd[6673]: Shutdown >>>>>>>>>>>>> complete, >>>>>>>>>>>>> exiting >>>>>>>>>>>>> Sep 29 10:31:41 cvtst1 rgmanager: [7170]: Cluster >>>>>>>>>>>>> Service >>>>>>>>>>>>> Manager is stopped. >>>>>>>>>>>>> Sep 29 10:31:42 cvtst1 clurgmgrd[7224]: Resource Group >>>>>>>>>>>>> Manager Starting >>>>>>>>>>>>> Sep 29 10:39:06 cvtst1 rgmanager: [10327]: Shutting >>>>>>>>>>>>> down >>>>>>>>>>>>> Cluster Service Manager... 
>>>>>>>>>>>>> Sep 29 10:39:16 cvtst1 rgmanager: [10327]: Cluster >>>>>>>>>>>>> Service >>>>>>>>>>>>> Manager is stopped. >>>>>>>>>>>>> Sep 29 10:39:16 cvtst1 clurgmgrd[10380]: Resource Group >>>>>>>>>>>>> Manager Starting >>>>>>>>>>>>> Sep 29 10:39:52 cvtst1 clurgmgrd[10380]: Member 1 >>>>>>>>>>>>> shutting >>>>>>>>>>>>> down >>>>>>>>>>>>> >>>>>>>>>>>>> I do not know what the last line means. >>>>>>>>>>>>> >>>>>>>>>>>>> rgmanager version I am running is: >>>>>>>>>>>>> rgmanager-2.0.52-1.el5.centos >>>>>>>>>>>>> >>>>>>>>>>>>> I don't what has gone wrong. >>>>>>>>>>>>> >>>>>>>>>>>>> Thanks >>>>>>>>>>>>> Paras. >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> On Mon, Sep 28, 2009 at 6:41 PM, brem belguebli >>>>>>>>>>>>> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> you mean it stopped successfully on all the nodes but it is >>>>>>>>>>>>>> failing >>>>>>>>>>>>>> to >>>>>>>>>>>>>> start only on node cvtst1 ? >>>>>>>>>>>>>> >>>>>>>>>>>>>> look at the following page to make rgmanager more verbose. It >>>>>>>>>>>>>> 'll >>>>>>>>>>>>>> help debug.... >>>>>>>>>>>>>> >>>>>>>>>>>>>> http://sources.redhat.com/cluster/wiki/RGManager >>>>>>>>>>>>>> >>>>>>>>>>>>>> at Logging Configuration section >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> 2009/9/29 Paras pradhan : >>>>>>>>>>>>>> >>>>>>>>>>>>>>> Brem, >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> When I try to restart rgmanager on all the nodes, this time i >>>>>>>>>>>>>>> do >>>>>>>>>>>>>>> not >>>>>>>>>>>>>>> see rgmanager running on the first node. But I do see on other >>>>>>>>>>>>>>> 2 >>>>>>>>>>>>>>> nodes. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Log on the first node: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Sep 28 18:13:58 cvtst1 clurgmgrd[24099]: Resource >>>>>>>>>>>>>>> Group >>>>>>>>>>>>>>> Manager Starting >>>>>>>>>>>>>>> Sep 28 18:17:29 cvtst1 rgmanager: [24627]: Shutting >>>>>>>>>>>>>>> down >>>>>>>>>>>>>>> Cluster Service Manager... >>>>>>>>>>>>>>> Sep 28 18:17:29 cvtst1 clurgmgrd[24099]: Shutting >>>>>>>>>>>>>>> down >>>>>>>>>>>>>>> Sep 28 18:17:39 cvtst1 clurgmgrd[24099]: Shutdown >>>>>>>>>>>>>>> complete, >>>>>>>>>>>>>>> exiting >>>>>>>>>>>>>>> Sep 28 18:17:39 cvtst1 rgmanager: [24627]: Cluster >>>>>>>>>>>>>>> Service >>>>>>>>>>>>>>> Manager is stopped. >>>>>>>>>>>>>>> Sep 28 18:17:40 cvtst1 clurgmgrd[24679]: Resource >>>>>>>>>>>>>>> Group >>>>>>>>>>>>>>> Manager Starting >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> - >>>>>>>>>>>>>>> It seems service is running , but I do not see rgmanger >>>>>>>>>>>>>>> running >>>>>>>>>>>>>>> using clustat >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Don't know what is going on. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Thanks >>>>>>>>>>>>>>> Paras. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Mon, Sep 28, 2009 at 5:46 PM, brem belguebli >>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Paras, >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Another thing, it would have been more interesting to have a >>>>>>>>>>>>>>>> start >>>>>>>>>>>>>>>> DEBUG not a stop. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> That's why I was asking you to first stop the vm manually on >>>>>>>>>>>>>>>> all >>>>>>>>>>>>>>>> your >>>>>>>>>>>>>>>> nodes, stop eventually rgmanager on all the nodes to reset >>>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>> potential wrong states you may have, restart rgmanager. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> If your VM is configured to autostart, this will make it >>>>>>>>>>>>>>>> start. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> It should normally fail (as it does now). Send out your newly >>>>>>>>>>>>>>>> created >>>>>>>>>>>>>>>> DEBUG file. 
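Spelled out as commands, the reset brem is describing looks roughly like this. It is
only a sketch: node and guest names are the ones used in this thread, and the last
step is only needed if the vm resource is not set to autostart:

  # on each node, in this order
  xm shutdown guest1          # or: xm destroy guest1, if it refuses to stop
  service rgmanager stop

  # once rgmanager is down on all nodes, bring it back up everywhere
  service rgmanager start

  # then, if the VM did not autostart, enable it through the cluster
  clusvcadm -e vm:guest1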
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> 2009/9/29 brem belguebli : >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Hi Paras, >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> I don't know the xen/cluster combination well, but if I do >>>>>>>>>>>>>>>>> remember >>>>>>>>>>>>>>>>> well, I think I've read somewhere that when using xen you >>>>>>>>>>>>>>>>> have >>>>>>>>>>>>>>>>> to >>>>>>>>>>>>>>>>> declare the use_virsh=0 key in the VM definition in the >>>>>>>>>>>>>>>>> cluster.conf. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> This would make rgmanager use xm commands instead of virsh >>>>>>>>>>>>>>>>> The DEBUG output shows clearly that you are using virsh to >>>>>>>>>>>>>>>>> manage >>>>>>>>>>>>>>>>> your >>>>>>>>>>>>>>>>> VM instead of xm commands. >>>>>>>>>>>>>>>>> Check out the RH docs about virtualization >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> I'm not a 100% sure about that, I may be completely wrong. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Brem >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> 2009/9/28 Paras pradhan : >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> The only thing I noticed is the message after stopping the >>>>>>>>>>>>>>>>>> vm >>>>>>>>>>>>>>>>>> using xm >>>>>>>>>>>>>>>>>> in all nodes and starting using clusvcadm is >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> "Virtual machine guest1 is blocked" >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> The whole DEBUG file is attached. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Thanks >>>>>>>>>>>>>>>>>> Paras. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> On Fri, Sep 25, 2009 at 5:53 PM, brem belguebli >>>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> There's a problem with the script that is called by >>>>>>>>>>>>>>>>>>> rgmanager to >>>>>>>>>>>>>>>>>>> start >>>>>>>>>>>>>>>>>>> the VM, I don't know what causes it >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> May be you should try something like : >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> 1) stop the VM on all nodes with xm commands >>>>>>>>>>>>>>>>>>> 2) edit the /usr/share/cluster/vm.sh script and add the >>>>>>>>>>>>>>>>>>> following >>>>>>>>>>>>>>>>>>> lines (after the #!/bin/bash ): >>>>>>>>>>>>>>>>>>> exec >/tmp/DEBUG 2>&1 >>>>>>>>>>>>>>>>>>> set -x >>>>>>>>>>>>>>>>>>> 3) start the VM with clusvcadm -e vm:guest1 >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> It should fail as it did before. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> edit the the /tmp/DEBUG file and you will be able to see >>>>>>>>>>>>>>>>>>> where >>>>>>>>>>>>>>>>>>> it >>>>>>>>>>>>>>>>>>> fails (it may generate a lot of debug) >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> 4) remove the debug lines from /usr/share/cluster/vm.sh >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Post the DEBUG file if you're not able to see where it >>>>>>>>>>>>>>>>>>> fails. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Brem >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> 2009/9/26 Paras pradhan : >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> No I am not manually starting not using automatic init >>>>>>>>>>>>>>>>>>>> scripts. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> I started the vm using: clusvcadm -e vm:guest1 >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> I have just stopped using clusvcadm -s vm:guest1. For few >>>>>>>>>>>>>>>>>>>> seconds it >>>>>>>>>>>>>>>>>>>> says guest1 started . But after a while I can see the >>>>>>>>>>>>>>>>>>>> guest1 on >>>>>>>>>>>>>>>>>>>> all >>>>>>>>>>>>>>>>>>>> three nodes. 
>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> clustat says: >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Service Name >>>>>>>>>>>>>>>>>>>> Owner >>>>>>>>>>>>>>>>>>>> (Last) >>>>>>>>>>>>>>>>>>>> State >>>>>>>>>>>>>>>>>>>> ------- ---- >>>>>>>>>>>>>>>>>>>> ----- >>>>>>>>>>>>>>>>>>>> ------ >>>>>>>>>>>>>>>>>>>> ----- >>>>>>>>>>>>>>>>>>>> vm:guest1 >>>>>>>>>>>>>>>>>>>> (none) >>>>>>>>>>>>>>>>>>>> stopped >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> But I can see the vm from xm li. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> This is what I can see from the log: >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Sep 25 17:19:01 cvtst1 clurgmgrd[4298]: start on >>>>>>>>>>>>>>>>>>>> vm >>>>>>>>>>>>>>>>>>>> "guest1" >>>>>>>>>>>>>>>>>>>> returned 1 (generic error) >>>>>>>>>>>>>>>>>>>> Sep 25 17:19:01 cvtst1 clurgmgrd[4298]: #68: >>>>>>>>>>>>>>>>>>>> Failed >>>>>>>>>>>>>>>>>>>> to start >>>>>>>>>>>>>>>>>>>> vm:guest1; return value: 1 >>>>>>>>>>>>>>>>>>>> Sep 25 17:19:01 cvtst1 clurgmgrd[4298]: Stopping >>>>>>>>>>>>>>>>>>>> service vm:guest1 >>>>>>>>>>>>>>>>>>>> Sep 25 17:19:02 cvtst1 clurgmgrd[4298]: Service >>>>>>>>>>>>>>>>>>>> vm:guest1 is >>>>>>>>>>>>>>>>>>>> recovering >>>>>>>>>>>>>>>>>>>> Sep 25 17:19:15 cvtst1 clurgmgrd[4298]: >>>>>>>>>>>>>>>>>>>> Recovering >>>>>>>>>>>>>>>>>>>> failed >>>>>>>>>>>>>>>>>>>> service vm:guest1 >>>>>>>>>>>>>>>>>>>> Sep 25 17:19:16 cvtst1 clurgmgrd[4298]: start on >>>>>>>>>>>>>>>>>>>> vm >>>>>>>>>>>>>>>>>>>> "guest1" >>>>>>>>>>>>>>>>>>>> returned 1 (generic error) >>>>>>>>>>>>>>>>>>>> Sep 25 17:19:16 cvtst1 clurgmgrd[4298]: #68: >>>>>>>>>>>>>>>>>>>> Failed >>>>>>>>>>>>>>>>>>>> to start >>>>>>>>>>>>>>>>>>>> vm:guest1; return value: 1 >>>>>>>>>>>>>>>>>>>> Sep 25 17:19:16 cvtst1 clurgmgrd[4298]: Stopping >>>>>>>>>>>>>>>>>>>> service vm:guest1 >>>>>>>>>>>>>>>>>>>> Sep 25 17:19:17 cvtst1 clurgmgrd[4298]: Service >>>>>>>>>>>>>>>>>>>> vm:guest1 is >>>>>>>>>>>>>>>>>>>> recovering >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Paras. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> On Fri, Sep 25, 2009 at 5:07 PM, brem belguebli >>>>>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> Have you started your VM via rgmanager (clusvcadm -e >>>>>>>>>>>>>>>>>>>>> vm:guest1) or >>>>>>>>>>>>>>>>>>>>> using xm commands out of cluster control (or maybe a >>>>>>>>>>>>>>>>>>>>> thru >>>>>>>>>>>>>>>>>>>>> an >>>>>>>>>>>>>>>>>>>>> automatic init script ?) >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> When clustered, you should never be starting services >>>>>>>>>>>>>>>>>>>>> (manually or >>>>>>>>>>>>>>>>>>>>> thru automatic init script) out of cluster control >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> The thing would be to stop your vm on all the nodes with >>>>>>>>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>>>>>>> adequate >>>>>>>>>>>>>>>>>>>>> xm command (not using xen myself) and try to start it >>>>>>>>>>>>>>>>>>>>> with >>>>>>>>>>>>>>>>>>>>> clusvcadm. >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> Then see if it is started on all nodes (send clustat >>>>>>>>>>>>>>>>>>>>> output) >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> 2009/9/25 Paras pradhan : >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> Ok. Please see below. my vm is running on all nodes >>>>>>>>>>>>>>>>>>>>>> though >>>>>>>>>>>>>>>>>>>>>> clustat >>>>>>>>>>>>>>>>>>>>>> says it is stopped. 
>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>>>>> [root at cvtst1 ~]# clustat >>>>>>>>>>>>>>>>>>>>>> Cluster Status for test @ Fri Sep 25 16:52:34 2009 >>>>>>>>>>>>>>>>>>>>>> Member Status: Quorate >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> Member Name >>>>>>>>>>>>>>>>>>>>>> ID Status >>>>>>>>>>>>>>>>>>>>>> ------ ---- >>>>>>>>>>>>>>>>>>>>>> ---- ------ >>>>>>>>>>>>>>>>>>>>>> cvtst2 >>>>>>>>>>>>>>>>>>>>>> 1 >>>>>>>>>>>>>>>>>>>>>> Online, rgmanager >>>>>>>>>>>>>>>>>>>>>> cvtst1 >>>>>>>>>>>>>>>>>>>>>> 2 >>>>>>>>>>>>>>>>>>>>>> Online, >>>>>>>>>>>>>>>>>>>>>> Local, rgmanager >>>>>>>>>>>>>>>>>>>>>> cvtst3 >>>>>>>>>>>>>>>>>>>>>> 3 >>>>>>>>>>>>>>>>>>>>>> Online, rgmanager >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> Service Name >>>>>>>>>>>>>>>>>>>>>> Owner (Last) >>>>>>>>>>>>>>>>>>>>>> State >>>>>>>>>>>>>>>>>>>>>> ------- ---- >>>>>>>>>>>>>>>>>>>>>> ----- ------ >>>>>>>>>>>>>>>>>>>>>> ----- >>>>>>>>>>>>>>>>>>>>>> vm:guest1 >>>>>>>>>>>>>>>>>>>>>> (none) >>>>>>>>>>>>>>>>>>>>>> stopped >>>>>>>>>>>>>>>>>>>>>> [root at cvtst1 ~]# >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> --- >>>>>>>>>>>>>>>>>>>>>> o/p of xm li on cvtst1 >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>>>>> [root at cvtst1 ~]# xm li >>>>>>>>>>>>>>>>>>>>>> Name ID Mem(MiB) >>>>>>>>>>>>>>>>>>>>>> VCPUs >>>>>>>>>>>>>>>>>>>>>> State Time(s) >>>>>>>>>>>>>>>>>>>>>> Domain-0 0 3470 >>>>>>>>>>>>>>>>>>>>>> 2 >>>>>>>>>>>>>>>>>>>>>> r----- 28939.4 >>>>>>>>>>>>>>>>>>>>>> guest1 7 511 >>>>>>>>>>>>>>>>>>>>>> 1 >>>>>>>>>>>>>>>>>>>>>> -b---- 7727.8 >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> o/p of xm li on cvtst2 >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>>>>> [root at cvtst2 ~]# xm li >>>>>>>>>>>>>>>>>>>>>> Name ID Mem(MiB) >>>>>>>>>>>>>>>>>>>>>> VCPUs >>>>>>>>>>>>>>>>>>>>>> State Time(s) >>>>>>>>>>>>>>>>>>>>>> Domain-0 0 3470 >>>>>>>>>>>>>>>>>>>>>> 2 >>>>>>>>>>>>>>>>>>>>>> r----- 31558.9 >>>>>>>>>>>>>>>>>>>>>> guest1 21 511 >>>>>>>>>>>>>>>>>>>>>> 1 >>>>>>>>>>>>>>>>>>>>>> -b---- 7558.2 >>>>>>>>>>>>>>>>>>>>>> --- >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> Thanks >>>>>>>>>>>>>>>>>>>>>> Paras. >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> On Fri, Sep 25, 2009 at 4:22 PM, brem belguebli >>>>>>>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> It looks like no. >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> can you send an output of clustat of when the VM is >>>>>>>>>>>>>>>>>>>>>>> running >>>>>>>>>>>>>>>>>>>>>>> on >>>>>>>>>>>>>>>>>>>>>>> multiple nodes at the same time? >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> And by the way, another one after having stopped >>>>>>>>>>>>>>>>>>>>>>> (clusvcadm >>>>>>>>>>>>>>>>>>>>>>> -s vm:guest1) ? >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> 2009/9/25 Paras pradhan : >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> Anyone having issue as mine? Virtual machine service >>>>>>>>>>>>>>>>>>>>>>>> is >>>>>>>>>>>>>>>>>>>>>>>> not >>>>>>>>>>>>>>>>>>>>>>>> being >>>>>>>>>>>>>>>>>>>>>>>> properly handled by the cluster. >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> Thanks >>>>>>>>>>>>>>>>>>>>>>>> Paras. >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> On Mon, Sep 21, 2009 at 9:55 AM, Paras pradhan >>>>>>>>>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> Ok.. 
here is my cluster.conf file >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>>>>>>>> [root at cvtst1 cluster]# more cluster.conf >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> name="test"> >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> post_fail_delay="0" >>>>>>>>>>>>>>>>>>>>>>>>> post_join_delay="3"/> >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> votes="1"> >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> votes="1"> >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> votes="1"> >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> nofailback="0" ordered="1" restricted="0"> >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> name="cvtst2" priority="3"/> >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> name="cvtst1" priority="1"/> >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> name="cvtst3" priority="2"/> >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> exclusive="0" max_restarts="0" >>>>>>>>>>>>>>>>>>>>>>>>> name="guest1" path="/vms" recovery="r >>>>>>>>>>>>>>>>>>>>>>>>> estart" restart_expire_time="0"/> >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> [root at cvtst1 cluster]# >>>>>>>>>>>>>>>>>>>>>>>>> ------ >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> Thanks! >>>>>>>>>>>>>>>>>>>>>>>>> Paras. >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> On Sun, Sep 20, 2009 at 9:44 AM, Volker Dormeyer >>>>>>>>>>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> On Fri, Sep 18, 2009 at 05:08:57PM -0500, >>>>>>>>>>>>>>>>>>>>>>>>>> Paras pradhan wrote: >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>> I am using cluster suite for HA of xen virtual >>>>>>>>>>>>>>>>>>>>>>>>>>> machines. >>>>>>>>>>>>>>>>>>>>>>>>>>> Now I am >>>>>>>>>>>>>>>>>>>>>>>>>>> having another problem. When I start the my xen vm >>>>>>>>>>>>>>>>>>>>>>>>>>> in >>>>>>>>>>>>>>>>>>>>>>>>>>> one node, it >>>>>>>>>>>>>>>>>>>>>>>>>>> also starts on other nodes. Which daemon controls >>>>>>>>>>>>>>>>>>>>>>>>>>> this? >>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> This is usually done bei clurgmgrd (which is part >>>>>>>>>>>>>>>>>>>>>>>>>> of >>>>>>>>>>>>>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>>>>>>>>>>>> rgmanager >>>>>>>>>>>>>>>>>>>>>>>>>> package). To me, this sounds like a configuration >>>>>>>>>>>>>>>>>>>>>>>>>> problem. Maybe, >>>>>>>>>>>>>>>>>>>>>>>>>> you can post your cluster.conf? 
>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> Regards, >>>>>>>>>>>>>>>>>>>>>>>>>> Volker >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>>>>>>>>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>>>>>>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>>>>>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>>>>>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>>>>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>>>>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>>>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>>>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>> -- >>>>>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>> -- >>>>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>> -- >>>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>> -- >>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> -- >>>>>>>>>> Linux-cluster mailing list >>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>> >>>>>>>>>> >>>>>>>>> -- >>>>>>>>> Linux-cluster mailing list 
--
- Daniela Anzellotti ------------------------------------
INFN Roma - tel.: +39.06.49914282 - fax: +39.06.490354
e-mail: daniela.anzellotti at roma1.infn.it
---------------------------------------------------------

From daniela.anzellotti at roma1.infn.it Tue Oct 6 17:59:48 2009
From: daniela.anzellotti at roma1.infn.it (Daniela Anzellotti)
Date: Tue, 06 Oct 2009 19:59:48 +0200
Subject: [Linux-cluster] openais issue
In-Reply-To: <8b711df40910060920q44d502a6j34e3ca6cfeb63128@mail.gmail.com>
References: <8b711df40909161052i41f9c007na85cf2ef2919ae0f@mail.gmail.com> <29ae894c0909291344l49a2a810t33582eb6c3932810@mail.gmail.com> <8b711df40909291354w55f92097wcdef691d0b239dee@mail.gmail.com> <4AC2986F.8050100@io-consulting.net> <8b711df40909300749i2af6a711v2f866c55a046a388@mail.gmail.com> <29ae894c0909300802oa8d72c5k892115a3e2f67db9@mail.gmail.com> <8b711df40909300811q1724aa68hfea589bbb32b4ce5@mail.gmail.com> <4AC9E445.3050702@roma1.infn.it> <8b711df40910060831n6b6c7a6aw27761c89403c51e1@mail.gmail.com> <4ACB66BA.8040002@roma1.infn.it> <8b711df40910060920q44d502a6j34e3ca6cfeb63128@mail.gmail.com>
Message-ID: <4ACB8594.6090004@roma1.infn.it>

Hi Paras,

did you reboot all the cluster nodes? I needed a complete reboot (actually I was angry enough to switch everything off and on again): a restart of the cluster suite was not enough.

As far as I understood, since the cluster was not able to bring a VM into a good running state, it decided that the VM had to migrate to another node... and in the end all the VMs were trying to migrate from one node to another, with the result that I had all VMs starting on all the cluster nodes. Restarting the cluster suite didn't kill a lot of processes that were stubbornly trying to migrate virtual machines...

Daniela

Paras pradhan wrote:
> Adding use_virsh=0 works great.
Now I do not have my vm starting at > all the nodes. This is a good fix. Thanks .. > > The only problem left is I do not see rgmanager running on my node1 > and the clustat of node2 and node3 is reporting vm as migrating > > o/p > > Service Name Owner (Last) > State > ------- ---- ----- ------ > ----- > vm:guest1 cvtst1 > migrating > [root at cvtst3 vms]# > > > Thanks > Paras. > > On Tue, Oct 6, 2009 at 10:48 AM, Daniela Anzellotti > wrote: >> Hi Paras, >> >> yes. At least it looks so... >> >> We have a cluster of two nodes + a quorum disk (it's not configured as a >> "two-node cluster") >> >> They are running Scientific Linux 5.x, kernel 2.6.18-128.7.1.el5xen and >> >> openais-0.80.6-8.el5.x86_64 >> cman-2.0.115-1.el5.x86_64 >> rgmanager-2.0.52-1.el5.x86_64 >> >> The XEN VMs access the disk as simple block devices. >> Disks are on a SAN, configured with Clustered LVM. >> >> xen-3.0.3-94.el5_4.1.x86_64 >> xen-libs-3.0.3-94.el5_4.1.x86_64 >> >> VM configuration files are as the following >> >> name = "www1" >> uuid = "3bd3e910-23c0-97ee-55ab-086260ef1e53" >> memory = 1024 >> maxmem = 1024 >> vcpus = 1 >> bootloader = "/usr/bin/pygrub" >> vfb = [ "type=vnc,vncunused=1,keymap=en-us" ] >> disk = [ "phy:/dev/vg_cluster/www1.disk,xvda,w", \ >> "phy:/dev/vg_cluster/www1.swap,xvdb,w" ] >> vif = [ "mac=00:16:3e:da:00:07,bridge=xenbr1" ] >> on_poweroff = "destroy" >> on_reboot = "restart" >> on_crash = "restart" >> extra = "xencons=tty0 console=tty0" >> >> >> I changed in /etc/cluster/cluster.conf all the VM directive from >> >> > migrate="live" name="www1" path="/etc/xen" recovery="restart"/> >> >> to >> >> > migrate="live" name="www1" path="/etc/xen" recovery="restart"/> >> >> >> Rebooted the cluster nodes and it started working again... >> >> As i said I hope I'll not have any other bad surprise (I tested a VM >> migration and it is working too), but at least cluster it's working now (it >> was not able to start a VM, before)! >> >> Ciao >> Daniela >> >> >> Paras pradhan wrote: >>> So you mean your cluster is running fine with the CMAN >>> cman-2.0.115-1.el5.x86_64 ? >>> >>> Which version of openais are you running? >>> >>> Thanks >>> Paras. >>> >>> >>> On Mon, Oct 5, 2009 at 7:19 AM, Daniela Anzellotti >>> wrote: >>>> Hi all, >>>> >>>> I had a problem similar to Paras's one today: yum updated the following >>>> rpms >>>> last week and today (I had to restart the cluster) the cluster was not >>>> able >>>> to start vm: services. >>>> >>>> Oct 02 05:31:05 Updated: openais-0.80.6-8.el5.x86_64 >>>> Oct 02 05:31:07 Updated: cman-2.0.115-1.el5.x86_64 >>>> Oct 02 05:31:10 Updated: rgmanager-2.0.52-1.el5.x86_64 >>>> >>>> Oct 03 04:03:12 Updated: xen-libs-3.0.3-94.el5_4.1.x86_64 >>>> Oct 03 04:03:12 Updated: xen-libs-3.0.3-94.el5_4.1.i386 >>>> Oct 03 04:03:16 Updated: xen-3.0.3-94.el5_4.1.x86_64 >>>> >>>> >>>> So, after checked the vm.sh script, I add the declaration use_virsh="0" >>>> in >>>> the VM definition in the cluster.conf (as suggested by Brem, thanks!) and >>>> everything is now working again. >>>> >>>> >>>> BTW I didn't understand if the problem was caused by the new XEN version >>>> or >>>> the new openais one, thus I disabled automatic updates for both. >>>> >>>> I hope I'll not have any other bad surprise... >>>> >>>> Thank you, >>>> cheers, >>>> Daniela >>>> >>>> >>>> Paras pradhan wrote: >>>>> Yes this is very strange. I don't know what to do now. May be re >>>>> create the cluster? But not a good solution actually. 
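For anyone hitting the same problem: a complete vm line with virsh disabled looks roughly like the sketch below. It is only a sketch -- the attribute values (name, path, migrate, recovery) are the ones quoted above, use_virsh="0" is the addition being discussed, and any other attributes you already have in your cluster.conf stay as they are.

    <vm use_virsh="0" migrate="live" name="www1" path="/etc/xen" recovery="restart"/>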
>>>>> >>>>> Packages : >>>>> >>>>> Kernel: kernel-xen-2.6.18-164.el5 >>>>> OS: Full updated of CentOS 5.3 except CMAN downgraded to >>>>> cman-2.0.98-1.el5 >>>>> >>>>> Other packages related to cluster suite: >>>>> >>>>> rgmanager-2.0.52-1.el5.centos >>>>> cman-2.0.98-1.el5 >>>>> xen-3.0.3-80.el5_3.3 >>>>> xen-libs-3.0.3-80.el5_3.3 >>>>> kmod-gfs-xen-0.1.31-3.el5_3.1 >>>>> kmod-gfs-xen-0.1.31-3.el5_3.1 >>>>> kmod-gfs-0.1.31-3.el5_3.1 >>>>> gfs-utils-0.1.18-1.el5 >>>>> gfs2-utils-0.1.62-1.el5 >>>>> lvm2-2.02.40-6.el5 >>>>> lvm2-cluster-2.02.40-7.el5 >>>>> openais-0.80.3-22.el5_3.9 >>>>> >>>>> Thanks! >>>>> Paras. >>>>> >>>>> >>>>> >>>>> >>>>> On Wed, Sep 30, 2009 at 10:02 AM, brem belguebli >>>>> wrote: >>>>>> Hi Paras, >>>>>> >>>>>> Your cluster.conf file seems correct. If it is not a ntp issue, I >>>>>> don't see anything except a bug that causes this, or some prerequisite >>>>>> that is not respected. >>>>>> >>>>>> May be you could post the versions (os, kernel, packages etc...) you >>>>>> are using, someone may have hit the same issue with your versions. >>>>>> >>>>>> Brem >>>>>> >>>>>> 2009/9/30, Paras pradhan : >>>>>>> All of the nodes are synced with ntp server. So this is not the case >>>>>>> with me. >>>>>>> >>>>>>> Thanks >>>>>>> Paras. >>>>>>> >>>>>>> On Tue, Sep 29, 2009 at 6:29 PM, Johannes Ru?ek >>>>>>> wrote: >>>>>>>> make sure the time on the nodes is in sync, apparently when a node >>>>>>>> has >>>>>>>> too >>>>>>>> much offset, you won't see rgmanager (even though the process is >>>>>>>> running). >>>>>>>> this happened today and setting the time fixed it for me. afaicr >>>>>>>> there >>>>>>>> was >>>>>>>> no sign of this in the logs though. >>>>>>>> johannes >>>>>>>> >>>>>>>> Paras pradhan schrieb: >>>>>>>>> I don't see rgmanager . >>>>>>>>> >>>>>>>>> Here is the o/p from clustat >>>>>>>>> >>>>>>>>> [root at cvtst1 cluster]# clustat >>>>>>>>> Cluster Status for test @ Tue Sep 29 15:53:33 2009 >>>>>>>>> Member Status: Quorate >>>>>>>>> >>>>>>>>> Member Name ID >>>>>>>>> Status >>>>>>>>> ------ ---- >>>>>>>>> ---- >>>>>>>>> ------ >>>>>>>>> cvtst2 1 Online >>>>>>>>> cvtst1 2 >>>>>>>>> Online, >>>>>>>>> Local >>>>>>>>> cvtst3 3 Online >>>>>>>>> >>>>>>>>> >>>>>>>>> Thanks >>>>>>>>> Paras. >>>>>>>>> >>>>>>>>> On Tue, Sep 29, 2009 at 3:44 PM, brem belguebli >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> It looks correct, rgmanager seems to start on all nodes >>>>>>>>>> >>>>>>>>>> what gives you clustat ? >>>>>>>>>> >>>>>>>>>> If rgmanager doesn't show, check out the logs something may have >>>>>>>>>> gone >>>>>>>>>> wrong. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> 2009/9/29 Paras pradhan : >>>>>>>>>> >>>>>>>>>>> Change to 7 and i got this log >>>>>>>>>>> >>>>>>>>>>> Sep 29 15:33:50 cvtst1 rgmanager: [23295]: Shutting down >>>>>>>>>>> Cluster Service Manager... >>>>>>>>>>> Sep 29 15:33:50 cvtst1 clurgmgrd[22869]: Shutting down >>>>>>>>>>> Sep 29 15:33:50 cvtst1 clurgmgrd[22869]: Shutting down >>>>>>>>>>> Sep 29 15:33:50 cvtst1 clurgmgrd[22869]: Shutdown >>>>>>>>>>> complete, >>>>>>>>>>> exiting >>>>>>>>>>> Sep 29 15:33:50 cvtst1 rgmanager: [23295]: Cluster >>>>>>>>>>> Service >>>>>>>>>>> Manager is stopped. 
>>>>>>>>>>> Sep 29 15:33:51 cvtst1 clurgmgrd[23324]: Resource Group >>>>>>>>>>> Manager Starting >>>>>>>>>>> Sep 29 15:33:51 cvtst1 clurgmgrd[23324]: Loading Service >>>>>>>>>>> Data >>>>>>>>>>> Sep 29 15:33:51 cvtst1 clurgmgrd[23324]: Loading Resource >>>>>>>>>>> Rules >>>>>>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: 21 rules loaded >>>>>>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: Building Resource >>>>>>>>>>> Trees >>>>>>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: 0 resources >>>>>>>>>>> defined >>>>>>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: Loading Failover >>>>>>>>>>> Domains >>>>>>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: 1 domains defined >>>>>>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: 1 events defined >>>>>>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: Initializing >>>>>>>>>>> Services >>>>>>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: Services >>>>>>>>>>> Initialized >>>>>>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: Event: Port >>>>>>>>>>> Opened >>>>>>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: State change: >>>>>>>>>>> Local >>>>>>>>>>> UP >>>>>>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: State change: >>>>>>>>>>> cvtst2 >>>>>>>>>>> UP >>>>>>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: State change: >>>>>>>>>>> cvtst3 >>>>>>>>>>> UP >>>>>>>>>>> Sep 29 15:33:57 cvtst1 clurgmgrd[23324]: Event (1:2:1) >>>>>>>>>>> Processed >>>>>>>>>>> Sep 29 15:33:57 cvtst1 clurgmgrd[23324]: Event (0:1:1) >>>>>>>>>>> Processed >>>>>>>>>>> Sep 29 15:33:57 cvtst1 clurgmgrd[23324]: Event (0:3:1) >>>>>>>>>>> Processed >>>>>>>>>>> Sep 29 15:34:02 cvtst1 clurgmgrd[23324]: 3 events >>>>>>>>>>> processed >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> Anything unusual here? >>>>>>>>>>> >>>>>>>>>>> Paras. >>>>>>>>>>> >>>>>>>>>>> On Tue, Sep 29, 2009 at 11:51 AM, brem belguebli >>>>>>>>>>> wrote: >>>>>>>>>>> >>>>>>>>>>>> I use log_level=7 to have more debugging info. >>>>>>>>>>>> >>>>>>>>>>>> It seems 4 is not enough. >>>>>>>>>>>> >>>>>>>>>>>> Brem >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> 2009/9/29, Paras pradhan : >>>>>>>>>>>> >>>>>>>>>>>>> Withe log_level of 3 I got only this >>>>>>>>>>>>> >>>>>>>>>>>>> Sep 29 10:31:31 cvtst1 rgmanager: [7170]: Shutting down >>>>>>>>>>>>> Cluster Service Manager... >>>>>>>>>>>>> Sep 29 10:31:31 cvtst1 clurgmgrd[6673]: Shutting down >>>>>>>>>>>>> Sep 29 10:31:41 cvtst1 clurgmgrd[6673]: Shutdown >>>>>>>>>>>>> complete, >>>>>>>>>>>>> exiting >>>>>>>>>>>>> Sep 29 10:31:41 cvtst1 rgmanager: [7170]: Cluster >>>>>>>>>>>>> Service >>>>>>>>>>>>> Manager is stopped. >>>>>>>>>>>>> Sep 29 10:31:42 cvtst1 clurgmgrd[7224]: Resource Group >>>>>>>>>>>>> Manager Starting >>>>>>>>>>>>> Sep 29 10:39:06 cvtst1 rgmanager: [10327]: Shutting >>>>>>>>>>>>> down >>>>>>>>>>>>> Cluster Service Manager... >>>>>>>>>>>>> Sep 29 10:39:16 cvtst1 rgmanager: [10327]: Cluster >>>>>>>>>>>>> Service >>>>>>>>>>>>> Manager is stopped. >>>>>>>>>>>>> Sep 29 10:39:16 cvtst1 clurgmgrd[10380]: Resource Group >>>>>>>>>>>>> Manager Starting >>>>>>>>>>>>> Sep 29 10:39:52 cvtst1 clurgmgrd[10380]: Member 1 >>>>>>>>>>>>> shutting >>>>>>>>>>>>> down >>>>>>>>>>>>> >>>>>>>>>>>>> I do not know what the last line means. >>>>>>>>>>>>> >>>>>>>>>>>>> rgmanager version I am running is: >>>>>>>>>>>>> rgmanager-2.0.52-1.el5.centos >>>>>>>>>>>>> >>>>>>>>>>>>> I don't what has gone wrong. >>>>>>>>>>>>> >>>>>>>>>>>>> Thanks >>>>>>>>>>>>> Paras. 
>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> On Mon, Sep 28, 2009 at 6:41 PM, brem belguebli >>>>>>>>>>>>> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> you mean it stopped successfully on all the nodes but it is >>>>>>>>>>>>>> failing >>>>>>>>>>>>>> to >>>>>>>>>>>>>> start only on node cvtst1 ? >>>>>>>>>>>>>> >>>>>>>>>>>>>> look at the following page to make rgmanager more verbose. It >>>>>>>>>>>>>> 'll >>>>>>>>>>>>>> help debug.... >>>>>>>>>>>>>> >>>>>>>>>>>>>> http://sources.redhat.com/cluster/wiki/RGManager >>>>>>>>>>>>>> >>>>>>>>>>>>>> at Logging Configuration section >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> 2009/9/29 Paras pradhan : >>>>>>>>>>>>>> >>>>>>>>>>>>>>> Brem, >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> When I try to restart rgmanager on all the nodes, this time i >>>>>>>>>>>>>>> do >>>>>>>>>>>>>>> not >>>>>>>>>>>>>>> see rgmanager running on the first node. But I do see on other >>>>>>>>>>>>>>> 2 >>>>>>>>>>>>>>> nodes. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Log on the first node: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Sep 28 18:13:58 cvtst1 clurgmgrd[24099]: Resource >>>>>>>>>>>>>>> Group >>>>>>>>>>>>>>> Manager Starting >>>>>>>>>>>>>>> Sep 28 18:17:29 cvtst1 rgmanager: [24627]: Shutting >>>>>>>>>>>>>>> down >>>>>>>>>>>>>>> Cluster Service Manager... >>>>>>>>>>>>>>> Sep 28 18:17:29 cvtst1 clurgmgrd[24099]: Shutting >>>>>>>>>>>>>>> down >>>>>>>>>>>>>>> Sep 28 18:17:39 cvtst1 clurgmgrd[24099]: Shutdown >>>>>>>>>>>>>>> complete, >>>>>>>>>>>>>>> exiting >>>>>>>>>>>>>>> Sep 28 18:17:39 cvtst1 rgmanager: [24627]: Cluster >>>>>>>>>>>>>>> Service >>>>>>>>>>>>>>> Manager is stopped. >>>>>>>>>>>>>>> Sep 28 18:17:40 cvtst1 clurgmgrd[24679]: Resource >>>>>>>>>>>>>>> Group >>>>>>>>>>>>>>> Manager Starting >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> - >>>>>>>>>>>>>>> It seems service is running , but I do not see rgmanger >>>>>>>>>>>>>>> running >>>>>>>>>>>>>>> using clustat >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Don't know what is going on. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Thanks >>>>>>>>>>>>>>> Paras. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Mon, Sep 28, 2009 at 5:46 PM, brem belguebli >>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Paras, >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Another thing, it would have been more interesting to have a >>>>>>>>>>>>>>>> start >>>>>>>>>>>>>>>> DEBUG not a stop. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> That's why I was asking you to first stop the vm manually on >>>>>>>>>>>>>>>> all >>>>>>>>>>>>>>>> your >>>>>>>>>>>>>>>> nodes, stop eventually rgmanager on all the nodes to reset >>>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>> potential wrong states you may have, restart rgmanager. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> If your VM is configured to autostart, this will make it >>>>>>>>>>>>>>>> start. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> It should normally fail (as it does now). Send out your newly >>>>>>>>>>>>>>>> created >>>>>>>>>>>>>>>> DEBUG file. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> 2009/9/29 brem belguebli : >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Hi Paras, >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> I don't know the xen/cluster combination well, but if I do >>>>>>>>>>>>>>>>> remember >>>>>>>>>>>>>>>>> well, I think I've read somewhere that when using xen you >>>>>>>>>>>>>>>>> have >>>>>>>>>>>>>>>>> to >>>>>>>>>>>>>>>>> declare the use_virsh=0 key in the VM definition in the >>>>>>>>>>>>>>>>> cluster.conf. 
>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> This would make rgmanager use xm commands instead of virsh >>>>>>>>>>>>>>>>> The DEBUG output shows clearly that you are using virsh to >>>>>>>>>>>>>>>>> manage >>>>>>>>>>>>>>>>> your >>>>>>>>>>>>>>>>> VM instead of xm commands. >>>>>>>>>>>>>>>>> Check out the RH docs about virtualization >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> I'm not a 100% sure about that, I may be completely wrong. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Brem >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> 2009/9/28 Paras pradhan : >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> The only thing I noticed is the message after stopping the >>>>>>>>>>>>>>>>>> vm >>>>>>>>>>>>>>>>>> using xm >>>>>>>>>>>>>>>>>> in all nodes and starting using clusvcadm is >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> "Virtual machine guest1 is blocked" >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> The whole DEBUG file is attached. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Thanks >>>>>>>>>>>>>>>>>> Paras. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> On Fri, Sep 25, 2009 at 5:53 PM, brem belguebli >>>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> There's a problem with the script that is called by >>>>>>>>>>>>>>>>>>> rgmanager to >>>>>>>>>>>>>>>>>>> start >>>>>>>>>>>>>>>>>>> the VM, I don't know what causes it >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> May be you should try something like : >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> 1) stop the VM on all nodes with xm commands >>>>>>>>>>>>>>>>>>> 2) edit the /usr/share/cluster/vm.sh script and add the >>>>>>>>>>>>>>>>>>> following >>>>>>>>>>>>>>>>>>> lines (after the #!/bin/bash ): >>>>>>>>>>>>>>>>>>> exec >/tmp/DEBUG 2>&1 >>>>>>>>>>>>>>>>>>> set -x >>>>>>>>>>>>>>>>>>> 3) start the VM with clusvcadm -e vm:guest1 >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> It should fail as it did before. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> edit the the /tmp/DEBUG file and you will be able to see >>>>>>>>>>>>>>>>>>> where >>>>>>>>>>>>>>>>>>> it >>>>>>>>>>>>>>>>>>> fails (it may generate a lot of debug) >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> 4) remove the debug lines from /usr/share/cluster/vm.sh >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Post the DEBUG file if you're not able to see where it >>>>>>>>>>>>>>>>>>> fails. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Brem >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> 2009/9/26 Paras pradhan : >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> No I am not manually starting not using automatic init >>>>>>>>>>>>>>>>>>>> scripts. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> I started the vm using: clusvcadm -e vm:guest1 >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> I have just stopped using clusvcadm -s vm:guest1. For few >>>>>>>>>>>>>>>>>>>> seconds it >>>>>>>>>>>>>>>>>>>> says guest1 started . But after a while I can see the >>>>>>>>>>>>>>>>>>>> guest1 on >>>>>>>>>>>>>>>>>>>> all >>>>>>>>>>>>>>>>>>>> three nodes. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> clustat says: >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Service Name >>>>>>>>>>>>>>>>>>>> Owner >>>>>>>>>>>>>>>>>>>> (Last) >>>>>>>>>>>>>>>>>>>> State >>>>>>>>>>>>>>>>>>>> ------- ---- >>>>>>>>>>>>>>>>>>>> ----- >>>>>>>>>>>>>>>>>>>> ------ >>>>>>>>>>>>>>>>>>>> ----- >>>>>>>>>>>>>>>>>>>> vm:guest1 >>>>>>>>>>>>>>>>>>>> (none) >>>>>>>>>>>>>>>>>>>> stopped >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> But I can see the vm from xm li. 
>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> This is what I can see from the log: >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Sep 25 17:19:01 cvtst1 clurgmgrd[4298]: start on >>>>>>>>>>>>>>>>>>>> vm >>>>>>>>>>>>>>>>>>>> "guest1" >>>>>>>>>>>>>>>>>>>> returned 1 (generic error) >>>>>>>>>>>>>>>>>>>> Sep 25 17:19:01 cvtst1 clurgmgrd[4298]: #68: >>>>>>>>>>>>>>>>>>>> Failed >>>>>>>>>>>>>>>>>>>> to start >>>>>>>>>>>>>>>>>>>> vm:guest1; return value: 1 >>>>>>>>>>>>>>>>>>>> Sep 25 17:19:01 cvtst1 clurgmgrd[4298]: Stopping >>>>>>>>>>>>>>>>>>>> service vm:guest1 >>>>>>>>>>>>>>>>>>>> Sep 25 17:19:02 cvtst1 clurgmgrd[4298]: Service >>>>>>>>>>>>>>>>>>>> vm:guest1 is >>>>>>>>>>>>>>>>>>>> recovering >>>>>>>>>>>>>>>>>>>> Sep 25 17:19:15 cvtst1 clurgmgrd[4298]: >>>>>>>>>>>>>>>>>>>> Recovering >>>>>>>>>>>>>>>>>>>> failed >>>>>>>>>>>>>>>>>>>> service vm:guest1 >>>>>>>>>>>>>>>>>>>> Sep 25 17:19:16 cvtst1 clurgmgrd[4298]: start on >>>>>>>>>>>>>>>>>>>> vm >>>>>>>>>>>>>>>>>>>> "guest1" >>>>>>>>>>>>>>>>>>>> returned 1 (generic error) >>>>>>>>>>>>>>>>>>>> Sep 25 17:19:16 cvtst1 clurgmgrd[4298]: #68: >>>>>>>>>>>>>>>>>>>> Failed >>>>>>>>>>>>>>>>>>>> to start >>>>>>>>>>>>>>>>>>>> vm:guest1; return value: 1 >>>>>>>>>>>>>>>>>>>> Sep 25 17:19:16 cvtst1 clurgmgrd[4298]: Stopping >>>>>>>>>>>>>>>>>>>> service vm:guest1 >>>>>>>>>>>>>>>>>>>> Sep 25 17:19:17 cvtst1 clurgmgrd[4298]: Service >>>>>>>>>>>>>>>>>>>> vm:guest1 is >>>>>>>>>>>>>>>>>>>> recovering >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Paras. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> On Fri, Sep 25, 2009 at 5:07 PM, brem belguebli >>>>>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> Have you started your VM via rgmanager (clusvcadm -e >>>>>>>>>>>>>>>>>>>>> vm:guest1) or >>>>>>>>>>>>>>>>>>>>> using xm commands out of cluster control (or maybe a >>>>>>>>>>>>>>>>>>>>> thru >>>>>>>>>>>>>>>>>>>>> an >>>>>>>>>>>>>>>>>>>>> automatic init script ?) >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> When clustered, you should never be starting services >>>>>>>>>>>>>>>>>>>>> (manually or >>>>>>>>>>>>>>>>>>>>> thru automatic init script) out of cluster control >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> The thing would be to stop your vm on all the nodes with >>>>>>>>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>>>>>>> adequate >>>>>>>>>>>>>>>>>>>>> xm command (not using xen myself) and try to start it >>>>>>>>>>>>>>>>>>>>> with >>>>>>>>>>>>>>>>>>>>> clusvcadm. >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> Then see if it is started on all nodes (send clustat >>>>>>>>>>>>>>>>>>>>> output) >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> 2009/9/25 Paras pradhan : >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> Ok. Please see below. my vm is running on all nodes >>>>>>>>>>>>>>>>>>>>>> though >>>>>>>>>>>>>>>>>>>>>> clustat >>>>>>>>>>>>>>>>>>>>>> says it is stopped. 
>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>>>>> [root at cvtst1 ~]# clustat >>>>>>>>>>>>>>>>>>>>>> Cluster Status for test @ Fri Sep 25 16:52:34 2009 >>>>>>>>>>>>>>>>>>>>>> Member Status: Quorate >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> Member Name >>>>>>>>>>>>>>>>>>>>>> ID Status >>>>>>>>>>>>>>>>>>>>>> ------ ---- >>>>>>>>>>>>>>>>>>>>>> ---- ------ >>>>>>>>>>>>>>>>>>>>>> cvtst2 >>>>>>>>>>>>>>>>>>>>>> 1 >>>>>>>>>>>>>>>>>>>>>> Online, rgmanager >>>>>>>>>>>>>>>>>>>>>> cvtst1 >>>>>>>>>>>>>>>>>>>>>> 2 >>>>>>>>>>>>>>>>>>>>>> Online, >>>>>>>>>>>>>>>>>>>>>> Local, rgmanager >>>>>>>>>>>>>>>>>>>>>> cvtst3 >>>>>>>>>>>>>>>>>>>>>> 3 >>>>>>>>>>>>>>>>>>>>>> Online, rgmanager >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> Service Name >>>>>>>>>>>>>>>>>>>>>> Owner (Last) >>>>>>>>>>>>>>>>>>>>>> State >>>>>>>>>>>>>>>>>>>>>> ------- ---- >>>>>>>>>>>>>>>>>>>>>> ----- ------ >>>>>>>>>>>>>>>>>>>>>> ----- >>>>>>>>>>>>>>>>>>>>>> vm:guest1 >>>>>>>>>>>>>>>>>>>>>> (none) >>>>>>>>>>>>>>>>>>>>>> stopped >>>>>>>>>>>>>>>>>>>>>> [root at cvtst1 ~]# >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> --- >>>>>>>>>>>>>>>>>>>>>> o/p of xm li on cvtst1 >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>>>>> [root at cvtst1 ~]# xm li >>>>>>>>>>>>>>>>>>>>>> Name ID Mem(MiB) >>>>>>>>>>>>>>>>>>>>>> VCPUs >>>>>>>>>>>>>>>>>>>>>> State Time(s) >>>>>>>>>>>>>>>>>>>>>> Domain-0 0 3470 >>>>>>>>>>>>>>>>>>>>>> 2 >>>>>>>>>>>>>>>>>>>>>> r----- 28939.4 >>>>>>>>>>>>>>>>>>>>>> guest1 7 511 >>>>>>>>>>>>>>>>>>>>>> 1 >>>>>>>>>>>>>>>>>>>>>> -b---- 7727.8 >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> o/p of xm li on cvtst2 >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>>>>> [root at cvtst2 ~]# xm li >>>>>>>>>>>>>>>>>>>>>> Name ID Mem(MiB) >>>>>>>>>>>>>>>>>>>>>> VCPUs >>>>>>>>>>>>>>>>>>>>>> State Time(s) >>>>>>>>>>>>>>>>>>>>>> Domain-0 0 3470 >>>>>>>>>>>>>>>>>>>>>> 2 >>>>>>>>>>>>>>>>>>>>>> r----- 31558.9 >>>>>>>>>>>>>>>>>>>>>> guest1 21 511 >>>>>>>>>>>>>>>>>>>>>> 1 >>>>>>>>>>>>>>>>>>>>>> -b---- 7558.2 >>>>>>>>>>>>>>>>>>>>>> --- >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> Thanks >>>>>>>>>>>>>>>>>>>>>> Paras. >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> On Fri, Sep 25, 2009 at 4:22 PM, brem belguebli >>>>>>>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> It looks like no. >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> can you send an output of clustat of when the VM is >>>>>>>>>>>>>>>>>>>>>>> running >>>>>>>>>>>>>>>>>>>>>>> on >>>>>>>>>>>>>>>>>>>>>>> multiple nodes at the same time? >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> And by the way, another one after having stopped >>>>>>>>>>>>>>>>>>>>>>> (clusvcadm >>>>>>>>>>>>>>>>>>>>>>> -s vm:guest1) ? >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> 2009/9/25 Paras pradhan : >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> Anyone having issue as mine? Virtual machine service >>>>>>>>>>>>>>>>>>>>>>>> is >>>>>>>>>>>>>>>>>>>>>>>> not >>>>>>>>>>>>>>>>>>>>>>>> being >>>>>>>>>>>>>>>>>>>>>>>> properly handled by the cluster. >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> Thanks >>>>>>>>>>>>>>>>>>>>>>>> Paras. >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> On Mon, Sep 21, 2009 at 9:55 AM, Paras pradhan >>>>>>>>>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> Ok.. 
here is my cluster.conf file >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>>>>>>>> [root at cvtst1 cluster]# more cluster.conf >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> name="test"> >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> post_fail_delay="0" >>>>>>>>>>>>>>>>>>>>>>>>> post_join_delay="3"/> >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> votes="1"> >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> votes="1"> >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> votes="1"> >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> nofailback="0" ordered="1" restricted="0"> >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> name="cvtst2" priority="3"/> >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> name="cvtst1" priority="1"/> >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> name="cvtst3" priority="2"/> >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> exclusive="0" max_restarts="0" >>>>>>>>>>>>>>>>>>>>>>>>> name="guest1" path="/vms" recovery="r >>>>>>>>>>>>>>>>>>>>>>>>> estart" restart_expire_time="0"/> >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> [root at cvtst1 cluster]# >>>>>>>>>>>>>>>>>>>>>>>>> ------ >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> Thanks! >>>>>>>>>>>>>>>>>>>>>>>>> Paras. >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> On Sun, Sep 20, 2009 at 9:44 AM, Volker Dormeyer >>>>>>>>>>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> On Fri, Sep 18, 2009 at 05:08:57PM -0500, >>>>>>>>>>>>>>>>>>>>>>>>>> Paras pradhan wrote: >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>> I am using cluster suite for HA of xen virtual >>>>>>>>>>>>>>>>>>>>>>>>>>> machines. >>>>>>>>>>>>>>>>>>>>>>>>>>> Now I am >>>>>>>>>>>>>>>>>>>>>>>>>>> having another problem. When I start the my xen vm >>>>>>>>>>>>>>>>>>>>>>>>>>> in >>>>>>>>>>>>>>>>>>>>>>>>>>> one node, it >>>>>>>>>>>>>>>>>>>>>>>>>>> also starts on other nodes. Which daemon controls >>>>>>>>>>>>>>>>>>>>>>>>>>> this? >>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> This is usually done bei clurgmgrd (which is part >>>>>>>>>>>>>>>>>>>>>>>>>> of >>>>>>>>>>>>>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>>>>>>>>>>>> rgmanager >>>>>>>>>>>>>>>>>>>>>>>>>> package). To me, this sounds like a configuration >>>>>>>>>>>>>>>>>>>>>>>>>> problem. Maybe, >>>>>>>>>>>>>>>>>>>>>>>>>> you can post your cluster.conf? 
>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> Regards, >>>>>>>>>>>>>>>>>>>>>>>>>> Volker
--
- Daniela Anzellotti ------------------------------------
INFN Roma - tel.: +39.06.49914282 - fax: +39.06.490354
e-mail: daniela.anzellotti at roma1.infn.it
---------------------------------------------------------

From Eric.Johnson at mtsallstream.com Tue Oct 6 18:21:34 2009
From: Eric.Johnson at mtsallstream.com (Johnson, Eric)
Date: Tue, 6 Oct 2009 13:21:34 -0500
Subject: [Linux-cluster] Cluster online reconfiguration
Message-ID:

Is there documentation that states how rgmanager handles online reconfiguration of cluster resources when issued a 'ccs_tool update /etc/cluster/cluster.conf'? I have some ext3 file systems under cluster control, and I want to change the force_fsck option from 1 to 0, but I'm not sure if this will trigger an unmount/remount of the file systems, which I definitely don't want happening.

RHEL 5.4 32-bit
Kernel 2.6.18-164.el5PAE
rgmanager-2.0.52-1.el5
cman-2.0.115-1.el5_4.2
openais-0.80.6-8.el5

Thanks,
Eric

From pradhanparas at gmail.com Tue Oct 6 19:10:19 2009
From: pradhanparas at gmail.com (Paras pradhan)
Date: Tue, 6 Oct 2009 14:10:19 -0500
Subject: [Linux-cluster] openais issue
In-Reply-To: <4ACB8594.6090004@roma1.infn.it>
References: <8b711df40909161052i41f9c007na85cf2ef2919ae0f@mail.gmail.com> <4AC2986F.8050100@io-consulting.net> <8b711df40909300749i2af6a711v2f866c55a046a388@mail.gmail.com> <29ae894c0909300802oa8d72c5k892115a3e2f67db9@mail.gmail.com> <8b711df40909300811q1724aa68hfea589bbb32b4ce5@mail.gmail.com> <4AC9E445.3050702@roma1.infn.it> <8b711df40910060831n6b6c7a6aw27761c89403c51e1@mail.gmail.com> <4ACB66BA.8040002@roma1.infn.it> <8b711df40910060920q44d502a6j34e3ca6cfeb63128@mail.gmail.com> <4ACB8594.6090004@roma1.infn.it>
Message-ID: <8b711df40910061210k318f5c70n2ec8c8b880cbcaaa@mail.gmail.com>

Yes I did that as well.
Node1 of my cluster doesn't show (using clust) vm service whereas others do. I guess the problem is with the rgmanager but it looks to be running fine. Paras. On Tue, Oct 6, 2009 at 12:59 PM, Daniela Anzellotti wrote: > Hi Paras, > > did you reboot all the cluster nodes? I needed a complete reboot (actually I > was angry enough to switch everything off and on again): a restart of the > cluster suite was not enough. > > As far as I understood, since the cluster was not able to bring a VM in a > good running state, it decided that the VM has to migrate to another node... > and at the end all the VMs was trying to migrate from one node to another > with the result that I had all VMs starting on all the cluster nodes. > Restarting the cluster suite didn't kill a lot of processes that was > stubbornly trying to migrate virtual machines... > > Daniela > > Paras pradhan wrote: >> >> Adding use_virsh=0 works great. Now I do not have my vm starting at >> all the nodes. This is a good fix. Thanks .. >> >> The only problem left is I do not see rgmanager running on my node1 >> and the clustat of node2 and node3 is reporting vm as migrating >> >> o/p >> >> Service Name ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?Owner (Last) >> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? State >> ?------- ---- ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?----- ------ >> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?----- >> ?vm:guest1 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? cvtst1 >> ? ? ? ? ? ? ? ? ? ? ? ? ? ? migrating >> [root at cvtst3 vms]# >> >> >> Thanks >> Paras. >> >> On Tue, Oct 6, 2009 at 10:48 AM, Daniela Anzellotti >> wrote: >>> >>> Hi Paras, >>> >>> yes. At least it looks so... >>> >>> We have a cluster of two nodes + a quorum disk (it's not configured as a >>> "two-node cluster") >>> >>> They are running Scientific Linux 5.x, kernel 2.6.18-128.7.1.el5xen and >>> >>> ?openais-0.80.6-8.el5.x86_64 >>> ?cman-2.0.115-1.el5.x86_64 >>> ?rgmanager-2.0.52-1.el5.x86_64 >>> >>> The XEN VMs access the disk as simple block devices. >>> Disks are on a SAN, configured with Clustered LVM. >>> >>> ?xen-3.0.3-94.el5_4.1.x86_64 >>> ?xen-libs-3.0.3-94.el5_4.1.x86_64 >>> >>> VM configuration files are as the following >>> >>> ?name = "www1" >>> ?uuid = "3bd3e910-23c0-97ee-55ab-086260ef1e53" >>> ?memory = 1024 >>> ?maxmem = 1024 >>> ?vcpus = 1 >>> ?bootloader = "/usr/bin/pygrub" >>> ?vfb = [ "type=vnc,vncunused=1,keymap=en-us" ] >>> ?disk = [ "phy:/dev/vg_cluster/www1.disk,xvda,w", \ >>> ?"phy:/dev/vg_cluster/www1.swap,xvdb,w" ] >>> ?vif = [ "mac=00:16:3e:da:00:07,bridge=xenbr1" ] >>> ?on_poweroff = "destroy" >>> ?on_reboot = "restart" >>> ?on_crash = "restart" >>> ?extra = "xencons=tty0 console=tty0" >>> >>> >>> I changed in /etc/cluster/cluster.conf all the VM directive from >>> >>> ?>> ?migrate="live" name="www1" path="/etc/xen" recovery="restart"/> >>> >>> to >>> >>> ?>> ?migrate="live" name="www1" path="/etc/xen" recovery="restart"/> >>> >>> >>> Rebooted the cluster nodes and it started working again... >>> >>> As i said I hope I'll not have any other bad surprise (I tested a VM >>> migration and it is working too), but at least cluster it's working now >>> (it >>> was not able to start a VM, before)! >>> >>> Ciao >>> Daniela >>> >>> >>> Paras pradhan wrote: >>>> >>>> So you mean your cluster is running fine with the CMAN >>>> cman-2.0.115-1.el5.x86_64 ? >>>> >>>> Which version of openais are you running? >>>> >>>> Thanks >>>> Paras. 
>>>> >>>> >>>> On Mon, Oct 5, 2009 at 7:19 AM, Daniela Anzellotti >>>> wrote: >>>>> >>>>> Hi all, >>>>> >>>>> I had a problem similar to Paras's one today: yum updated the following >>>>> rpms >>>>> last week and today (I had to restart the cluster) the cluster was not >>>>> able >>>>> to start vm: services. >>>>> >>>>> Oct 02 05:31:05 Updated: openais-0.80.6-8.el5.x86_64 >>>>> Oct 02 05:31:07 Updated: cman-2.0.115-1.el5.x86_64 >>>>> Oct 02 05:31:10 Updated: rgmanager-2.0.52-1.el5.x86_64 >>>>> >>>>> Oct 03 04:03:12 Updated: xen-libs-3.0.3-94.el5_4.1.x86_64 >>>>> Oct 03 04:03:12 Updated: xen-libs-3.0.3-94.el5_4.1.i386 >>>>> Oct 03 04:03:16 Updated: xen-3.0.3-94.el5_4.1.x86_64 >>>>> >>>>> >>>>> So, after checked the vm.sh script, I add the declaration use_virsh="0" >>>>> in >>>>> the VM definition in the cluster.conf (as suggested by Brem, thanks!) >>>>> and >>>>> everything is now working again. >>>>> >>>>> >>>>> BTW I didn't understand if the problem was caused by the new XEN >>>>> version >>>>> or >>>>> the new openais one, thus I disabled automatic updates for both. >>>>> >>>>> I hope I'll not have any other bad surprise... >>>>> >>>>> Thank you, >>>>> cheers, >>>>> Daniela >>>>> >>>>> >>>>> Paras pradhan wrote: >>>>>> >>>>>> Yes this is very strange. I don't know what to do now. May be re >>>>>> create the cluster? But not a good solution actually. >>>>>> >>>>>> Packages : >>>>>> >>>>>> Kernel: kernel-xen-2.6.18-164.el5 >>>>>> OS: Full updated of CentOS 5.3 except CMAN downgraded to >>>>>> cman-2.0.98-1.el5 >>>>>> >>>>>> Other packages related to cluster suite: >>>>>> >>>>>> rgmanager-2.0.52-1.el5.centos >>>>>> cman-2.0.98-1.el5 >>>>>> xen-3.0.3-80.el5_3.3 >>>>>> xen-libs-3.0.3-80.el5_3.3 >>>>>> kmod-gfs-xen-0.1.31-3.el5_3.1 >>>>>> kmod-gfs-xen-0.1.31-3.el5_3.1 >>>>>> kmod-gfs-0.1.31-3.el5_3.1 >>>>>> gfs-utils-0.1.18-1.el5 >>>>>> gfs2-utils-0.1.62-1.el5 >>>>>> lvm2-2.02.40-6.el5 >>>>>> lvm2-cluster-2.02.40-7.el5 >>>>>> openais-0.80.3-22.el5_3.9 >>>>>> >>>>>> Thanks! >>>>>> Paras. >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> On Wed, Sep 30, 2009 at 10:02 AM, brem belguebli >>>>>> wrote: >>>>>>> >>>>>>> Hi Paras, >>>>>>> >>>>>>> Your cluster.conf file seems correct. If it is not a ntp issue, I >>>>>>> don't see anything except a bug that causes this, or some >>>>>>> prerequisite >>>>>>> that is not respected. >>>>>>> >>>>>>> May be you could post the versions (os, kernel, packages etc...) you >>>>>>> are using, someone may have hit the same issue with your versions. >>>>>>> >>>>>>> Brem >>>>>>> >>>>>>> 2009/9/30, Paras pradhan : >>>>>>>> >>>>>>>> All of the nodes are synced with ntp server. So this is not the case >>>>>>>> with me. >>>>>>>> >>>>>>>> Thanks >>>>>>>> Paras. >>>>>>>> >>>>>>>> On Tue, Sep 29, 2009 at 6:29 PM, Johannes Ru?ek >>>>>>>> wrote: >>>>>>>>> >>>>>>>>> make sure the time on the nodes is in sync, apparently when a node >>>>>>>>> has >>>>>>>>> too >>>>>>>>> much offset, you won't see rgmanager (even though the process is >>>>>>>>> running). >>>>>>>>> this happened today and setting the time fixed it for me. afaicr >>>>>>>>> there >>>>>>>>> was >>>>>>>>> no sign of this in the logs though. >>>>>>>>> johannes >>>>>>>>> >>>>>>>>> Paras pradhan schrieb: >>>>>>>>>> >>>>>>>>>> I don't see rgmanager . 
>>>>>>>>>> >>>>>>>>>> Here is the o/p from clustat >>>>>>>>>> >>>>>>>>>> [root at cvtst1 cluster]# clustat >>>>>>>>>> Cluster Status for test @ Tue Sep 29 15:53:33 2009 >>>>>>>>>> Member Status: Quorate >>>>>>>>>> >>>>>>>>>> ?Member Name >>>>>>>>>> ID >>>>>>>>>> Status >>>>>>>>>> ?------ ---- >>>>>>>>>> ---- >>>>>>>>>> ------ >>>>>>>>>> ?cvtst2 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?1 >>>>>>>>>> Online >>>>>>>>>> ?cvtst1 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? 2 >>>>>>>>>> Online, >>>>>>>>>> Local >>>>>>>>>> ?cvtst3 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? 3 >>>>>>>>>> Online >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Thanks >>>>>>>>>> Paras. >>>>>>>>>> >>>>>>>>>> On Tue, Sep 29, 2009 at 3:44 PM, brem belguebli >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>>> It looks correct, rgmanager seems to start on all nodes >>>>>>>>>>> >>>>>>>>>>> what gives you clustat ? >>>>>>>>>>> >>>>>>>>>>> If rgmanager doesn't show, check out the logs something may have >>>>>>>>>>> gone >>>>>>>>>>> wrong. >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> 2009/9/29 Paras pradhan : >>>>>>>>>>> >>>>>>>>>>>> Change to 7 and i got this log >>>>>>>>>>>> >>>>>>>>>>>> Sep 29 15:33:50 cvtst1 rgmanager: [23295]: Shutting >>>>>>>>>>>> down >>>>>>>>>>>> Cluster Service Manager... >>>>>>>>>>>> Sep 29 15:33:50 cvtst1 clurgmgrd[22869]: Shutting down >>>>>>>>>>>> Sep 29 15:33:50 cvtst1 clurgmgrd[22869]: Shutting down >>>>>>>>>>>> Sep 29 15:33:50 cvtst1 clurgmgrd[22869]: Shutdown >>>>>>>>>>>> complete, >>>>>>>>>>>> exiting >>>>>>>>>>>> Sep 29 15:33:50 cvtst1 rgmanager: [23295]: Cluster >>>>>>>>>>>> Service >>>>>>>>>>>> Manager is stopped. >>>>>>>>>>>> Sep 29 15:33:51 cvtst1 clurgmgrd[23324]: Resource Group >>>>>>>>>>>> Manager Starting >>>>>>>>>>>> Sep 29 15:33:51 cvtst1 clurgmgrd[23324]: Loading Service >>>>>>>>>>>> Data >>>>>>>>>>>> Sep 29 15:33:51 cvtst1 clurgmgrd[23324]: Loading >>>>>>>>>>>> Resource >>>>>>>>>>>> Rules >>>>>>>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: 21 rules loaded >>>>>>>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: Building >>>>>>>>>>>> Resource >>>>>>>>>>>> Trees >>>>>>>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: 0 resources >>>>>>>>>>>> defined >>>>>>>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: Loading >>>>>>>>>>>> Failover >>>>>>>>>>>> Domains >>>>>>>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: 1 domains >>>>>>>>>>>> defined >>>>>>>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: 1 events >>>>>>>>>>>> defined >>>>>>>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: Initializing >>>>>>>>>>>> Services >>>>>>>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: Services >>>>>>>>>>>> Initialized >>>>>>>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: Event: Port >>>>>>>>>>>> Opened >>>>>>>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: State change: >>>>>>>>>>>> Local >>>>>>>>>>>> UP >>>>>>>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: State change: >>>>>>>>>>>> cvtst2 >>>>>>>>>>>> UP >>>>>>>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: State change: >>>>>>>>>>>> cvtst3 >>>>>>>>>>>> UP >>>>>>>>>>>> Sep 29 15:33:57 cvtst1 clurgmgrd[23324]: Event (1:2:1) >>>>>>>>>>>> Processed >>>>>>>>>>>> Sep 29 15:33:57 cvtst1 clurgmgrd[23324]: Event (0:1:1) >>>>>>>>>>>> Processed >>>>>>>>>>>> Sep 29 15:33:57 cvtst1 clurgmgrd[23324]: Event (0:3:1) >>>>>>>>>>>> Processed >>>>>>>>>>>> Sep 29 15:34:02 cvtst1 clurgmgrd[23324]: 3 events >>>>>>>>>>>> processed >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> Anything unusual here? >>>>>>>>>>>> >>>>>>>>>>>> Paras. 
>>>>>>>>>>>> >>>>>>>>>>>> On Tue, Sep 29, 2009 at 11:51 AM, brem belguebli >>>>>>>>>>>> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> I use log_level=7 to have more debugging info. >>>>>>>>>>>>> >>>>>>>>>>>>> It seems 4 is not enough. >>>>>>>>>>>>> >>>>>>>>>>>>> Brem >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> 2009/9/29, Paras pradhan : >>>>>>>>>>>>> >>>>>>>>>>>>>> Withe log_level of 3 I got only this >>>>>>>>>>>>>> >>>>>>>>>>>>>> Sep 29 10:31:31 cvtst1 rgmanager: [7170]: Shutting >>>>>>>>>>>>>> down >>>>>>>>>>>>>> Cluster Service Manager... >>>>>>>>>>>>>> Sep 29 10:31:31 cvtst1 clurgmgrd[6673]: Shutting down >>>>>>>>>>>>>> Sep 29 10:31:41 cvtst1 clurgmgrd[6673]: Shutdown >>>>>>>>>>>>>> complete, >>>>>>>>>>>>>> exiting >>>>>>>>>>>>>> Sep 29 10:31:41 cvtst1 rgmanager: [7170]: Cluster >>>>>>>>>>>>>> Service >>>>>>>>>>>>>> Manager is stopped. >>>>>>>>>>>>>> Sep 29 10:31:42 cvtst1 clurgmgrd[7224]: Resource >>>>>>>>>>>>>> Group >>>>>>>>>>>>>> Manager Starting >>>>>>>>>>>>>> Sep 29 10:39:06 cvtst1 rgmanager: [10327]: Shutting >>>>>>>>>>>>>> down >>>>>>>>>>>>>> Cluster Service Manager... >>>>>>>>>>>>>> Sep 29 10:39:16 cvtst1 rgmanager: [10327]: Cluster >>>>>>>>>>>>>> Service >>>>>>>>>>>>>> Manager is stopped. >>>>>>>>>>>>>> Sep 29 10:39:16 cvtst1 clurgmgrd[10380]: Resource >>>>>>>>>>>>>> Group >>>>>>>>>>>>>> Manager Starting >>>>>>>>>>>>>> Sep 29 10:39:52 cvtst1 clurgmgrd[10380]: Member 1 >>>>>>>>>>>>>> shutting >>>>>>>>>>>>>> down >>>>>>>>>>>>>> >>>>>>>>>>>>>> I do not know what the last line means. >>>>>>>>>>>>>> >>>>>>>>>>>>>> rgmanager version I am running is: >>>>>>>>>>>>>> rgmanager-2.0.52-1.el5.centos >>>>>>>>>>>>>> >>>>>>>>>>>>>> I don't what has gone wrong. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Thanks >>>>>>>>>>>>>> Paras. >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Mon, Sep 28, 2009 at 6:41 PM, brem belguebli >>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> you mean it stopped successfully on all the nodes but it is >>>>>>>>>>>>>>> failing >>>>>>>>>>>>>>> to >>>>>>>>>>>>>>> start only on node cvtst1 ? >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> look at the following page ?to make rgmanager more verbose. >>>>>>>>>>>>>>> It >>>>>>>>>>>>>>> 'll >>>>>>>>>>>>>>> help debug.... >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> http://sources.redhat.com/cluster/wiki/RGManager >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> at Logging Configuration section >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 2009/9/29 Paras pradhan : >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Brem, >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> When I try to restart rgmanager on all the nodes, this time >>>>>>>>>>>>>>>> i >>>>>>>>>>>>>>>> do >>>>>>>>>>>>>>>> not >>>>>>>>>>>>>>>> see rgmanager running on the first node. But I do see on >>>>>>>>>>>>>>>> other >>>>>>>>>>>>>>>> 2 >>>>>>>>>>>>>>>> nodes. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Log on the first node: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Sep 28 18:13:58 cvtst1 clurgmgrd[24099]: Resource >>>>>>>>>>>>>>>> Group >>>>>>>>>>>>>>>> Manager Starting >>>>>>>>>>>>>>>> Sep 28 18:17:29 cvtst1 rgmanager: [24627]: Shutting >>>>>>>>>>>>>>>> down >>>>>>>>>>>>>>>> Cluster Service Manager... >>>>>>>>>>>>>>>> Sep 28 18:17:29 cvtst1 clurgmgrd[24099]: Shutting >>>>>>>>>>>>>>>> down >>>>>>>>>>>>>>>> Sep 28 18:17:39 cvtst1 clurgmgrd[24099]: Shutdown >>>>>>>>>>>>>>>> complete, >>>>>>>>>>>>>>>> exiting >>>>>>>>>>>>>>>> Sep 28 18:17:39 cvtst1 rgmanager: [24627]: Cluster >>>>>>>>>>>>>>>> Service >>>>>>>>>>>>>>>> Manager is stopped. 
>>>>>>>>>>>>>>>> Sep 28 18:17:40 cvtst1 clurgmgrd[24679]: Resource >>>>>>>>>>>>>>>> Group >>>>>>>>>>>>>>>> Manager Starting >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> - >>>>>>>>>>>>>>>> It seems service is running , ?but I do not see rgmanger >>>>>>>>>>>>>>>> running >>>>>>>>>>>>>>>> using clustat >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Don't know what is going on. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Thanks >>>>>>>>>>>>>>>> Paras. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On Mon, Sep 28, 2009 at 5:46 PM, brem belguebli >>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Paras, >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Another thing, it would have been more interesting to have >>>>>>>>>>>>>>>>> a >>>>>>>>>>>>>>>>> start >>>>>>>>>>>>>>>>> DEBUG not a stop. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> That's why I was asking you to first stop the vm manually >>>>>>>>>>>>>>>>> on >>>>>>>>>>>>>>>>> all >>>>>>>>>>>>>>>>> your >>>>>>>>>>>>>>>>> nodes, stop eventually rgmanager on all the nodes to reset >>>>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>>> potential wrong states you may have, restart rgmanager. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> If your VM is configured to autostart, this will make it >>>>>>>>>>>>>>>>> start. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> It should normally fail (as it does now). Send out your >>>>>>>>>>>>>>>>> newly >>>>>>>>>>>>>>>>> created >>>>>>>>>>>>>>>>> DEBUG file. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> 2009/9/29 brem belguebli : >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Hi Paras, >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> I don't know the xen/cluster combination well, but if I do >>>>>>>>>>>>>>>>>> remember >>>>>>>>>>>>>>>>>> well, I think I've read somewhere that when using xen you >>>>>>>>>>>>>>>>>> have >>>>>>>>>>>>>>>>>> to >>>>>>>>>>>>>>>>>> declare the use_virsh=0 key in the VM definition in the >>>>>>>>>>>>>>>>>> cluster.conf. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> This would make rgmanager use xm commands instead of virsh >>>>>>>>>>>>>>>>>> The DEBUG output shows clearly that you are using virsh to >>>>>>>>>>>>>>>>>> manage >>>>>>>>>>>>>>>>>> your >>>>>>>>>>>>>>>>>> VM instead of xm commands. >>>>>>>>>>>>>>>>>> Check out the RH docs about virtualization >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> I'm not a 100% sure about that, I may be completely wrong. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Brem >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> 2009/9/28 Paras pradhan : >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> The only thing I noticed is the message after stopping >>>>>>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>>>>> vm >>>>>>>>>>>>>>>>>>> using xm >>>>>>>>>>>>>>>>>>> in all nodes and starting using clusvcadm is >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> "Virtual machine guest1 is blocked" >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> The whole DEBUG file is attached. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Thanks >>>>>>>>>>>>>>>>>>> Paras. 
>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> On Fri, Sep 25, 2009 at 5:53 PM, brem belguebli >>>>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> There's a problem with the script that is called by >>>>>>>>>>>>>>>>>>>> rgmanager to >>>>>>>>>>>>>>>>>>>> start >>>>>>>>>>>>>>>>>>>> the VM, I don't know what causes it >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> May be you should try something like : >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> 1) stop the VM on all nodes with xm commands >>>>>>>>>>>>>>>>>>>> 2) edit the /usr/share/cluster/vm.sh script and add the >>>>>>>>>>>>>>>>>>>> following >>>>>>>>>>>>>>>>>>>> lines (after the #!/bin/bash ): >>>>>>>>>>>>>>>>>>>> ?exec >/tmp/DEBUG 2>&1 >>>>>>>>>>>>>>>>>>>> ?set -x >>>>>>>>>>>>>>>>>>>> 3) start the VM with clusvcadm -e vm:guest1 >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> It should fail as it did before. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> edit the the /tmp/DEBUG file and you will be able to see >>>>>>>>>>>>>>>>>>>> where >>>>>>>>>>>>>>>>>>>> it >>>>>>>>>>>>>>>>>>>> fails (it may generate a lot of debug) >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> 4) remove the debug lines from /usr/share/cluster/vm.sh >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Post the DEBUG file if you're not able to see where it >>>>>>>>>>>>>>>>>>>> fails. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Brem >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> 2009/9/26 Paras pradhan : >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> No I am not manually starting not using automatic init >>>>>>>>>>>>>>>>>>>>> scripts. >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> I started the vm using: clusvcadm -e vm:guest1 >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> I have just stopped using clusvcadm -s vm:guest1. For >>>>>>>>>>>>>>>>>>>>> few >>>>>>>>>>>>>>>>>>>>> seconds it >>>>>>>>>>>>>>>>>>>>> says guest1 started . But after a while I can see the >>>>>>>>>>>>>>>>>>>>> guest1 on >>>>>>>>>>>>>>>>>>>>> all >>>>>>>>>>>>>>>>>>>>> three nodes. >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> clustat says: >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> ?Service Name >>>>>>>>>>>>>>>>>>>>> ?Owner >>>>>>>>>>>>>>>>>>>>> (Last) >>>>>>>>>>>>>>>>>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?State >>>>>>>>>>>>>>>>>>>>> ?------- ---- >>>>>>>>>>>>>>>>>>>>> ?----- >>>>>>>>>>>>>>>>>>>>> ------ >>>>>>>>>>>>>>>>>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?----- >>>>>>>>>>>>>>>>>>>>> ?vm:guest1 >>>>>>>>>>>>>>>>>>>>> (none) >>>>>>>>>>>>>>>>>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?stopped >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> But I can see the vm from xm li. 
>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> This is what I can see from the log: >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> Sep 25 17:19:01 cvtst1 clurgmgrd[4298]: start >>>>>>>>>>>>>>>>>>>>> on >>>>>>>>>>>>>>>>>>>>> vm >>>>>>>>>>>>>>>>>>>>> "guest1" >>>>>>>>>>>>>>>>>>>>> returned 1 (generic error) >>>>>>>>>>>>>>>>>>>>> Sep 25 17:19:01 cvtst1 clurgmgrd[4298]: #68: >>>>>>>>>>>>>>>>>>>>> Failed >>>>>>>>>>>>>>>>>>>>> to start >>>>>>>>>>>>>>>>>>>>> vm:guest1; return value: 1 >>>>>>>>>>>>>>>>>>>>> Sep 25 17:19:01 cvtst1 clurgmgrd[4298]: >>>>>>>>>>>>>>>>>>>>> Stopping >>>>>>>>>>>>>>>>>>>>> service vm:guest1 >>>>>>>>>>>>>>>>>>>>> Sep 25 17:19:02 cvtst1 clurgmgrd[4298]: >>>>>>>>>>>>>>>>>>>>> Service >>>>>>>>>>>>>>>>>>>>> vm:guest1 is >>>>>>>>>>>>>>>>>>>>> recovering >>>>>>>>>>>>>>>>>>>>> Sep 25 17:19:15 cvtst1 clurgmgrd[4298]: >>>>>>>>>>>>>>>>>>>>> Recovering >>>>>>>>>>>>>>>>>>>>> failed >>>>>>>>>>>>>>>>>>>>> service vm:guest1 >>>>>>>>>>>>>>>>>>>>> Sep 25 17:19:16 cvtst1 clurgmgrd[4298]: start >>>>>>>>>>>>>>>>>>>>> on >>>>>>>>>>>>>>>>>>>>> vm >>>>>>>>>>>>>>>>>>>>> "guest1" >>>>>>>>>>>>>>>>>>>>> returned 1 (generic error) >>>>>>>>>>>>>>>>>>>>> Sep 25 17:19:16 cvtst1 clurgmgrd[4298]: #68: >>>>>>>>>>>>>>>>>>>>> Failed >>>>>>>>>>>>>>>>>>>>> to start >>>>>>>>>>>>>>>>>>>>> vm:guest1; return value: 1 >>>>>>>>>>>>>>>>>>>>> Sep 25 17:19:16 cvtst1 clurgmgrd[4298]: >>>>>>>>>>>>>>>>>>>>> Stopping >>>>>>>>>>>>>>>>>>>>> service vm:guest1 >>>>>>>>>>>>>>>>>>>>> Sep 25 17:19:17 cvtst1 clurgmgrd[4298]: >>>>>>>>>>>>>>>>>>>>> Service >>>>>>>>>>>>>>>>>>>>> vm:guest1 is >>>>>>>>>>>>>>>>>>>>> recovering >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> Paras. >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> On Fri, Sep 25, 2009 at 5:07 PM, brem belguebli >>>>>>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> Have you started ?your VM via rgmanager (clusvcadm -e >>>>>>>>>>>>>>>>>>>>>> vm:guest1) or >>>>>>>>>>>>>>>>>>>>>> using xm commands out of cluster control ?(or maybe a >>>>>>>>>>>>>>>>>>>>>> thru >>>>>>>>>>>>>>>>>>>>>> an >>>>>>>>>>>>>>>>>>>>>> automatic init script ?) >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> When clustered, you should never be starting services >>>>>>>>>>>>>>>>>>>>>> (manually or >>>>>>>>>>>>>>>>>>>>>> thru automatic init script) out of cluster control >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> The thing would be to stop your vm on all the nodes >>>>>>>>>>>>>>>>>>>>>> with >>>>>>>>>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>>>>>>>> adequate >>>>>>>>>>>>>>>>>>>>>> xm command (not using xen myself) and try to start it >>>>>>>>>>>>>>>>>>>>>> with >>>>>>>>>>>>>>>>>>>>>> clusvcadm. >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> Then see if it is started on all nodes (send clustat >>>>>>>>>>>>>>>>>>>>>> output) >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> 2009/9/25 Paras pradhan : >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> Ok. Please see below. my vm is running on all nodes >>>>>>>>>>>>>>>>>>>>>>> though >>>>>>>>>>>>>>>>>>>>>>> clustat >>>>>>>>>>>>>>>>>>>>>>> says it is stopped. >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>>>>>> [root at cvtst1 ~]# clustat >>>>>>>>>>>>>>>>>>>>>>> Cluster Status for test @ Fri Sep 25 16:52:34 2009 >>>>>>>>>>>>>>>>>>>>>>> Member Status: Quorate >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> ?Member Name >>>>>>>>>>>>>>>>>>>>>>> ?ID ? 
Status >>>>>>>>>>>>>>>>>>>>>>> ?------ ---- >>>>>>>>>>>>>>>>>>>>>>> ?---- ------ >>>>>>>>>>>>>>>>>>>>>>> ?cvtst2 >>>>>>>>>>>>>>>>>>>>>>> ?1 >>>>>>>>>>>>>>>>>>>>>>> Online, rgmanager >>>>>>>>>>>>>>>>>>>>>>> ?cvtst1 >>>>>>>>>>>>>>>>>>>>>>> ?2 >>>>>>>>>>>>>>>>>>>>>>> Online, >>>>>>>>>>>>>>>>>>>>>>> Local, rgmanager >>>>>>>>>>>>>>>>>>>>>>> ?cvtst3 >>>>>>>>>>>>>>>>>>>>>>> ?3 >>>>>>>>>>>>>>>>>>>>>>> Online, rgmanager >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> ?Service Name >>>>>>>>>>>>>>>>>>>>>>> ?Owner (Last) >>>>>>>>>>>>>>>>>>>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?State >>>>>>>>>>>>>>>>>>>>>>> ?------- ---- >>>>>>>>>>>>>>>>>>>>>>> ?----- ------ >>>>>>>>>>>>>>>>>>>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?----- >>>>>>>>>>>>>>>>>>>>>>> ?vm:guest1 >>>>>>>>>>>>>>>>>>>>>>> (none) >>>>>>>>>>>>>>>>>>>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?stopped >>>>>>>>>>>>>>>>>>>>>>> [root at cvtst1 ~]# >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> --- >>>>>>>>>>>>>>>>>>>>>>> o/p of xm li on cvtst1 >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>>>>>> [root at cvtst1 ~]# xm li >>>>>>>>>>>>>>>>>>>>>>> Name ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?ID Mem(MiB) >>>>>>>>>>>>>>>>>>>>>>> VCPUs >>>>>>>>>>>>>>>>>>>>>>> State ? Time(s) >>>>>>>>>>>>>>>>>>>>>>> Domain-0 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? 0 ? ? 3470 >>>>>>>>>>>>>>>>>>>>>>> 2 >>>>>>>>>>>>>>>>>>>>>>> r----- ?28939.4 >>>>>>>>>>>>>>>>>>>>>>> guest1 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? 7 ? ? ?511 >>>>>>>>>>>>>>>>>>>>>>> 1 >>>>>>>>>>>>>>>>>>>>>>> -b---- ? 7727.8 >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> o/p of xm li on cvtst2 >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>>>>>> [root at cvtst2 ~]# xm li >>>>>>>>>>>>>>>>>>>>>>> Name ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?ID Mem(MiB) >>>>>>>>>>>>>>>>>>>>>>> VCPUs >>>>>>>>>>>>>>>>>>>>>>> State ? Time(s) >>>>>>>>>>>>>>>>>>>>>>> Domain-0 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? 0 ? ? 3470 >>>>>>>>>>>>>>>>>>>>>>> 2 >>>>>>>>>>>>>>>>>>>>>>> r----- ?31558.9 >>>>>>>>>>>>>>>>>>>>>>> guest1 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?21 ? ? ?511 >>>>>>>>>>>>>>>>>>>>>>> 1 >>>>>>>>>>>>>>>>>>>>>>> -b---- ? 7558.2 >>>>>>>>>>>>>>>>>>>>>>> --- >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> Thanks >>>>>>>>>>>>>>>>>>>>>>> Paras. >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> On Fri, Sep 25, 2009 at 4:22 PM, brem belguebli >>>>>>>>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> It looks like no. >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> can you send an output of clustat ?of when the VM is >>>>>>>>>>>>>>>>>>>>>>>> running >>>>>>>>>>>>>>>>>>>>>>>> on >>>>>>>>>>>>>>>>>>>>>>>> multiple nodes at the same time? >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> And by the way, another one after having stopped >>>>>>>>>>>>>>>>>>>>>>>> (clusvcadm >>>>>>>>>>>>>>>>>>>>>>>> -s vm:guest1) ? >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> 2009/9/25 Paras pradhan : >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> Anyone having issue as mine? Virtual machine >>>>>>>>>>>>>>>>>>>>>>>>> service >>>>>>>>>>>>>>>>>>>>>>>>> is >>>>>>>>>>>>>>>>>>>>>>>>> not >>>>>>>>>>>>>>>>>>>>>>>>> being >>>>>>>>>>>>>>>>>>>>>>>>> properly handled by the cluster. >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> Thanks >>>>>>>>>>>>>>>>>>>>>>>>> Paras. 
>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> On Mon, Sep 21, 2009 at 9:55 AM, Paras pradhan >>>>>>>>>>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> Ok.. here is my cluster.conf file >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>>>>>>>>> [root at cvtst1 cluster]# more cluster.conf >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> name="test"> >>>>>>>>>>>>>>>>>>>>>>>>>> ? ?>>>>>>>>>>>>>>>>>>>>>>>>> post_fail_delay="0" >>>>>>>>>>>>>>>>>>>>>>>>>> post_join_delay="3"/> >>>>>>>>>>>>>>>>>>>>>>>>>> ? ? >>>>>>>>>>>>>>>>>>>>>>>>>> ? ? ? ? ? ?>>>>>>>>>>>>>>>>>>>>>>>>> votes="1"> >>>>>>>>>>>>>>>>>>>>>>>>>> ? ? ? ? ? ? ? ? ? ? >>>>>>>>>>>>>>>>>>>>>>>>>> ? ? ? ? ? ? >>>>>>>>>>>>>>>>>>>>>>>>>> ? ? ? ? ? ?>>>>>>>>>>>>>>>>>>>>>>>>> votes="1"> >>>>>>>>>>>>>>>>>>>>>>>>>> ? ? ? ? ? ? ? ? ? ? >>>>>>>>>>>>>>>>>>>>>>>>>> ? ? ? ? ? ? >>>>>>>>>>>>>>>>>>>>>>>>>> ? ? ? ? ? ?>>>>>>>>>>>>>>>>>>>>>>>>> votes="1"> >>>>>>>>>>>>>>>>>>>>>>>>>> ? ? ? ? ? ? ? ? ? ? >>>>>>>>>>>>>>>>>>>>>>>>>> ? ? ? ? ? ? >>>>>>>>>>>>>>>>>>>>>>>>>> ? ? >>>>>>>>>>>>>>>>>>>>>>>>>> ? ? >>>>>>>>>>>>>>>>>>>>>>>>>> ? ? >>>>>>>>>>>>>>>>>>>>>>>>>> ? ? >>>>>>>>>>>>>>>>>>>>>>>>>> ? ? ? ? ? ? >>>>>>>>>>>>>>>>>>>>>>>>>> ? ? ? ? ? ? ? ? ? ?>>>>>>>>>>>>>>>>>>>>>>>>> nofailback="0" ordered="1" restricted="0"> >>>>>>>>>>>>>>>>>>>>>>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ?>>>>>>>>>>>>>>>>>>>>>>>>> name="cvtst2" priority="3"/> >>>>>>>>>>>>>>>>>>>>>>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ?>>>>>>>>>>>>>>>>>>>>>>>>> name="cvtst1" priority="1"/> >>>>>>>>>>>>>>>>>>>>>>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ?>>>>>>>>>>>>>>>>>>>>>>>>> name="cvtst3" priority="2"/> >>>>>>>>>>>>>>>>>>>>>>>>>> ? ? ? ? ? ? ? ? ? ? >>>>>>>>>>>>>>>>>>>>>>>>>> ? ? ? ? ? ? >>>>>>>>>>>>>>>>>>>>>>>>>> ? ? ? ? ? ? >>>>>>>>>>>>>>>>>>>>>>>>>> ? ? ? ? ? ?>>>>>>>>>>>>>>>>>>>>>>>>> exclusive="0" max_restarts="0" >>>>>>>>>>>>>>>>>>>>>>>>>> name="guest1" path="/vms" recovery="r >>>>>>>>>>>>>>>>>>>>>>>>>> estart" restart_expire_time="0"/> >>>>>>>>>>>>>>>>>>>>>>>>>> ? ? >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> [root at cvtst1 cluster]# >>>>>>>>>>>>>>>>>>>>>>>>>> ------ >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> Thanks! >>>>>>>>>>>>>>>>>>>>>>>>>> Paras. >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> On Sun, Sep 20, 2009 at 9:44 AM, Volker Dormeyer >>>>>>>>>>>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>> On Fri, Sep 18, 2009 at 05:08:57PM -0500, >>>>>>>>>>>>>>>>>>>>>>>>>>> Paras pradhan wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> I am using cluster suite for HA of xen virtual >>>>>>>>>>>>>>>>>>>>>>>>>>>> machines. >>>>>>>>>>>>>>>>>>>>>>>>>>>> Now I am >>>>>>>>>>>>>>>>>>>>>>>>>>>> having another problem. When I start the my xen >>>>>>>>>>>>>>>>>>>>>>>>>>>> vm >>>>>>>>>>>>>>>>>>>>>>>>>>>> in >>>>>>>>>>>>>>>>>>>>>>>>>>>> one node, it >>>>>>>>>>>>>>>>>>>>>>>>>>>> also starts on other nodes. Which daemon >>>>>>>>>>>>>>>>>>>>>>>>>>>> controls >>>>>>>>>>>>>>>>>>>>>>>>>>>> ?this? >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>> This is usually done bei clurgmgrd (which is part >>>>>>>>>>>>>>>>>>>>>>>>>>> of >>>>>>>>>>>>>>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>>>>>>>>>>>>> rgmanager >>>>>>>>>>>>>>>>>>>>>>>>>>> package). To me, this sounds like a configuration >>>>>>>>>>>>>>>>>>>>>>>>>>> problem. Maybe, >>>>>>>>>>>>>>>>>>>>>>>>>>> you can post your cluster.conf? 
>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>> Regards, >>>>>>>>>>>>>>>>>>>>>>>>>>> Volker >>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>>>>>>>>>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>>>>>>>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>>>>>>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>>>>>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>>>>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>>>>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>>>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>>>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>> -- >>>>>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>> -- >>>>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>> -- >>>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>> -- >>>>>>>>>>> Linux-cluster mailing list 
>>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> -- >>>>>>>>>> Linux-cluster mailing list >>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>> >>>>>>>>> -- >>>>>>>>> Linux-cluster mailing list >>>>>>>>> Linux-cluster at redhat.com >>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>> >>>>>>>> -- >>>>>>>> Linux-cluster mailing list >>>>>>>> Linux-cluster at redhat.com >>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>> >>>>>>> -- >>>>>>> Linux-cluster mailing list >>>>>>> Linux-cluster at redhat.com >>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>> >>>>>> -- >>>>>> Linux-cluster mailing list >>>>>> Linux-cluster at redhat.com >>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>> >>>>> -- >>>>> - Daniela Anzellotti ------------------------------------ >>>>> ?INFN Roma - tel.: +39.06.49914282 - fax: +39.06.490354 >>>>> ?e-mail: daniela.anzellotti at roma1.infn.it >>>>> --------------------------------------------------------- >>>>> >>>>> -- >>>>> Linux-cluster mailing list >>>>> Linux-cluster at redhat.com >>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>> >>>> -- >>>> Linux-cluster mailing list >>>> Linux-cluster at redhat.com >>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>> >>> -- >>> - Daniela Anzellotti ------------------------------------ >>> ?INFN Roma - tel.: +39.06.49914282 - fax: +39.06.490354 >>> ?e-mail: daniela.anzellotti at roma1.infn.it >>> --------------------------------------------------------- >>> >>> -- >>> Linux-cluster mailing list >>> Linux-cluster at redhat.com >>> https://www.redhat.com/mailman/listinfo/linux-cluster >>> >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> > > -- > - Daniela Anzellotti ------------------------------------ > ?INFN Roma - tel.: +39.06.49914282 - fax: +39.06.490354 > ?e-mail: daniela.anzellotti at roma1.infn.it > --------------------------------------------------------- > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From brem.belguebli at gmail.com Tue Oct 6 20:03:03 2009 From: brem.belguebli at gmail.com (brem belguebli) Date: Tue, 6 Oct 2009 22:03:03 +0200 Subject: [Linux-cluster] Cluster online reconfiguration In-Reply-To: References: Message-ID: <29ae894c0910061303y34ea2e52n193319c3dc0ffdd6@mail.gmail.com> Hi, by experience, all changed resources will be restarted once the cluster version number is changed.Your resource will be restarted. I didn't find any documentation about that, it is just "short" experience. Among the funny experiences, suppose you have a service containing a lvm resource child, which contains itself a FS child. Your service is running under cluster control. if you add a nfs export under this running resource, after updating the conf (ccs_tool update), the whole service won't be restarted but only the nfsexport will be online added. 2009/10/6 Johnson, Eric > Is there documentation that states how rgmanager handles online > reconfiguration of cluster resources when issued a 'ccs_tool update > /etc/cluster/cluster.conf'? 
> > I have some ext3 file systems under cluster control, and I want to > change the force_fsck option from 1 to 0, but I'm not sure if this will > trigger an unmount/remount of the file systems, which I definitely don't > want happening. > > RHEL 5.4 32-bit > Kernel 2.6.18-164.el5PAE > rgmanager-2.0.52-1.el5 > cman-2.0.115-1.el5_4.2 > openais-0.80.6-8.el5 > > Thanks, > Eric > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From pradhanparas at gmail.com Tue Oct 6 20:54:10 2009 From: pradhanparas at gmail.com (Paras pradhan) Date: Tue, 6 Oct 2009 15:54:10 -0500 Subject: [Linux-cluster] openais issue In-Reply-To: <8b711df40910061210k318f5c70n2ec8c8b880cbcaaa@mail.gmail.com> References: <8b711df40909161052i41f9c007na85cf2ef2919ae0f@mail.gmail.com> <8b711df40909300749i2af6a711v2f866c55a046a388@mail.gmail.com> <29ae894c0909300802oa8d72c5k892115a3e2f67db9@mail.gmail.com> <8b711df40909300811q1724aa68hfea589bbb32b4ce5@mail.gmail.com> <4AC9E445.3050702@roma1.infn.it> <8b711df40910060831n6b6c7a6aw27761c89403c51e1@mail.gmail.com> <4ACB66BA.8040002@roma1.infn.it> <8b711df40910060920q44d502a6j34e3ca6cfeb63128@mail.gmail.com> <4ACB8594.6090004@roma1.infn.it> <8b711df40910061210k318f5c70n2ec8c8b880cbcaaa@mail.gmail.com> Message-ID: <8b711df40910061354g7042fb43va73de4199265a5ef@mail.gmail.com> Ok . Didnot want to , but I resinstalled the cluster packages (except CMAN) , rebooted the node and add the node to the cluster again. It worked fine now. Paras. On Tue, Oct 6, 2009 at 2:10 PM, Paras pradhan wrote: > Yes I did that as well. Node1 of my cluster doesn't show (using clust) > vm service whereas others do. I guess the problem is with the > rgmanager but it looks to be running fine. > > > Paras. > > > On Tue, Oct 6, 2009 at 12:59 PM, Daniela Anzellotti > wrote: >> Hi Paras, >> >> did you reboot all the cluster nodes? I needed a complete reboot (actually I >> was angry enough to switch everything off and on again): a restart of the >> cluster suite was not enough. >> >> As far as I understood, since the cluster was not able to bring a VM in a >> good running state, it decided that the VM has to migrate to another node... >> and at the end all the VMs was trying to migrate from one node to another >> with the result that I had all VMs starting on all the cluster nodes. >> Restarting the cluster suite didn't kill a lot of processes that was >> stubbornly trying to migrate virtual machines... >> >> Daniela >> >> Paras pradhan wrote: >>> >>> Adding use_virsh=0 works great. Now I do not have my vm starting at >>> all the nodes. This is a good fix. Thanks .. >>> >>> The only problem left is I do not see rgmanager running on my node1 >>> and the clustat of node2 and node3 is reporting vm as migrating >>> >>> o/p >>> >>> Service Name ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?Owner (Last) >>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? State >>> ?------- ---- ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?----- ------ >>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?----- >>> ?vm:guest1 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? cvtst1 >>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? migrating >>> [root at cvtst3 vms]# >>> >>> >>> Thanks >>> Paras. >>> >>> On Tue, Oct 6, 2009 at 10:48 AM, Daniela Anzellotti >>> wrote: >>>> >>>> Hi Paras, >>>> >>>> yes. At least it looks so... 
>>>> >>>> We have a cluster of two nodes + a quorum disk (it's not configured as a >>>> "two-node cluster") >>>> >>>> They are running Scientific Linux 5.x, kernel 2.6.18-128.7.1.el5xen and >>>> >>>> ?openais-0.80.6-8.el5.x86_64 >>>> ?cman-2.0.115-1.el5.x86_64 >>>> ?rgmanager-2.0.52-1.el5.x86_64 >>>> >>>> The XEN VMs access the disk as simple block devices. >>>> Disks are on a SAN, configured with Clustered LVM. >>>> >>>> ?xen-3.0.3-94.el5_4.1.x86_64 >>>> ?xen-libs-3.0.3-94.el5_4.1.x86_64 >>>> >>>> VM configuration files are as the following >>>> >>>> ?name = "www1" >>>> ?uuid = "3bd3e910-23c0-97ee-55ab-086260ef1e53" >>>> ?memory = 1024 >>>> ?maxmem = 1024 >>>> ?vcpus = 1 >>>> ?bootloader = "/usr/bin/pygrub" >>>> ?vfb = [ "type=vnc,vncunused=1,keymap=en-us" ] >>>> ?disk = [ "phy:/dev/vg_cluster/www1.disk,xvda,w", \ >>>> ?"phy:/dev/vg_cluster/www1.swap,xvdb,w" ] >>>> ?vif = [ "mac=00:16:3e:da:00:07,bridge=xenbr1" ] >>>> ?on_poweroff = "destroy" >>>> ?on_reboot = "restart" >>>> ?on_crash = "restart" >>>> ?extra = "xencons=tty0 console=tty0" >>>> >>>> >>>> I changed in /etc/cluster/cluster.conf all the VM directive from >>>> >>>> ?>>> ?migrate="live" name="www1" path="/etc/xen" recovery="restart"/> >>>> >>>> to >>>> >>>> ?>>> ?migrate="live" name="www1" path="/etc/xen" recovery="restart"/> >>>> >>>> >>>> Rebooted the cluster nodes and it started working again... >>>> >>>> As i said I hope I'll not have any other bad surprise (I tested a VM >>>> migration and it is working too), but at least cluster it's working now >>>> (it >>>> was not able to start a VM, before)! >>>> >>>> Ciao >>>> Daniela >>>> >>>> >>>> Paras pradhan wrote: >>>>> >>>>> So you mean your cluster is running fine with the CMAN >>>>> cman-2.0.115-1.el5.x86_64 ? >>>>> >>>>> Which version of openais are you running? >>>>> >>>>> Thanks >>>>> Paras. >>>>> >>>>> >>>>> On Mon, Oct 5, 2009 at 7:19 AM, Daniela Anzellotti >>>>> wrote: >>>>>> >>>>>> Hi all, >>>>>> >>>>>> I had a problem similar to Paras's one today: yum updated the following >>>>>> rpms >>>>>> last week and today (I had to restart the cluster) the cluster was not >>>>>> able >>>>>> to start vm: services. >>>>>> >>>>>> Oct 02 05:31:05 Updated: openais-0.80.6-8.el5.x86_64 >>>>>> Oct 02 05:31:07 Updated: cman-2.0.115-1.el5.x86_64 >>>>>> Oct 02 05:31:10 Updated: rgmanager-2.0.52-1.el5.x86_64 >>>>>> >>>>>> Oct 03 04:03:12 Updated: xen-libs-3.0.3-94.el5_4.1.x86_64 >>>>>> Oct 03 04:03:12 Updated: xen-libs-3.0.3-94.el5_4.1.i386 >>>>>> Oct 03 04:03:16 Updated: xen-3.0.3-94.el5_4.1.x86_64 >>>>>> >>>>>> >>>>>> So, after checked the vm.sh script, I add the declaration use_virsh="0" >>>>>> in >>>>>> the VM definition in the cluster.conf (as suggested by Brem, thanks!) >>>>>> and >>>>>> everything is now working again. >>>>>> >>>>>> >>>>>> BTW I didn't understand if the problem was caused by the new XEN >>>>>> version >>>>>> or >>>>>> the new openais one, thus I disabled automatic updates for both. >>>>>> >>>>>> I hope I'll not have any other bad surprise... >>>>>> >>>>>> Thank you, >>>>>> cheers, >>>>>> Daniela >>>>>> >>>>>> >>>>>> Paras pradhan wrote: >>>>>>> >>>>>>> Yes this is very strange. I don't know what to do now. May be re >>>>>>> create the cluster? But not a good solution actually. 
>>>>>>> >>>>>>> Packages : >>>>>>> >>>>>>> Kernel: kernel-xen-2.6.18-164.el5 >>>>>>> OS: Full updated of CentOS 5.3 except CMAN downgraded to >>>>>>> cman-2.0.98-1.el5 >>>>>>> >>>>>>> Other packages related to cluster suite: >>>>>>> >>>>>>> rgmanager-2.0.52-1.el5.centos >>>>>>> cman-2.0.98-1.el5 >>>>>>> xen-3.0.3-80.el5_3.3 >>>>>>> xen-libs-3.0.3-80.el5_3.3 >>>>>>> kmod-gfs-xen-0.1.31-3.el5_3.1 >>>>>>> kmod-gfs-xen-0.1.31-3.el5_3.1 >>>>>>> kmod-gfs-0.1.31-3.el5_3.1 >>>>>>> gfs-utils-0.1.18-1.el5 >>>>>>> gfs2-utils-0.1.62-1.el5 >>>>>>> lvm2-2.02.40-6.el5 >>>>>>> lvm2-cluster-2.02.40-7.el5 >>>>>>> openais-0.80.3-22.el5_3.9 >>>>>>> >>>>>>> Thanks! >>>>>>> Paras. >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Wed, Sep 30, 2009 at 10:02 AM, brem belguebli >>>>>>> wrote: >>>>>>>> >>>>>>>> Hi Paras, >>>>>>>> >>>>>>>> Your cluster.conf file seems correct. If it is not a ntp issue, I >>>>>>>> don't see anything except a bug that causes this, or some >>>>>>>> prerequisite >>>>>>>> that is not respected. >>>>>>>> >>>>>>>> May be you could post the versions (os, kernel, packages etc...) you >>>>>>>> are using, someone may have hit the same issue with your versions. >>>>>>>> >>>>>>>> Brem >>>>>>>> >>>>>>>> 2009/9/30, Paras pradhan : >>>>>>>>> >>>>>>>>> All of the nodes are synced with ntp server. So this is not the case >>>>>>>>> with me. >>>>>>>>> >>>>>>>>> Thanks >>>>>>>>> Paras. >>>>>>>>> >>>>>>>>> On Tue, Sep 29, 2009 at 6:29 PM, Johannes Ru?ek >>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>> make sure the time on the nodes is in sync, apparently when a node >>>>>>>>>> has >>>>>>>>>> too >>>>>>>>>> much offset, you won't see rgmanager (even though the process is >>>>>>>>>> running). >>>>>>>>>> this happened today and setting the time fixed it for me. afaicr >>>>>>>>>> there >>>>>>>>>> was >>>>>>>>>> no sign of this in the logs though. >>>>>>>>>> johannes >>>>>>>>>> >>>>>>>>>> Paras pradhan schrieb: >>>>>>>>>>> >>>>>>>>>>> I don't see rgmanager . >>>>>>>>>>> >>>>>>>>>>> Here is the o/p from clustat >>>>>>>>>>> >>>>>>>>>>> [root at cvtst1 cluster]# clustat >>>>>>>>>>> Cluster Status for test @ Tue Sep 29 15:53:33 2009 >>>>>>>>>>> Member Status: Quorate >>>>>>>>>>> >>>>>>>>>>> ?Member Name >>>>>>>>>>> ID >>>>>>>>>>> Status >>>>>>>>>>> ?------ ---- >>>>>>>>>>> ---- >>>>>>>>>>> ------ >>>>>>>>>>> ?cvtst2 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?1 >>>>>>>>>>> Online >>>>>>>>>>> ?cvtst1 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? 2 >>>>>>>>>>> Online, >>>>>>>>>>> Local >>>>>>>>>>> ?cvtst3 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? 3 >>>>>>>>>>> Online >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> Thanks >>>>>>>>>>> Paras. >>>>>>>>>>> >>>>>>>>>>> On Tue, Sep 29, 2009 at 3:44 PM, brem belguebli >>>>>>>>>>> wrote: >>>>>>>>>>> >>>>>>>>>>>> It looks correct, rgmanager seems to start on all nodes >>>>>>>>>>>> >>>>>>>>>>>> what gives you clustat ? >>>>>>>>>>>> >>>>>>>>>>>> If rgmanager doesn't show, check out the logs something may have >>>>>>>>>>>> gone >>>>>>>>>>>> wrong. >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> 2009/9/29 Paras pradhan : >>>>>>>>>>>> >>>>>>>>>>>>> Change to 7 and i got this log >>>>>>>>>>>>> >>>>>>>>>>>>> Sep 29 15:33:50 cvtst1 rgmanager: [23295]: Shutting >>>>>>>>>>>>> down >>>>>>>>>>>>> Cluster Service Manager... 
>>>>>>>>>>>>> Sep 29 15:33:50 cvtst1 clurgmgrd[22869]: Shutting down >>>>>>>>>>>>> Sep 29 15:33:50 cvtst1 clurgmgrd[22869]: Shutting down >>>>>>>>>>>>> Sep 29 15:33:50 cvtst1 clurgmgrd[22869]: Shutdown >>>>>>>>>>>>> complete, >>>>>>>>>>>>> exiting >>>>>>>>>>>>> Sep 29 15:33:50 cvtst1 rgmanager: [23295]: Cluster >>>>>>>>>>>>> Service >>>>>>>>>>>>> Manager is stopped. >>>>>>>>>>>>> Sep 29 15:33:51 cvtst1 clurgmgrd[23324]: Resource Group >>>>>>>>>>>>> Manager Starting >>>>>>>>>>>>> Sep 29 15:33:51 cvtst1 clurgmgrd[23324]: Loading Service >>>>>>>>>>>>> Data >>>>>>>>>>>>> Sep 29 15:33:51 cvtst1 clurgmgrd[23324]: Loading >>>>>>>>>>>>> Resource >>>>>>>>>>>>> Rules >>>>>>>>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: 21 rules loaded >>>>>>>>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: Building >>>>>>>>>>>>> Resource >>>>>>>>>>>>> Trees >>>>>>>>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: 0 resources >>>>>>>>>>>>> defined >>>>>>>>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: Loading >>>>>>>>>>>>> Failover >>>>>>>>>>>>> Domains >>>>>>>>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: 1 domains >>>>>>>>>>>>> defined >>>>>>>>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: 1 events >>>>>>>>>>>>> defined >>>>>>>>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: Initializing >>>>>>>>>>>>> Services >>>>>>>>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: Services >>>>>>>>>>>>> Initialized >>>>>>>>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: Event: Port >>>>>>>>>>>>> Opened >>>>>>>>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: State change: >>>>>>>>>>>>> Local >>>>>>>>>>>>> UP >>>>>>>>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: State change: >>>>>>>>>>>>> cvtst2 >>>>>>>>>>>>> UP >>>>>>>>>>>>> Sep 29 15:33:52 cvtst1 clurgmgrd[23324]: State change: >>>>>>>>>>>>> cvtst3 >>>>>>>>>>>>> UP >>>>>>>>>>>>> Sep 29 15:33:57 cvtst1 clurgmgrd[23324]: Event (1:2:1) >>>>>>>>>>>>> Processed >>>>>>>>>>>>> Sep 29 15:33:57 cvtst1 clurgmgrd[23324]: Event (0:1:1) >>>>>>>>>>>>> Processed >>>>>>>>>>>>> Sep 29 15:33:57 cvtst1 clurgmgrd[23324]: Event (0:3:1) >>>>>>>>>>>>> Processed >>>>>>>>>>>>> Sep 29 15:34:02 cvtst1 clurgmgrd[23324]: 3 events >>>>>>>>>>>>> processed >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> Anything unusual here? >>>>>>>>>>>>> >>>>>>>>>>>>> Paras. >>>>>>>>>>>>> >>>>>>>>>>>>> On Tue, Sep 29, 2009 at 11:51 AM, brem belguebli >>>>>>>>>>>>> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> I use log_level=7 to have more debugging info. >>>>>>>>>>>>>> >>>>>>>>>>>>>> It seems 4 is not enough. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Brem >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> 2009/9/29, Paras pradhan : >>>>>>>>>>>>>> >>>>>>>>>>>>>>> Withe log_level of 3 I got only this >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Sep 29 10:31:31 cvtst1 rgmanager: [7170]: Shutting >>>>>>>>>>>>>>> down >>>>>>>>>>>>>>> Cluster Service Manager... >>>>>>>>>>>>>>> Sep 29 10:31:31 cvtst1 clurgmgrd[6673]: Shutting down >>>>>>>>>>>>>>> Sep 29 10:31:41 cvtst1 clurgmgrd[6673]: Shutdown >>>>>>>>>>>>>>> complete, >>>>>>>>>>>>>>> exiting >>>>>>>>>>>>>>> Sep 29 10:31:41 cvtst1 rgmanager: [7170]: Cluster >>>>>>>>>>>>>>> Service >>>>>>>>>>>>>>> Manager is stopped. >>>>>>>>>>>>>>> Sep 29 10:31:42 cvtst1 clurgmgrd[7224]: Resource >>>>>>>>>>>>>>> Group >>>>>>>>>>>>>>> Manager Starting >>>>>>>>>>>>>>> Sep 29 10:39:06 cvtst1 rgmanager: [10327]: Shutting >>>>>>>>>>>>>>> down >>>>>>>>>>>>>>> Cluster Service Manager... >>>>>>>>>>>>>>> Sep 29 10:39:16 cvtst1 rgmanager: [10327]: Cluster >>>>>>>>>>>>>>> Service >>>>>>>>>>>>>>> Manager is stopped. 
>>>>>>>>>>>>>>> Sep 29 10:39:16 cvtst1 clurgmgrd[10380]: Resource >>>>>>>>>>>>>>> Group >>>>>>>>>>>>>>> Manager Starting >>>>>>>>>>>>>>> Sep 29 10:39:52 cvtst1 clurgmgrd[10380]: Member 1 >>>>>>>>>>>>>>> shutting >>>>>>>>>>>>>>> down >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> I do not know what the last line means. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> rgmanager version I am running is: >>>>>>>>>>>>>>> rgmanager-2.0.52-1.el5.centos >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> I don't what has gone wrong. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Thanks >>>>>>>>>>>>>>> Paras. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Mon, Sep 28, 2009 at 6:41 PM, brem belguebli >>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> you mean it stopped successfully on all the nodes but it is >>>>>>>>>>>>>>>> failing >>>>>>>>>>>>>>>> to >>>>>>>>>>>>>>>> start only on node cvtst1 ? >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> look at the following page ?to make rgmanager more verbose. >>>>>>>>>>>>>>>> It >>>>>>>>>>>>>>>> 'll >>>>>>>>>>>>>>>> help debug.... >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> http://sources.redhat.com/cluster/wiki/RGManager >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> at Logging Configuration section >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> 2009/9/29 Paras pradhan : >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Brem, >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> When I try to restart rgmanager on all the nodes, this time >>>>>>>>>>>>>>>>> i >>>>>>>>>>>>>>>>> do >>>>>>>>>>>>>>>>> not >>>>>>>>>>>>>>>>> see rgmanager running on the first node. But I do see on >>>>>>>>>>>>>>>>> other >>>>>>>>>>>>>>>>> 2 >>>>>>>>>>>>>>>>> nodes. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Log on the first node: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Sep 28 18:13:58 cvtst1 clurgmgrd[24099]: Resource >>>>>>>>>>>>>>>>> Group >>>>>>>>>>>>>>>>> Manager Starting >>>>>>>>>>>>>>>>> Sep 28 18:17:29 cvtst1 rgmanager: [24627]: Shutting >>>>>>>>>>>>>>>>> down >>>>>>>>>>>>>>>>> Cluster Service Manager... >>>>>>>>>>>>>>>>> Sep 28 18:17:29 cvtst1 clurgmgrd[24099]: Shutting >>>>>>>>>>>>>>>>> down >>>>>>>>>>>>>>>>> Sep 28 18:17:39 cvtst1 clurgmgrd[24099]: Shutdown >>>>>>>>>>>>>>>>> complete, >>>>>>>>>>>>>>>>> exiting >>>>>>>>>>>>>>>>> Sep 28 18:17:39 cvtst1 rgmanager: [24627]: Cluster >>>>>>>>>>>>>>>>> Service >>>>>>>>>>>>>>>>> Manager is stopped. >>>>>>>>>>>>>>>>> Sep 28 18:17:40 cvtst1 clurgmgrd[24679]: Resource >>>>>>>>>>>>>>>>> Group >>>>>>>>>>>>>>>>> Manager Starting >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> - >>>>>>>>>>>>>>>>> It seems service is running , ?but I do not see rgmanger >>>>>>>>>>>>>>>>> running >>>>>>>>>>>>>>>>> using clustat >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Don't know what is going on. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Thanks >>>>>>>>>>>>>>>>> Paras. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On Mon, Sep 28, 2009 at 5:46 PM, brem belguebli >>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Paras, >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Another thing, it would have been more interesting to have >>>>>>>>>>>>>>>>>> a >>>>>>>>>>>>>>>>>> start >>>>>>>>>>>>>>>>>> DEBUG not a stop. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> That's why I was asking you to first stop the vm manually >>>>>>>>>>>>>>>>>> on >>>>>>>>>>>>>>>>>> all >>>>>>>>>>>>>>>>>> your >>>>>>>>>>>>>>>>>> nodes, stop eventually rgmanager on all the nodes to reset >>>>>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>>>> potential wrong states you may have, restart rgmanager. 
>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> If your VM is configured to autostart, this will make it >>>>>>>>>>>>>>>>>> start. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> It should normally fail (as it does now). Send out your >>>>>>>>>>>>>>>>>> newly >>>>>>>>>>>>>>>>>> created >>>>>>>>>>>>>>>>>> DEBUG file. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> 2009/9/29 brem belguebli : >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Hi Paras, >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> I don't know the xen/cluster combination well, but if I do >>>>>>>>>>>>>>>>>>> remember >>>>>>>>>>>>>>>>>>> well, I think I've read somewhere that when using xen you >>>>>>>>>>>>>>>>>>> have >>>>>>>>>>>>>>>>>>> to >>>>>>>>>>>>>>>>>>> declare the use_virsh=0 key in the VM definition in the >>>>>>>>>>>>>>>>>>> cluster.conf. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> This would make rgmanager use xm commands instead of virsh >>>>>>>>>>>>>>>>>>> The DEBUG output shows clearly that you are using virsh to >>>>>>>>>>>>>>>>>>> manage >>>>>>>>>>>>>>>>>>> your >>>>>>>>>>>>>>>>>>> VM instead of xm commands. >>>>>>>>>>>>>>>>>>> Check out the RH docs about virtualization >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> I'm not a 100% sure about that, I may be completely wrong. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Brem >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> 2009/9/28 Paras pradhan : >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> The only thing I noticed is the message after stopping >>>>>>>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>>>>>> vm >>>>>>>>>>>>>>>>>>>> using xm >>>>>>>>>>>>>>>>>>>> in all nodes and starting using clusvcadm is >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> "Virtual machine guest1 is blocked" >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> The whole DEBUG file is attached. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Thanks >>>>>>>>>>>>>>>>>>>> Paras. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> On Fri, Sep 25, 2009 at 5:53 PM, brem belguebli >>>>>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> There's a problem with the script that is called by >>>>>>>>>>>>>>>>>>>>> rgmanager to >>>>>>>>>>>>>>>>>>>>> start >>>>>>>>>>>>>>>>>>>>> the VM, I don't know what causes it >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> May be you should try something like : >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> 1) stop the VM on all nodes with xm commands >>>>>>>>>>>>>>>>>>>>> 2) edit the /usr/share/cluster/vm.sh script and add the >>>>>>>>>>>>>>>>>>>>> following >>>>>>>>>>>>>>>>>>>>> lines (after the #!/bin/bash ): >>>>>>>>>>>>>>>>>>>>> ?exec >/tmp/DEBUG 2>&1 >>>>>>>>>>>>>>>>>>>>> ?set -x >>>>>>>>>>>>>>>>>>>>> 3) start the VM with clusvcadm -e vm:guest1 >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> It should fail as it did before. >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> edit the the /tmp/DEBUG file and you will be able to see >>>>>>>>>>>>>>>>>>>>> where >>>>>>>>>>>>>>>>>>>>> it >>>>>>>>>>>>>>>>>>>>> fails (it may generate a lot of debug) >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> 4) remove the debug lines from /usr/share/cluster/vm.sh >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> Post the DEBUG file if you're not able to see where it >>>>>>>>>>>>>>>>>>>>> fails. >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> Brem >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> 2009/9/26 Paras pradhan : >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> No I am not manually starting not using automatic init >>>>>>>>>>>>>>>>>>>>>> scripts. 
>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> I started the vm using: clusvcadm -e vm:guest1 >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> I have just stopped using clusvcadm -s vm:guest1. For >>>>>>>>>>>>>>>>>>>>>> few >>>>>>>>>>>>>>>>>>>>>> seconds it >>>>>>>>>>>>>>>>>>>>>> says guest1 started . But after a while I can see the >>>>>>>>>>>>>>>>>>>>>> guest1 on >>>>>>>>>>>>>>>>>>>>>> all >>>>>>>>>>>>>>>>>>>>>> three nodes. >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> clustat says: >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> ?Service Name >>>>>>>>>>>>>>>>>>>>>> ?Owner >>>>>>>>>>>>>>>>>>>>>> (Last) >>>>>>>>>>>>>>>>>>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?State >>>>>>>>>>>>>>>>>>>>>> ?------- ---- >>>>>>>>>>>>>>>>>>>>>> ?----- >>>>>>>>>>>>>>>>>>>>>> ------ >>>>>>>>>>>>>>>>>>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?----- >>>>>>>>>>>>>>>>>>>>>> ?vm:guest1 >>>>>>>>>>>>>>>>>>>>>> (none) >>>>>>>>>>>>>>>>>>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?stopped >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> But I can see the vm from xm li. >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> This is what I can see from the log: >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> Sep 25 17:19:01 cvtst1 clurgmgrd[4298]: start >>>>>>>>>>>>>>>>>>>>>> on >>>>>>>>>>>>>>>>>>>>>> vm >>>>>>>>>>>>>>>>>>>>>> "guest1" >>>>>>>>>>>>>>>>>>>>>> returned 1 (generic error) >>>>>>>>>>>>>>>>>>>>>> Sep 25 17:19:01 cvtst1 clurgmgrd[4298]: #68: >>>>>>>>>>>>>>>>>>>>>> Failed >>>>>>>>>>>>>>>>>>>>>> to start >>>>>>>>>>>>>>>>>>>>>> vm:guest1; return value: 1 >>>>>>>>>>>>>>>>>>>>>> Sep 25 17:19:01 cvtst1 clurgmgrd[4298]: >>>>>>>>>>>>>>>>>>>>>> Stopping >>>>>>>>>>>>>>>>>>>>>> service vm:guest1 >>>>>>>>>>>>>>>>>>>>>> Sep 25 17:19:02 cvtst1 clurgmgrd[4298]: >>>>>>>>>>>>>>>>>>>>>> Service >>>>>>>>>>>>>>>>>>>>>> vm:guest1 is >>>>>>>>>>>>>>>>>>>>>> recovering >>>>>>>>>>>>>>>>>>>>>> Sep 25 17:19:15 cvtst1 clurgmgrd[4298]: >>>>>>>>>>>>>>>>>>>>>> Recovering >>>>>>>>>>>>>>>>>>>>>> failed >>>>>>>>>>>>>>>>>>>>>> service vm:guest1 >>>>>>>>>>>>>>>>>>>>>> Sep 25 17:19:16 cvtst1 clurgmgrd[4298]: start >>>>>>>>>>>>>>>>>>>>>> on >>>>>>>>>>>>>>>>>>>>>> vm >>>>>>>>>>>>>>>>>>>>>> "guest1" >>>>>>>>>>>>>>>>>>>>>> returned 1 (generic error) >>>>>>>>>>>>>>>>>>>>>> Sep 25 17:19:16 cvtst1 clurgmgrd[4298]: #68: >>>>>>>>>>>>>>>>>>>>>> Failed >>>>>>>>>>>>>>>>>>>>>> to start >>>>>>>>>>>>>>>>>>>>>> vm:guest1; return value: 1 >>>>>>>>>>>>>>>>>>>>>> Sep 25 17:19:16 cvtst1 clurgmgrd[4298]: >>>>>>>>>>>>>>>>>>>>>> Stopping >>>>>>>>>>>>>>>>>>>>>> service vm:guest1 >>>>>>>>>>>>>>>>>>>>>> Sep 25 17:19:17 cvtst1 clurgmgrd[4298]: >>>>>>>>>>>>>>>>>>>>>> Service >>>>>>>>>>>>>>>>>>>>>> vm:guest1 is >>>>>>>>>>>>>>>>>>>>>> recovering >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> Paras. >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> On Fri, Sep 25, 2009 at 5:07 PM, brem belguebli >>>>>>>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> Have you started ?your VM via rgmanager (clusvcadm -e >>>>>>>>>>>>>>>>>>>>>>> vm:guest1) or >>>>>>>>>>>>>>>>>>>>>>> using xm commands out of cluster control ?(or maybe a >>>>>>>>>>>>>>>>>>>>>>> thru >>>>>>>>>>>>>>>>>>>>>>> an >>>>>>>>>>>>>>>>>>>>>>> automatic init script ?) 
>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> When clustered, you should never be starting services >>>>>>>>>>>>>>>>>>>>>>> (manually or >>>>>>>>>>>>>>>>>>>>>>> thru automatic init script) out of cluster control >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> The thing would be to stop your vm on all the nodes >>>>>>>>>>>>>>>>>>>>>>> with >>>>>>>>>>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>>>>>>>>> adequate >>>>>>>>>>>>>>>>>>>>>>> xm command (not using xen myself) and try to start it >>>>>>>>>>>>>>>>>>>>>>> with >>>>>>>>>>>>>>>>>>>>>>> clusvcadm. >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> Then see if it is started on all nodes (send clustat >>>>>>>>>>>>>>>>>>>>>>> output) >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> 2009/9/25 Paras pradhan : >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> Ok. Please see below. my vm is running on all nodes >>>>>>>>>>>>>>>>>>>>>>>> though >>>>>>>>>>>>>>>>>>>>>>>> clustat >>>>>>>>>>>>>>>>>>>>>>>> says it is stopped. >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>>>>>>> [root at cvtst1 ~]# clustat >>>>>>>>>>>>>>>>>>>>>>>> Cluster Status for test @ Fri Sep 25 16:52:34 2009 >>>>>>>>>>>>>>>>>>>>>>>> Member Status: Quorate >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> ?Member Name >>>>>>>>>>>>>>>>>>>>>>>> ?ID ? Status >>>>>>>>>>>>>>>>>>>>>>>> ?------ ---- >>>>>>>>>>>>>>>>>>>>>>>> ?---- ------ >>>>>>>>>>>>>>>>>>>>>>>> ?cvtst2 >>>>>>>>>>>>>>>>>>>>>>>> ?1 >>>>>>>>>>>>>>>>>>>>>>>> Online, rgmanager >>>>>>>>>>>>>>>>>>>>>>>> ?cvtst1 >>>>>>>>>>>>>>>>>>>>>>>> ?2 >>>>>>>>>>>>>>>>>>>>>>>> Online, >>>>>>>>>>>>>>>>>>>>>>>> Local, rgmanager >>>>>>>>>>>>>>>>>>>>>>>> ?cvtst3 >>>>>>>>>>>>>>>>>>>>>>>> ?3 >>>>>>>>>>>>>>>>>>>>>>>> Online, rgmanager >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> ?Service Name >>>>>>>>>>>>>>>>>>>>>>>> ?Owner (Last) >>>>>>>>>>>>>>>>>>>>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?State >>>>>>>>>>>>>>>>>>>>>>>> ?------- ---- >>>>>>>>>>>>>>>>>>>>>>>> ?----- ------ >>>>>>>>>>>>>>>>>>>>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?----- >>>>>>>>>>>>>>>>>>>>>>>> ?vm:guest1 >>>>>>>>>>>>>>>>>>>>>>>> (none) >>>>>>>>>>>>>>>>>>>>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?stopped >>>>>>>>>>>>>>>>>>>>>>>> [root at cvtst1 ~]# >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> --- >>>>>>>>>>>>>>>>>>>>>>>> o/p of xm li on cvtst1 >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>>>>>>> [root at cvtst1 ~]# xm li >>>>>>>>>>>>>>>>>>>>>>>> Name ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?ID Mem(MiB) >>>>>>>>>>>>>>>>>>>>>>>> VCPUs >>>>>>>>>>>>>>>>>>>>>>>> State ? Time(s) >>>>>>>>>>>>>>>>>>>>>>>> Domain-0 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? 0 ? ? 3470 >>>>>>>>>>>>>>>>>>>>>>>> 2 >>>>>>>>>>>>>>>>>>>>>>>> r----- ?28939.4 >>>>>>>>>>>>>>>>>>>>>>>> guest1 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? 7 ? ? ?511 >>>>>>>>>>>>>>>>>>>>>>>> 1 >>>>>>>>>>>>>>>>>>>>>>>> -b---- ? 7727.8 >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> o/p of xm li on cvtst2 >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>>>>>>> [root at cvtst2 ~]# xm li >>>>>>>>>>>>>>>>>>>>>>>> Name ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?ID Mem(MiB) >>>>>>>>>>>>>>>>>>>>>>>> VCPUs >>>>>>>>>>>>>>>>>>>>>>>> State ? Time(s) >>>>>>>>>>>>>>>>>>>>>>>> Domain-0 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? 0 ? ? 3470 >>>>>>>>>>>>>>>>>>>>>>>> 2 >>>>>>>>>>>>>>>>>>>>>>>> r----- ?31558.9 >>>>>>>>>>>>>>>>>>>>>>>> guest1 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?21 ? ? ?511 >>>>>>>>>>>>>>>>>>>>>>>> 1 >>>>>>>>>>>>>>>>>>>>>>>> -b---- ? 
7558.2 >>>>>>>>>>>>>>>>>>>>>>>> --- >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> Thanks >>>>>>>>>>>>>>>>>>>>>>>> Paras. >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> On Fri, Sep 25, 2009 at 4:22 PM, brem belguebli >>>>>>>>>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> It looks like no. >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> can you send an output of clustat ?of when the VM is >>>>>>>>>>>>>>>>>>>>>>>>> running >>>>>>>>>>>>>>>>>>>>>>>>> on >>>>>>>>>>>>>>>>>>>>>>>>> multiple nodes at the same time? >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> And by the way, another one after having stopped >>>>>>>>>>>>>>>>>>>>>>>>> (clusvcadm >>>>>>>>>>>>>>>>>>>>>>>>> -s vm:guest1) ? >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> 2009/9/25 Paras pradhan : >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> Anyone having issue as mine? Virtual machine >>>>>>>>>>>>>>>>>>>>>>>>>> service >>>>>>>>>>>>>>>>>>>>>>>>>> is >>>>>>>>>>>>>>>>>>>>>>>>>> not >>>>>>>>>>>>>>>>>>>>>>>>>> being >>>>>>>>>>>>>>>>>>>>>>>>>> properly handled by the cluster. >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> Thanks >>>>>>>>>>>>>>>>>>>>>>>>>> Paras. >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> On Mon, Sep 21, 2009 at 9:55 AM, Paras pradhan >>>>>>>>>>>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>> Ok.. here is my cluster.conf file >>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>>>>>>>>>> [root at cvtst1 cluster]# more cluster.conf >>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> name="test"> >>>>>>>>>>>>>>>>>>>>>>>>>>> ? ?>>>>>>>>>>>>>>>>>>>>>>>>>> post_fail_delay="0" >>>>>>>>>>>>>>>>>>>>>>>>>>> post_join_delay="3"/> >>>>>>>>>>>>>>>>>>>>>>>>>>> ? ? >>>>>>>>>>>>>>>>>>>>>>>>>>> ? ? ? ? ? ?>>>>>>>>>>>>>>>>>>>>>>>>>> votes="1"> >>>>>>>>>>>>>>>>>>>>>>>>>>> ? ? ? ? ? ? ? ? ? ? >>>>>>>>>>>>>>>>>>>>>>>>>>> ? ? ? ? ? ? >>>>>>>>>>>>>>>>>>>>>>>>>>> ? ? ? ? ? ?>>>>>>>>>>>>>>>>>>>>>>>>>> votes="1"> >>>>>>>>>>>>>>>>>>>>>>>>>>> ? ? ? ? ? ? ? ? ? ? >>>>>>>>>>>>>>>>>>>>>>>>>>> ? ? ? ? ? ? >>>>>>>>>>>>>>>>>>>>>>>>>>> ? ? ? ? ? ?>>>>>>>>>>>>>>>>>>>>>>>>>> votes="1"> >>>>>>>>>>>>>>>>>>>>>>>>>>> ? ? ? ? ? ? ? ? ? ? >>>>>>>>>>>>>>>>>>>>>>>>>>> ? ? ? ? ? ? >>>>>>>>>>>>>>>>>>>>>>>>>>> ? ? >>>>>>>>>>>>>>>>>>>>>>>>>>> ? ? >>>>>>>>>>>>>>>>>>>>>>>>>>> ? ? >>>>>>>>>>>>>>>>>>>>>>>>>>> ? ? >>>>>>>>>>>>>>>>>>>>>>>>>>> ? ? ? ? ? ? >>>>>>>>>>>>>>>>>>>>>>>>>>> ? ? ? ? ? ? ? ? ? ?>>>>>>>>>>>>>>>>>>>>>>>>>> nofailback="0" ordered="1" restricted="0"> >>>>>>>>>>>>>>>>>>>>>>>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ?>>>>>>>>>>>>>>>>>>>>>>>>>> name="cvtst2" priority="3"/> >>>>>>>>>>>>>>>>>>>>>>>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ?>>>>>>>>>>>>>>>>>>>>>>>>>> name="cvtst1" priority="1"/> >>>>>>>>>>>>>>>>>>>>>>>>>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ?>>>>>>>>>>>>>>>>>>>>>>>>>> name="cvtst3" priority="2"/> >>>>>>>>>>>>>>>>>>>>>>>>>>> ? ? ? ? ? ? ? ? ? ? >>>>>>>>>>>>>>>>>>>>>>>>>>> ? ? ? ? ? ? >>>>>>>>>>>>>>>>>>>>>>>>>>> ? ? ? ? ? ? >>>>>>>>>>>>>>>>>>>>>>>>>>> ? ? ? ? ? ?>>>>>>>>>>>>>>>>>>>>>>>>>> exclusive="0" max_restarts="0" >>>>>>>>>>>>>>>>>>>>>>>>>>> name="guest1" path="/vms" recovery="r >>>>>>>>>>>>>>>>>>>>>>>>>>> estart" restart_expire_time="0"/> >>>>>>>>>>>>>>>>>>>>>>>>>>> ? ? 
>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>> [root at cvtst1 cluster]# >>>>>>>>>>>>>>>>>>>>>>>>>>> ------ >>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks! >>>>>>>>>>>>>>>>>>>>>>>>>>> Paras. >>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>> On Sun, Sep 20, 2009 at 9:44 AM, Volker Dormeyer >>>>>>>>>>>>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> On Fri, Sep 18, 2009 at 05:08:57PM -0500, >>>>>>>>>>>>>>>>>>>>>>>>>>>> Paras pradhan wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> I am using cluster suite for HA of xen virtual >>>>>>>>>>>>>>>>>>>>>>>>>>>>> machines. >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Now I am >>>>>>>>>>>>>>>>>>>>>>>>>>>>> having another problem. When I start the my xen >>>>>>>>>>>>>>>>>>>>>>>>>>>>> vm >>>>>>>>>>>>>>>>>>>>>>>>>>>>> in >>>>>>>>>>>>>>>>>>>>>>>>>>>>> one node, it >>>>>>>>>>>>>>>>>>>>>>>>>>>>> also starts on other nodes. Which daemon >>>>>>>>>>>>>>>>>>>>>>>>>>>>> controls >>>>>>>>>>>>>>>>>>>>>>>>>>>>> ?this? >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> This is usually done bei clurgmgrd (which is part >>>>>>>>>>>>>>>>>>>>>>>>>>>> of >>>>>>>>>>>>>>>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>>>>>>>>>>>>>> rgmanager >>>>>>>>>>>>>>>>>>>>>>>>>>>> package). To me, this sounds like a configuration >>>>>>>>>>>>>>>>>>>>>>>>>>>> problem. Maybe, >>>>>>>>>>>>>>>>>>>>>>>>>>>> you can post your cluster.conf? >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards, >>>>>>>>>>>>>>>>>>>>>>>>>>>> Volker >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>>>>>>>>>>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>>>>>>>>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>>>>>>>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>>>>>>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>>>>>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>>>>>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>>>>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>>>>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>>>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>>>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> 
-- >>>>>>>>>>>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>> -- >>>>>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>> -- >>>>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>> -- >>>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>> -- >>>>>>>>>>> Linux-cluster mailing list >>>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>>> >>>>>>>>>> -- >>>>>>>>>> Linux-cluster mailing list >>>>>>>>>> Linux-cluster at redhat.com >>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>>> >>>>>>>>> -- >>>>>>>>> Linux-cluster mailing list >>>>>>>>> Linux-cluster at redhat.com >>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>>> >>>>>>>> -- >>>>>>>> Linux-cluster mailing list >>>>>>>> Linux-cluster at redhat.com >>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>>> >>>>>>> -- >>>>>>> Linux-cluster mailing list >>>>>>> Linux-cluster at redhat.com >>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>>> >>>>>> -- >>>>>> - Daniela Anzellotti ------------------------------------ >>>>>> ?INFN Roma - tel.: +39.06.49914282 - fax: +39.06.490354 >>>>>> ?e-mail: daniela.anzellotti at roma1.infn.it >>>>>> --------------------------------------------------------- >>>>>> >>>>>> -- >>>>>> Linux-cluster mailing list >>>>>> Linux-cluster at redhat.com >>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>> >>>>> -- >>>>> Linux-cluster mailing list >>>>> Linux-cluster at redhat.com >>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>> >>>> -- >>>> - Daniela Anzellotti ------------------------------------ >>>> ?INFN Roma - tel.: +39.06.49914282 - fax: +39.06.490354 >>>> ?e-mail: daniela.anzellotti at roma1.infn.it >>>> --------------------------------------------------------- >>>> >>>> -- >>>> Linux-cluster mailing list >>>> Linux-cluster at redhat.com >>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>> >>> >>> -- >>> Linux-cluster mailing list >>> Linux-cluster at redhat.com >>> https://www.redhat.com/mailman/listinfo/linux-cluster 
>>> >> >> -- >> - Daniela Anzellotti ------------------------------------ >> ?INFN Roma - tel.: +39.06.49914282 - fax: +39.06.490354 >> ?e-mail: daniela.anzellotti at roma1.infn.it >> --------------------------------------------------------- >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> > From gianluca.cecchi at gmail.com Wed Oct 7 15:03:50 2009 From: gianluca.cecchi at gmail.com (Gianluca Cecchi) Date: Wed, 7 Oct 2009 17:03:50 +0200 Subject: [Linux-cluster] Suggestion for backbone network maintenance Message-ID: <561c252c0910070803t77c5a56fxcc7ab29c75ba0697@mail.gmail.com> Hello, cluster rh el 5.3 with 2 nodes and a quorum disk with heuristics. The nodes are in different sites. At this moment inside cluster.conf I have this: there is a planning for backbone network maintenance and I'm gong to have interruption on backbone switches in both sides, so that the gw will be not reachable in different time windows..... I would like to "downgrade" the cluster to a sort of standalone server, to prevent scenario of shutdown abort of oracle, due to ping-pong of service.... and without stopping service itself. Only thing is that when the network link will be down, the service will be down too (accettable by users), but without impact to the backend db on the cluster. My idea is to temporarily extend tko inside heuristic, so that the count-down for loss of quorum will not arrive at a end during downtime. So that if overall planned maintenance is 2h I will put something like tko=1500 on the fly. So my planned commands would be, where node1 is providing service and node2 is standby: node2 - shutdown -h now (to simplify things...) node1 - change to cluster.conf incrementing version number and putting tko=1500 - ccs_tool update /etc/cluster/cluster.conf - cman_tool version -r the new tko will be dynamically in place? If so, wait for maintenance completion and then node1 - change to cluster.conf incrementing version number and putting tko=20 again - ccs_tool update /etc/cluster/cluster.conf - cman_tool version -r node2 power on it will keep the new configuration without downsides (correct?) Any hints or comments? Thanks in advance, Gianluca -------------- next part -------------- An HTML attachment was scrubbed... URL: From arturogf at ugr.es Thu Oct 8 01:00:54 2009 From: arturogf at ugr.es (Arturo Gonzalez Ferrer) Date: Thu, 8 Oct 2009 03:00:54 +0200 Subject: [Linux-cluster] multipath devices and Conga storage tab Message-ID: Hello, I'm using RHEL 5.4 with an HP MSA2312fc array, and I'm trying to configure a 3-node cluster sharing a GFS filesystem. I want to do this by using Conga. All went good, the cluster is up, but when going to the "storage" tab, I saw 8 devices (sda, sdb...sdh). This is caused of course by having two HBA in every node (there is another 4th node that is going to access another volume on the MSA). So, I went for multipathing: I've already installed HPDM (hp multipath) in the first node and I can see now on the command line both volumes that I created (/dev/mapper/mpath0 with 1TB and /dev/mapper/mpath1 with 600GB), but the problem is that in Conga I can only see the previous 8 devices... so I'm starting to think that Conga does not look at multipath devices...or surely I don't know what I'm doing wrong in order to display them somehow. Could you give me some direction to follow now? There is no much documentation about Conga storage management. Cheers, Arturo. 
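A minimal command sketch of the multipath-to-clustered-GFS2 steps discussed in this thread; the device, volume group, size and cluster names below are placeholders rather than values taken from the setup above, and clvmd is assumed to be running on all nodes:

# confirm both HBA paths are grouped under a single mpath device
multipath -ll
# layer clustered LVM on the multipath device, not on the /dev/sdX paths
pvcreate /dev/mapper/mpath0p1
vgcreate -cy vg_cluster /dev/mapper/mpath0p1
lvcreate -n lv_gfs -L 500G vg_cluster
# one journal per node that will mount the filesystem (3 here);
# the -t prefix must match the cluster name in cluster.conf
mkfs.gfs2 -p lock_dlm -t mycluster:lv_gfs -j 3 /dev/vg_cluster/lv_gfs

Once the physical volume exists on the multipath device, the same clustered volume group and GFS2 filesystem can also be created from Conga, which is how things get resolved further down this thread.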
-------------- next part -------------- An HTML attachment was scrubbed... URL: From brem.belguebli at gmail.com Thu Oct 8 06:20:02 2009 From: brem.belguebli at gmail.com (brem belguebli) Date: Thu, 8 Oct 2009 08:20:02 +0200 Subject: [Linux-cluster] multipath devices and Conga storage tab In-Reply-To: References: Message-ID: <29ae894c0910072320q21d85b0cjffdba0d6ce8b8e9a@mail.gmail.com> Hello Arturo, indeed Conga doesn't seem to summarize all the /dev/sd devices to their belonging mpath one. A potential side effect of this being ?if you have an important number of /dev/sd devices conga will timeout without showing you any result. I personnaly gave up using conga. About HP-DM, there is a OS's shipped solution (on which relies HP-DM!) called device-mapper-multipath, there's no necessity to use the HP one. Brem 2009/10/8 Arturo Gonzalez Ferrer > > Hello, > I'm using RHEL 5.4 with an HP MSA2312fc array, and I'm trying to configure a 3-node cluster sharing a GFS filesystem. I want to do this by using Conga. All went good, the cluster is up, but when going to the "storage" tab, I saw 8 devices (sda, sdb...sdh). This is caused of course by having two HBA in every node (there is another 4th node that is going to access another volume on the MSA). > So, I went for multipathing: > I've already installed HPDM (hp multipath) in the first node and I can see now on the command line both volumes that I created (/dev/mapper/mpath0 with 1TB and /dev/mapper/mpath1 with 600GB), but the problem is that in Conga I can only see the previous 8 devices... so I'm starting to think that Conga does not look at multipath devices...or surely I don't know what I'm doing wrong in order to display them somehow. > Could you give me some direction to follow now? There is no much documentation about Conga storage management. > Cheers, > Arturo. > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From arturogf at gmail.com Thu Oct 8 07:03:52 2009 From: arturogf at gmail.com (Arturo Gonzalez Ferrer) Date: Thu, 8 Oct 2009 09:03:52 +0200 Subject: [Linux-cluster] multipath devices and Conga storage tab In-Reply-To: <29ae894c0910072320q21d85b0cjffdba0d6ce8b8e9a@mail.gmail.com> References: <29ae894c0910072320q21d85b0cjffdba0d6ce8b8e9a@mail.gmail.com> Message-ID: 2009/10/8 brem belguebli > Hello Arturo, > indeed Conga doesn't seem to summarize all the /dev/sd devices to > their belonging mpath one. > A potential side effect of this being if you have an important number > of /dev/sd devices conga will timeout without showing you any result. > Well, finally i got to expose it to conga, and it was easy. Only i had to do is: pvcreate /dev/mapper/mpath0p1 Once the physical volume exists, it appears in conga when creating a new volume group, so you can choose it. Doing the same on the rest of the nodes i finally could create the clustered volume group and the corresponding gfs2 filesystem, and mount it on 3 nodes. > > I personnaly gave up using conga. > > About HP-DM, there is a OS's shipped solution (on which relies HP-DM!) > called device-mapper-multipath, there's no necessity to use the HP > one. > Uhm, I know, but supposedly it give support for my MSA2312fc ... so i don't know if it would run only with device-mapper-multipath. Thank you very much, Arturo. > Brem > > 2009/10/8 Arturo Gonzalez Ferrer > > > > Hello, > > I'm using RHEL 5.4 with an HP MSA2312fc array, and I'm trying to > configure a 3-node cluster sharing a GFS filesystem. 
I want to do this by > using Conga. All went good, the cluster is up, but when going to the > "storage" tab, I saw 8 devices (sda, sdb...sdh). This is caused of course by > having two HBA in every node (there is another 4th node that is going to > access another volume on the MSA). > > So, I went for multipathing: > > I've already installed HPDM (hp multipath) in the first node and I can > see now on the command line both volumes that I created (/dev/mapper/mpath0 > with 1TB and /dev/mapper/mpath1 with 600GB), but the problem is that in > Conga I can only see the previous 8 devices... so I'm starting to think that > Conga does not look at multipath devices...or surely I don't know what I'm > doing wrong in order to display them somehow. > > Could you give me some direction to follow now? There is no much > documentation about Conga storage management. > > Cheers, > > Arturo. > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From volker at ixolution.de Thu Oct 8 07:37:08 2009 From: volker at ixolution.de (Volker Dormeyer) Date: Thu, 8 Oct 2009 09:37:08 +0200 Subject: [Linux-cluster] multipath devices and Conga storage tab In-Reply-To: References: <29ae894c0910072320q21d85b0cjffdba0d6ce8b8e9a@mail.gmail.com> Message-ID: <20091008073708.GA2663@dijkstra> Hi, On Thu, Oct 08, 2009 at 09:03:52AM +0200, Arturo Gonzalez Ferrer wrote: > Uhm, I know, but supposedly it give support for my MSA2312fc ... so i don't > know if it would run only with device-mapper-multipath. I guess you downloaded the package from http://www.hp.com/go/devicemapper right? All you need from the package is the template for the multipath configuration which contains the necessary settings for the HP-arrays. For RHEL5, this should be: multipath.conf.HPTemplate.RHEL5 Once integrated into /etc/multipath.conf, you can run your system with the device-mapper-multipath tools being shipped with RHEL5. Regards, Volker From arturogf at gmail.com Thu Oct 8 07:43:47 2009 From: arturogf at gmail.com (Arturo Gonzalez Ferrer) Date: Thu, 8 Oct 2009 09:43:47 +0200 Subject: [Linux-cluster] multipath devices and Conga storage tab In-Reply-To: <20091008073708.GA2663@dijkstra> References: <29ae894c0910072320q21d85b0cjffdba0d6ce8b8e9a@mail.gmail.com> <20091008073708.GA2663@dijkstra> Message-ID: 2009/10/8 Volker Dormeyer > Hi, > > On Thu, Oct 08, 2009 at 09:03:52AM +0200, > Arturo Gonzalez Ferrer wrote: > > Uhm, I know, but supposedly it give support for my MSA2312fc ... so i > don't > > know if it would run only with device-mapper-multipath. > > I guess you downloaded the package from > > http://www.hp.com/go/devicemapper > > right? All you need from the package is the template for the multipath > configuration which contains the necessary settings for the HP-arrays. > For RHEL5, this should be: multipath.conf.HPTemplate.RHEL5 > > Once integrated into /etc/multipath.conf, you can run your system with > the device-mapper-multipath tools being shipped with RHEL5. > > Ok, thank you :) > Regards, > Volker > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From gianluca.cecchi at gmail.com Thu Oct 8 13:04:20 2009 From: gianluca.cecchi at gmail.com (Gianluca Cecchi) Date: Thu, 8 Oct 2009 15:04:20 +0200 Subject: [Linux-cluster] Re: Suggestion for backbone network maintenance In-Reply-To: <561c252c0910070803t77c5a56fxcc7ab29c75ba0697@mail.gmail.com> References: <561c252c0910070803t77c5a56fxcc7ab29c75ba0697@mail.gmail.com> Message-ID: <561c252c0910080604j1e238311r4f1b24d3337dccf@mail.gmail.com> On Wed, Oct 7, 2009 at 5:03 PM, Gianluca Cecchi wrote: > Hello, > cluster rh el 5.3 with 2 nodes and a quorum disk with heuristics. The nodes > are in different sites. > At this moment inside cluster.conf I have this: > > log_facility="local4" log_level="7" tko="16" votes="1"> > score="1" tko="20"/> > > > [snip] > > > It seems it doesn't work as I expected.... You have to manually restart qdisk daemon to have it catch the changes. I would expect cluster manager to communicate with it when you do a ccs_tool update.... qdiskd seems not to have a sort of reload function... (based on init script options at least) And also, in my situation, it is better to have both the nodes up'n'running. In fact when you restart qdiskd it actually takes about 2 minutes and 10 seconds to re-register and count as one vote out of three. Some seconds before of this, I get the emergency message where I lost quorum and my services (FS and IP) are suddenly stopped and then restarted when quorum regained..... So the successfull steps are, at least in my case: node 1 and 2 both up and running cluster services node1 - change to cluster.conf incrementing version number and putting tko=1500 - ccs_tool update /etc/cluster/cluster.conf - cman_tool version -r (is this still necessary?????) - service qdiskd restart; sleep 2; service qdiskd start (sometimes due to a bug in qdiskd it doesn't suddenly start, even if you do stop/start; so that for safe I have to put a new start command just after the first attempt... more precisely: bug https://bugzilla.redhat.com/show_bug.cgi?id=485199 I'm in cman 2.0.98-1.el5_3.1 to simulate my prod cluster and this bug seems to be first fixed in rh el 5.4 with cman-2.0.115-1.el5, then superseded a few day after by important fix 2.0.115-1.el5_4.2 ) Anyway, after about 2 minutes and 10 seconds the qdiskd finishes its initialization and synchronises with the other one... Now I can go to node2 and run on it - service qdiskd restart; sleep 2; service qdiskd start This way both the nodes are aligned with qdiskd changes. In my case then I can shutdown node2 and go through waiting network people tell me that maintenance is finished, to re-apply initial configuration.... Comments are again welcome obviously ;-) Gianluca -------------- next part -------------- An HTML attachment was scrubbed... URL: From Nitin.Choudhary at palm.com Thu Oct 8 13:38:10 2009 From: Nitin.Choudhary at palm.com (Nitin Choudhary) Date: Thu, 8 Oct 2009 06:38:10 -0700 Subject: [Linux-cluster] Node does not join after reboot In-Reply-To: <29ae894c0910061303y34ea2e52n193319c3dc0ffdd6@mail.gmail.com> References: <29ae894c0910061303y34ea2e52n193319c3dc0ffdd6@mail.gmail.com> Message-ID: Hi! We have two node cluster with quorum drive. After we reboot/fence the node it does not rejoin the cluster. We have to stop the cluster and start at the same time. We had this issue in the beginning and then it started working and now again the same problem is experienced. Any idea? Thanks, Nitin -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From pradhanparas at gmail.com Thu Oct 8 16:17:23 2009 From: pradhanparas at gmail.com (Paras pradhan) Date: Thu, 8 Oct 2009 11:17:23 -0500 Subject: [Linux-cluster] Problem after fence Message-ID: <8b711df40910080917u85f2e2due038e73795eea844@mail.gmail.com> HI, Here is my setup node1: 1 vote node2 : 1 vote node3 : 1 vote qdisk: 3 votes I am using qdisk but not herusitics Yesterday we have a network outage and node1 and node2 were fecned by node 3. But after when node1 and node2 rebooted they were not able to join the cluster. Log says: --- Oct 8 11:12:36 cvtst1 ccsd[5537]: Error while processing connect: Connection refused Oct 8 11:12:36 cvtst1 ccsd[5537]: Cluster is not quorate. Refusing connection. --- cman_status on node1 o/p is: [root at cvtst1 cluster]# cman_tool status Version: 6.1.0 Config Version: 44 Cluster Name: test Cluster Id: 1678 Cluster Member: Yes Cluster Generation: 34984 Membership state: Cluster-Member Nodes: 2 Expected votes: 6 Total votes: 1 Quorum: 4 Activity blocked Active subsystems: 5 Flags: Ports Bound: 0 Node name: cvtst1 Node ID: 2 Multicast addresses: 239.192.6.148 Node addresses: x.x.5.165 I am using GFS2 and qdisk using fibre SAN. It seems like node1 and node2 are not able to add the votes of qdiskd and hence I am getting the problem. But why the votes are not added to the total cluster votes although I am using fibre SAN independent of the network. Thanks Paras. From Piche.Etienne at hydro.qc.ca Fri Oct 9 00:50:48 2009 From: Piche.Etienne at hydro.qc.ca (Piche.Etienne at hydro.qc.ca) Date: Thu, 8 Oct 2009 20:50:48 -0400 Subject: [Linux-cluster] Problem after fence In-Reply-To: <8b711df40910080917u85f2e2due038e73795eea844@mail.gmail.com> References: <8b711df40910080917u85f2e2due038e73795eea844@mail.gmail.com> Message-ID: Hi, I have the same problem with the same config: GFS2 and QDISK on fiber SAN but only 2 nodes ... I resolved it by enabling "qdiskd" deamon on startup : # chkconfig --level 2345 qdiskd on # reboot -fn Even if RH support did not tell me to do it. I'm now asking it on Red Hat Network Support. I can informe you what they will response me. Etienne -----Message d'origine----- De : linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] De la part de Paras pradhan Envoy? : 8 octobre 2009 12:17 ? : linux clustering Objet : [Linux-cluster] Problem after fence HI, Here is my setup node1: 1 vote node2 : 1 vote node3 : 1 vote qdisk: 3 votes I am using qdisk but not herusitics Yesterday we have a network outage and node1 and node2 were fecned by node 3. But after when node1 and node2 rebooted they were not able to join the cluster. Log says: --- Oct 8 11:12:36 cvtst1 ccsd[5537]: Error while processing connect: Connection refused Oct 8 11:12:36 cvtst1 ccsd[5537]: Cluster is not quorate. Refusing connection. --- cman_status on node1 o/p is: [root at cvtst1 cluster]# cman_tool status Version: 6.1.0 Config Version: 44 Cluster Name: test Cluster Id: 1678 Cluster Member: Yes Cluster Generation: 34984 Membership state: Cluster-Member Nodes: 2 Expected votes: 6 Total votes: 1 Quorum: 4 Activity blocked Active subsystems: 5 Flags: Ports Bound: 0 Node name: cvtst1 Node ID: 2 Multicast addresses: 239.192.6.148 Node addresses: x.x.5.165 I am using GFS2 and qdisk using fibre SAN. It seems like node1 and node2 are not able to add the votes of qdiskd and hence I am getting the problem. But why the votes are not added to the total cluster votes although I am using fibre SAN independent of the network. Thanks Paras. 
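To make Etienne's suggestion above a little more concrete, a rough sketch of the start-up and verification steps (an illustration, not a supported recovery procedure):

  # make sure the quorum-disk daemon comes back on its own after a reboot or fence
  chkconfig --level 2345 qdiskd on
  service qdiskd start

  # confirm the quorum device is found and that its votes are being counted
  mkqdisk -L
  cman_tool status | grep -E 'Expected votes|Total votes|Quorum'

If qdiskd is not running on a node, the quorum-disk votes are never added there, which is consistent with the "Total votes: 1 ... Activity blocked" output quoted above.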
-- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From scooter at cgl.ucsf.edu Fri Oct 9 16:55:06 2009 From: scooter at cgl.ucsf.edu (Scooter Morris) Date: Fri, 09 Oct 2009 09:55:06 -0700 Subject: [Linux-cluster] gfs2_tool settune demote_secs Message-ID: <4ACF6AEA.1000901@cgl.ucsf.edu> Hi all, On RHEL 5.3/5.4(?) we had changed the value of demote_secs to significantly improve the performance of our gfs2 filesystem for certain tasks (notably rm -r on large directories). I recently noticed that that tuning value is no longer available (part of a recent update, or part of 5.4?). Can someone tell me what, if anything replaces this? Is it now a mount option, or is there some other way to tune this value? Thanks in advance. -- scooter -------------- next part -------------- A non-text attachment was scrubbed... Name: scooter.vcf Type: text/x-vcard Size: 378 bytes Desc: not available URL: From swhiteho at redhat.com Fri Oct 9 17:01:36 2009 From: swhiteho at redhat.com (Steven Whitehouse) Date: Fri, 09 Oct 2009 18:01:36 +0100 Subject: [Linux-cluster] gfs2_tool settune demote_secs In-Reply-To: <4ACF6AEA.1000901@cgl.ucsf.edu> References: <4ACF6AEA.1000901@cgl.ucsf.edu> Message-ID: <1255107696.6052.564.camel@localhost.localdomain> Hi, On Fri, 2009-10-09 at 09:55 -0700, Scooter Morris wrote: > Hi all, > On RHEL 5.3/5.4(?) we had changed the value of demote_secs to > significantly improve the performance of our gfs2 filesystem for certain > tasks (notably rm -r on large directories). I recently noticed that > that tuning value is no longer available (part of a recent update, or > part of 5.4?). Can someone tell me what, if anything replaces this? Is > it now a mount option, or is there some other way to tune this value? > > Thanks in advance. > > -- scooter > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster Nothing replaces it. The glocks are disposed of automatically on an LRU basis when there is enough memory pressure to require it. You can alter the amount of memory pressure on the VFS caches (including the glocks) but not specifically the glocks themselves. The idea is that is should be self-tuning now, adjusting itself to the conditions prevailing at the time. If there are any remaining performance issues though, we'd like to know so that they can be addressed, Steve. From linux at alteeve.com Fri Oct 9 17:13:10 2009 From: linux at alteeve.com (Madison Kelly) Date: Fri, 09 Oct 2009 13:13:10 -0400 Subject: [Linux-cluster] OpenAIS issue on CentOS 5.3 Message-ID: <4ACF6F26.6020309@alteeve.com> Hi all, I've been fussing with a test cluster (2-node) for a bit now. I had it working, but I had very little luck with test failure and recovery. So I decided to start over and follow the "Redhat" way. Specifically, I was following along with their "Configuring and Managing a Red Hat Cluster; Red Hat Cluster for Red Hat Enterprise 5" PDF. I've gotten to the point where, using luci, the cluster was built. However, the nodes haven't joined and trying to use 'have node join cluster' fails and generates the following in '/var/log/messages': --------------------------------------------- Oct 9 13:15:45 vsh02 luci[22301]: Unable to retrieve batch 531050721 status from vsh02.canadaequity.com:11111: module scheduled for execution Oct 9 13:15:46 vsh02 ccsd[24724]: Unable to connect to cluster infrastructure after 154350 seconds. 
Oct 9 13:15:47 vsh02 openais[31632]: [MAIN ] AIS Executive Service RELEASE 'subrev 1358 version 0.80.3' Oct 9 13:15:47 vsh02 openais[31632]: [MAIN ] Copyright (C) 2002-2006 MontaVista Software, Inc and contributors. Oct 9 13:15:47 vsh02 openais[31632]: [MAIN ] Copyright (C) 2006 Red Hat, Inc. Oct 9 13:15:47 vsh02 openais[31632]: [MAIN ] AIS Executive Service: started and ready to provide service. Oct 9 13:15:47 vsh02 openais[31632]: [MAIN ] Using default multicast address of 239.192.119.37 Oct 9 13:15:47 vsh02 openais[31632]: [MAIN ] openais component openais_cpg loaded. Oct 9 13:15:47 vsh02 openais[31632]: [MAIN ] Registering service handler 'openais cluster closed process group service v1.01' Oct 9 13:15:47 vsh02 openais[31632]: [MAIN ] openais component openais_cfg loaded. Oct 9 13:15:47 vsh02 openais[31632]: [MAIN ] Registering service handler 'openais configuration service' Oct 9 13:15:47 vsh02 openais[31632]: [MAIN ] openais component openais_msg loaded. Oct 9 13:15:47 vsh02 openais[31632]: [MAIN ] Registering service handler 'openais message service B.01.01' Oct 9 13:15:47 vsh02 openais[31632]: [MAIN ] openais component openais_lck loaded. Oct 9 13:15:47 vsh02 openais[31632]: [MAIN ] Registering service handler 'openais distributed locking service B.01.01' Oct 9 13:15:47 vsh02 openais[31632]: [MAIN ] openais component openais_evt loaded. Oct 9 13:15:47 vsh02 openais[31632]: [MAIN ] Registering service handler 'openais event service B.01.01' Oct 9 13:15:47 vsh02 openais[31632]: [MAIN ] openais component openais_ckpt loaded. Oct 9 13:15:47 vsh02 openais[31632]: [MAIN ] Registering service handler 'openais checkpoint service B.01.01' Oct 9 13:15:47 vsh02 openais[31632]: [MAIN ] openais component openais_amf loaded. Oct 9 13:15:47 vsh02 openais[31632]: [MAIN ] Registering service handler 'openais availability management framework B.01.01' Oct 9 13:15:47 vsh02 openais[31632]: [MAIN ] openais component openais_clm loaded. Oct 9 13:15:47 vsh02 openais[31632]: [MAIN ] Registering service handler 'openais cluster membership service B.01.01' Oct 9 13:15:47 vsh02 openais[31632]: [MAIN ] openais component openais_evs loaded. Oct 9 13:15:47 vsh02 openais[31632]: [MAIN ] Registering service handler 'openais extended virtual synchrony service' Oct 9 13:15:47 vsh02 openais[31632]: [MAIN ] openais component openais_cman loaded. Oct 9 13:15:49 vsh02 openais[31682]: [MAIN ] AIS Executive Service RELEASE 'subrev 1358 version 0.80.3' Oct 9 13:15:49 vsh02 openais[31682]: [MAIN ] Copyright (C) 2002-2006 MontaVista Software, Inc and contributors. Oct 9 13:15:49 vsh02 openais[31682]: [MAIN ] Copyright (C) 2006 Red Hat, Inc. Oct 9 13:15:49 vsh02 openais[31682]: [MAIN ] AIS Executive Service: started and ready to provide service. Oct 9 13:15:49 vsh02 openais[31682]: [MAIN ] Using default multicast address of 239.192.119.37 Oct 9 13:15:49 vsh02 openais[31682]: [MAIN ] openais component openais_cpg loaded. Oct 9 13:15:49 vsh02 openais[31682]: [MAIN ] Registering service handler 'openais cluster closed process group service v1.01' Oct 9 13:15:49 vsh02 openais[31682]: [MAIN ] openais component openais_cfg loaded. Oct 9 13:15:49 vsh02 openais[31682]: [MAIN ] Registering service handler 'openais configuration service' Oct 9 13:15:49 vsh02 openais[31682]: [MAIN ] openais component openais_msg loaded. Oct 9 13:15:49 vsh02 openais[31682]: [MAIN ] Registering service handler 'openais message service B.01.01' Oct 9 13:15:49 vsh02 openais[31682]: [MAIN ] openais component openais_lck loaded. 
Oct 9 13:15:49 vsh02 openais[31682]: [MAIN ] Registering service handler 'openais distributed locking service B.01.01' Oct 9 13:15:49 vsh02 openais[31682]: [MAIN ] openais component openais_evt loaded. Oct 9 13:15:49 vsh02 openais[31682]: [MAIN ] Registering service handler 'openais event service B.01.01' Oct 9 13:15:49 vsh02 openais[31682]: [MAIN ] openais component openais_ckpt loaded. Oct 9 13:15:49 vsh02 openais[31682]: [MAIN ] Registering service handler 'openais checkpoint service B.01.01' Oct 9 13:15:49 vsh02 openais[31682]: [MAIN ] openais component openais_amf loaded. Oct 9 13:15:49 vsh02 openais[31682]: [MAIN ] Registering service handler 'openais availability management framework B.01.01' Oct 9 13:15:49 vsh02 openais[31682]: [MAIN ] openais component openais_clm loaded. Oct 9 13:15:49 vsh02 openais[31682]: [MAIN ] Registering service handler 'openais cluster membership service B.01.01' Oct 9 13:15:49 vsh02 openais[31682]: [MAIN ] openais component openais_evs loaded. Oct 9 13:15:49 vsh02 openais[31682]: [MAIN ] Registering service handler 'openais extended virtual synchrony service' Oct 9 13:15:49 vsh02 openais[31682]: [MAIN ] openais component openais_cman loaded. Oct 9 13:15:51 vsh02 luci[22301]: Unable to retrieve batch 531050721 status from vsh02.canadaequity.com:11111: service cman start failed: --------------------------------------------- When I try to start 'cman' from the command line, I get this error: --------------------------------------------- # service cman start Starting cluster: Enabling workaround for Xend bridged networking... done Loading modules... done Mounting configfs... done Starting ccsd... done Starting cman... failed /usr/sbin/cman_tool: aisexec daemon didn't start --------------------------------------------- This generates the same MAIN: openais errors. My cluster is pretty simple; - Two ASUS servers with three NICs each. One dedicated to a DRBD link. - IPMI for fencing - LVM running on DRBD (no SAN, I'm afraid) Any insight into what I might be doing wrong? Thanks! Madi From pradhanparas at gmail.com Fri Oct 9 17:17:17 2009 From: pradhanparas at gmail.com (Paras pradhan) Date: Fri, 9 Oct 2009 12:17:17 -0500 Subject: [Linux-cluster] OpenAIS issue on CentOS 5.3 In-Reply-To: <4ACF6F26.6020309@alteeve.com> References: <4ACF6F26.6020309@alteeve.com> Message-ID: <8b711df40910091017x3afc2cafjd9249dc6b0e93333@mail.gmail.com> See this thread http://www.mail-archive.com/linux-cluster at redhat.com/msg07020.html You need do downgrade CMAN Paras. On Fri, Oct 9, 2009 at 12:13 PM, Madison Kelly wrote: > Hi all, > > ?I've been fussing with a test cluster (2-node) for a bit now. I had it > working, but I had very little luck with test failure and recovery. So I > decided to start over and follow the "Redhat" way. Specifically, I was > following along with their "Configuring and Managing a Red Hat Cluster; Red > Hat Cluster for Red Hat Enterprise 5" PDF. > > ?I've gotten to the point where, using luci, the cluster was built. However, > the nodes haven't joined and trying to use 'have node join cluster' fails > and generates the following in '/var/log/messages': > > --------------------------------------------- > Oct ?9 13:15:45 vsh02 luci[22301]: Unable to retrieve batch 531050721 status > from vsh02.canadaequity.com:11111: module scheduled for execution > Oct ?9 13:15:46 vsh02 ccsd[24724]: Unable to connect to cluster > infrastructure after 154350 seconds. 
> [snip]
> Oct ?9 13:15:49 vsh02 openais[31682]: [MAIN ] Registering service handler > 'openais message service B.01.01' > Oct ?9 13:15:49 vsh02 openais[31682]: [MAIN ] openais component openais_lck > loaded. > Oct ?9 13:15:49 vsh02 openais[31682]: [MAIN ] Registering service handler > 'openais distributed locking service B.01.01' > Oct ?9 13:15:49 vsh02 openais[31682]: [MAIN ] openais component openais_evt > loaded. > Oct ?9 13:15:49 vsh02 openais[31682]: [MAIN ] Registering service handler > 'openais event service B.01.01' > Oct ?9 13:15:49 vsh02 openais[31682]: [MAIN ] openais component openais_ckpt > loaded. > Oct ?9 13:15:49 vsh02 openais[31682]: [MAIN ] Registering service handler > 'openais checkpoint service B.01.01' > Oct ?9 13:15:49 vsh02 openais[31682]: [MAIN ] openais component openais_amf > loaded. > Oct ?9 13:15:49 vsh02 openais[31682]: [MAIN ] Registering service handler > 'openais availability management framework B.01.01' > Oct ?9 13:15:49 vsh02 openais[31682]: [MAIN ] openais component openais_clm > loaded. > Oct ?9 13:15:49 vsh02 openais[31682]: [MAIN ] Registering service handler > 'openais cluster membership service B.01.01' > Oct ?9 13:15:49 vsh02 openais[31682]: [MAIN ] openais component openais_evs > loaded. > Oct ?9 13:15:49 vsh02 openais[31682]: [MAIN ] Registering service handler > 'openais extended virtual synchrony service' > Oct ?9 13:15:49 vsh02 openais[31682]: [MAIN ] openais component openais_cman > loaded. > Oct ?9 13:15:51 vsh02 luci[22301]: Unable to retrieve batch 531050721 status > from vsh02.canadaequity.com:11111: service cman start failed: > --------------------------------------------- > > ?When I try to start 'cman' from the command line, I get this error: > > --------------------------------------------- > # service cman start > Starting cluster: > ? Enabling workaround for Xend bridged networking... done > ? Loading modules... done > ? Mounting configfs... done > ? Starting ccsd... done > ? Starting cman... failed > /usr/sbin/cman_tool: aisexec daemon didn't start > --------------------------------------------- > > ?This generates the same MAIN: openais errors. > > ?My cluster is pretty simple; > - Two ASUS servers with three NICs each. One dedicated to a DRBD link. > - IPMI for fencing > - LVM running on DRBD (no SAN, I'm afraid) > > ?Any insight into what I might be doing wrong? > > Thanks! > > Madi > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From kbphillips80 at gmail.com Fri Oct 9 17:46:48 2009 From: kbphillips80 at gmail.com (Kaerka Phillips) Date: Fri, 9 Oct 2009 13:46:48 -0400 Subject: [Linux-cluster] gfs2_tool settune demote_secs In-Reply-To: <1255107696.6052.564.camel@localhost.localdomain> References: <4ACF6AEA.1000901@cgl.ucsf.edu> <1255107696.6052.564.camel@localhost.localdomain> Message-ID: If in gfs2 glocks are purged based upon memory constraints, what happens if it is run on a box with large amounts of memory? i.e. RHEL5.x with 128gb ram? We ended up having to move away from GFS2 due to serious performance issues with this exact setup, and our performance issues were largely centered around commands like ls or rm against gfs2 filesystems with large directory structures and millions of files in them. In our case, something as simple as copying a whole filesystem to another filesystem would cause a load avg of 50 or more, and would take 8+ hours to complete. The same thing on NFS or ext3 would take usually 1 to 2 hours. 
Netbackup of 10 of those filesystems took ~40 hours to complete, so we were getting maybe 1 good backup per week, and in some cases the backup itself caused cluster crash. We are still using our GFS1 clusters, since as long as their network is stable, their performance is very good, but we are phasing out most of our GFS2 clusters to NFS instead. On Fri, Oct 9, 2009 at 1:01 PM, Steven Whitehouse wrote: > Hi, > > On Fri, 2009-10-09 at 09:55 -0700, Scooter Morris wrote: > > Hi all, > > On RHEL 5.3/5.4(?) we had changed the value of demote_secs to > > significantly improve the performance of our gfs2 filesystem for certain > > tasks (notably rm -r on large directories). I recently noticed that > > that tuning value is no longer available (part of a recent update, or > > part of 5.4?). Can someone tell me what, if anything replaces this? Is > > it now a mount option, or is there some other way to tune this value? > > > > Thanks in advance. > > > > -- scooter > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > Nothing replaces it. The glocks are disposed of automatically on an LRU > basis when there is enough memory pressure to require it. You can alter > the amount of memory pressure on the VFS caches (including the glocks) > but not specifically the glocks themselves. > > The idea is that is should be self-tuning now, adjusting itself to the > conditions prevailing at the time. If there are any remaining > performance issues though, we'd like to know so that they can be > addressed, > > Steve. > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From scooter at cgl.ucsf.edu Fri Oct 9 17:57:14 2009 From: scooter at cgl.ucsf.edu (Scooter Morris) Date: Fri, 09 Oct 2009 10:57:14 -0700 Subject: [Linux-cluster] gfs2_tool settune demote_secs In-Reply-To: References: <4ACF6AEA.1000901@cgl.ucsf.edu> <1255107696.6052.564.camel@localhost.localdomain> Message-ID: <4ACF797A.8070609@cgl.ucsf.edu> Steve, Thanks for the prompt reply. Like Kaerka, I'm running on large-memory servers and decreasing demote_secs from 300 to 20 resulted in significant performance improvements because locks get freed much more quickly (I assume), resulting in much better response. It could certainly be that changing demote_secs was a workaround for a different bug that has now been fixed, which would be great. I'll try some tests today and see how "rm -rf" on a large directory behaves. -- scooter Kaerka Phillips wrote: > If in gfs2 glocks are purged based upon memory constraints, what > happens if it is run on a box with large amounts of memory? i.e. > RHEL5.x with 128gb ram? We ended up having to move away from GFS2 due > to serious performance issues with this exact setup, and our > performance issues were largely centered around commands like ls or rm > against gfs2 filesystems with large directory structures and millions > of files in them. > > In our case, something as simple as copying a whole filesystem to > another filesystem would cause a load avg of 50 or more, and would > take 8+ hours to complete. The same thing on NFS or ext3 would take > usually 1 to 2 hours. Netbackup of 10 of those filesystems took ~40 > hours to complete, so we were getting maybe 1 good backup per week, > and in some cases the backup itself caused cluster crash. 
> > We are still using our GFS1 clusters, since as long as their network > is stable, their performance is very good, but we are phasing out most > of our GFS2 clusters to NFS instead. > > On Fri, Oct 9, 2009 at 1:01 PM, Steven Whitehouse > wrote: > > Hi, > > On Fri, 2009-10-09 at 09:55 -0700, Scooter Morris wrote: > > Hi all, > > On RHEL 5.3/5.4(?) we had changed the value of demote_secs to > > significantly improve the performance of our gfs2 filesystem for > certain > > tasks (notably rm -r on large directories). I recently noticed that > > that tuning value is no longer available (part of a recent > update, or > > part of 5.4?). Can someone tell me what, if anything replaces > this? Is > > it now a mount option, or is there some other way to tune this > value? > > > > Thanks in advance. > > > > -- scooter > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > Nothing replaces it. The glocks are disposed of automatically on > an LRU > basis when there is enough memory pressure to require it. You can > alter > the amount of memory pressure on the VFS caches (including the glocks) > but not specifically the glocks themselves. > > The idea is that is should be self-tuning now, adjusting itself to the > conditions prevailing at the time. If there are any remaining > performance issues though, we'd like to know so that they can be > addressed, > > Steve. > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > > ------------------------------------------------------------------------ > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: scooter.vcf Type: text/x-vcard Size: 378 bytes Desc: not available URL: From linux at alteeve.com Fri Oct 9 18:20:32 2009 From: linux at alteeve.com (Madison Kelly) Date: Fri, 09 Oct 2009 14:20:32 -0400 Subject: [Linux-cluster] OpenAIS issue on CentOS 5.3 In-Reply-To: <8b711df40910091017x3afc2cafjd9249dc6b0e93333@mail.gmail.com> References: <4ACF6F26.6020309@alteeve.com> <8b711df40910091017x3afc2cafjd9249dc6b0e93333@mail.gmail.com> Message-ID: <4ACF7EF0.2000003@alteeve.com> Paras pradhan wrote: > See this thread > > http://www.mail-archive.com/linux-cluster at redhat.com/msg07020.html > > You need do downgrade CMAN > > Paras. Paras, you are awesome, thank you. For the record though, instead of downgrading cman, I installed an updated version of openais: http://people.centos.org/z00dax/misc/c53/x86_64/RPMS/openais-0.80.6-8.el5.x86_64.rpm This seemed to fix the problem and will save me from having to lock the old version of cman. Thanks! Madi From scooter at cgl.ucsf.edu Fri Oct 9 21:35:26 2009 From: scooter at cgl.ucsf.edu (Scooter Morris) Date: Fri, 09 Oct 2009 14:35:26 -0700 Subject: [Linux-cluster] gfs2 quotas Message-ID: <4ACFAC9E.6010606@cgl.ucsf.edu> Hi all, I'm trying to set up quotas on my gfs2 filesystems, and I'm running into some problems. I enabled quotas (quota=on), did a gfs2_quota reset and a gfs2_quota init. Now, I'm getting warning messages: warning: quota file size not a multiple of struct gfs2_quota and Warning: This filesystem doesn't seem to have the new quota list format or the quota list is corrupt. 
list, check and init operation performance will suffer due to this. It is recommended that you run the 'gfs2_quota reset' operation to reset the quota file. All current quota information will be lost and you will have to reassign all quota limits and warnings I've tried doing a reset/init/check, fsck.gfs2, etc., but I still get the same warnings. Am I doing something wrong? RHEL 5.4: Linux 2.6.18-164.el5 gfs2-utils-0.1.62-1.el5 -- scooter -------------- next part -------------- A non-text attachment was scrubbed... Name: scooter.vcf Type: text/x-vcard Size: 378 bytes Desc: not available URL: From adas at redhat.com Fri Oct 9 22:15:19 2009 From: adas at redhat.com (Abhijith Das) Date: Fri, 9 Oct 2009 18:15:19 -0400 (EDT) Subject: [Linux-cluster] gfs2 quotas In-Reply-To: <4ACFAC9E.6010606@cgl.ucsf.edu> Message-ID: <660723029.1824891255126519705.JavaMail.root@zmail05.collab.prod.int.phx2.redhat.com> Hi Scooter, The 'gfs2_quota reset' command should truncate the hidden quota file down to just root uid/gid quotas and these warnings should cease. Seems like that's not happening for you. I'd like to take a look at the hidden quota file (if possible, the filesystem itself) to see what's going on. Please open a RH bugzilla for this problem. If the filesystem is not too big and you can share it, please compress+upload it someplace and send me the link to it. If you can't share the filesystem, I'd like to atleast look at the hidden quota file. Please do the following to retrieve that. I must warn you that this *might* hang/crash the filesystem, and you might have to restart your machine. 1. Mount the gfs2 filesystem (say at /mnt/gfs2) 2. Mount the gfs2meta filesystem - mount -t gfs2meta /mnt/gfs2 /tmp/.gfs2meta 3. ls -l /tmp/.gfs2meta/quota Check that the above file is not unreasonably large (there used to be a known issue where this file could get huge (100s of GB). It doesn't actually use all that space as it's a sparse file. I believe this was fixed at one point, I just don't know if you have the fix or not) 4. If the above file is small, just do 'cp /tmp/.gfs2meta/ /tmp/foo' 5. umount /tmp/.gfs2meta && umount /mnt/gfs2 Please compress and email me foo. Thanks! --Abhi ----- Original Message ----- From: "Scooter Morris" To: "linux clustering" Sent: Friday, October 9, 2009 4:35:26 PM GMT -06:00 US/Canada Central Subject: [Linux-cluster] gfs2 quotas Hi all, I'm trying to set up quotas on my gfs2 filesystems, and I'm running into some problems. I enabled quotas (quota=on), did a gfs2_quota reset and a gfs2_quota init. Now, I'm getting warning messages: warning: quota file size not a multiple of struct gfs2_quota and Warning: This filesystem doesn't seem to have the new quota list format or the quota list is corrupt. list, check and init operation performance will suffer due to this. It is recommended that you run the 'gfs2_quota reset' operation to reset the quota file. All current quota information will be lost and you will have to reassign all quota limits and warnings I've tried doing a reset/init/check, fsck.gfs2, etc., but I still get the same warnings. Am I doing something wrong? 
RHEL 5.4: Linux 2.6.18-164.el5 gfs2-utils-0.1.62-1.el5 -- scooter -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From linux at alteeve.com Sat Oct 10 19:01:46 2009 From: linux at alteeve.com (Madison Kelly) Date: Sat, 10 Oct 2009 15:01:46 -0400 Subject: [Linux-cluster] Home-brew SAN/iSCSI Message-ID: <4AD0DA1A.8020804@alteeve.com> Hi all, Until now, I've been building 2-node clusters using DRBD+LVM for the shared storage. I've been teaching myself clustering, so I don't have a world of capital to sink into hardware at the moment. I would like to start getting some experience with 3+ nodes using a central SAN disk. So I've been pricing out the minimal hardware for a four-node cluster and have something to start with. My current hiccup though is the SAN side. I've searched around, but have not been able to get a clear answer. Is it possible to build a host machine (CentOS/Debian) to have a simple MD device and make it available to the cluster nodes as an iSCSI/SAN device? Being a learning exercise, I am not too worried about speed or redundancy (beyond testing failure types and recovery). Thanks for any insight, advice, pointers! Madi From linux-cluster at lists.grepular.com Sat Oct 10 19:07:02 2009 From: linux-cluster at lists.grepular.com (Mike Cardwell) Date: Sat, 10 Oct 2009 20:07:02 +0100 Subject: [Linux-cluster] Home-brew SAN/iSCSI In-Reply-To: <4AD0DA1A.8020804@alteeve.com> References: <4AD0DA1A.8020804@alteeve.com> Message-ID: <4AD0DB56.4080303@lists.grepular.com> Madison Kelly wrote: > Until now, I've been building 2-node clusters using DRBD+LVM for the > shared storage. I've been teaching myself clustering, so I don't have a > world of capital to sink into hardware at the moment. I would like to > start getting some experience with 3+ nodes using a central SAN disk. > > So I've been pricing out the minimal hardware for a four-node cluster > and have something to start with. My current hiccup though is the SAN > side. I've searched around, but have not been able to get a clear answer. > > Is it possible to build a host machine (CentOS/Debian) to have a > simple MD device and make it available to the cluster nodes as an > iSCSI/SAN device? Being a learning exercise, I am not too worried about > speed or redundancy (beyond testing failure types and recovery). Yeah, that's possible. Just use iscsid to export the device. If this is just for testing/learning purposes have you considered using virtual machines to minimise the hardware footprint? You could have a single host machine that acts as the SAN, exporting a device using iscsid and three vm's running on top of VMWare server on the same machine which make up the cluster... -- Mike Cardwell - IT Consultant and LAMP developer Cardwell IT Ltd. (UK Reg'd Company #06920226) http://cardwellit.com/ Technical Blog: https://secure.grepular.com/blog/ From linux at alteeve.com Sat Oct 10 19:21:29 2009 From: linux at alteeve.com (Madison Kelly) Date: Sat, 10 Oct 2009 15:21:29 -0400 Subject: [Linux-cluster] Home-brew SAN/iSCSI In-Reply-To: <4AD0DB56.4080303@lists.grepular.com> References: <4AD0DA1A.8020804@alteeve.com> <4AD0DB56.4080303@lists.grepular.com> Message-ID: <4AD0DEB9.9010605@alteeve.com> Mike Cardwell wrote: > Madison Kelly wrote: > >> Until now, I've been building 2-node clusters using DRBD+LVM for the >> shared storage. I've been teaching myself clustering, so I don't have >> a world of capital to sink into hardware at the moment. 
I would like >> to start getting some experience with 3+ nodes using a central SAN disk. >> >> So I've been pricing out the minimal hardware for a four-node >> cluster and have something to start with. My current hiccup though is >> the SAN side. I've searched around, but have not been able to get a >> clear answer. >> >> Is it possible to build a host machine (CentOS/Debian) to have a >> simple MD device and make it available to the cluster nodes as an >> iSCSI/SAN device? Being a learning exercise, I am not too worried >> about speed or redundancy (beyond testing failure types and recovery). > > Yeah, that's possible. Just use iscsid to export the device. If this is > just for testing/learning purposes have you considered using virtual > machines to minimise the hardware footprint? You could have a single > host machine that acts as the SAN, exporting a device using iscsid and > three vm's running on top of VMWare server on the same machine which > make up the cluster... Thanks! I was thinking that was what I could do, but I wanted to ask before sinking a lot of time/money just to find out I was wrong. :) I thought about Xen VMs. I'll have to see if I can simulate things like fence devices and such. Though, as good as virtualization is, I wonder how close I could get to simulating real world? When I run into problems, it would be another layer to wonder about. However, there is no denying the cost savings! I will look into that more. Madi From andrew at ntsg.umt.edu Sat Oct 10 19:32:34 2009 From: andrew at ntsg.umt.edu (Andrew A. Neuschwander) Date: Sat, 10 Oct 2009 13:32:34 -0600 Subject: [Linux-cluster] Home-brew SAN/iSCSI In-Reply-To: <4AD0DA1A.8020804@alteeve.com> References: <4AD0DA1A.8020804@alteeve.com> Message-ID: <4AD0E152.6040205@ntsg.umt.edu> Madison Kelly wrote: > Hi all, > > Until now, I've been building 2-node clusters using DRBD+LVM for the > shared storage. I've been teaching myself clustering, so I don't have a > world of capital to sink into hardware at the moment. I would like to > start getting some experience with 3+ nodes using a central SAN disk. > > So I've been pricing out the minimal hardware for a four-node cluster > and have something to start with. My current hiccup though is the SAN > side. I've searched around, but have not been able to get a clear answer. > > Is it possible to build a host machine (CentOS/Debian) to have a > simple MD device and make it available to the cluster nodes as an > iSCSI/SAN device? Being a learning exercise, I am not too worried about > speed or redundancy (beyond testing failure types and recovery). > > Thanks for any insight, advice, pointers! > > Madi > If you want to use a Linux host as a iscsi 'server' (a target in iscsi terminiology), you can use IET, the iSCSI Enterprise Target: http://iscsitarget.sourceforge.net/. I've used it and it works well, but it is a little CPU hungry. Obviously, you don't get the benefits of a hardware SAN, but you don't get the cost either. -Andrew -- Andrew A. 
Neuschwander, RHCE Systems/Software Engineer College of Forestry and Conservation The University of Montana http://www.ntsg.umt.edu andrew at ntsg.umt.edu - 406.243.6310 > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From linux at alteeve.com Sat Oct 10 19:41:33 2009 From: linux at alteeve.com (Madison Kelly) Date: Sat, 10 Oct 2009 15:41:33 -0400 Subject: [Linux-cluster] Home-brew SAN/iSCSI In-Reply-To: <4AD0E152.6040205@ntsg.umt.edu> References: <4AD0DA1A.8020804@alteeve.com> <4AD0E152.6040205@ntsg.umt.edu> Message-ID: <4AD0E36D.6090408@alteeve.com> Andrew A. Neuschwander wrote: > Madison Kelly wrote: >> Hi all, >> >> Until now, I've been building 2-node clusters using DRBD+LVM for the >> shared storage. I've been teaching myself clustering, so I don't have >> a world of capital to sink into hardware at the moment. I would like >> to start getting some experience with 3+ nodes using a central SAN disk. >> >> So I've been pricing out the minimal hardware for a four-node >> cluster and have something to start with. My current hiccup though is >> the SAN side. I've searched around, but have not been able to get a >> clear answer. >> >> Is it possible to build a host machine (CentOS/Debian) to have a >> simple MD device and make it available to the cluster nodes as an >> iSCSI/SAN device? Being a learning exercise, I am not too worried >> about speed or redundancy (beyond testing failure types and recovery). >> >> Thanks for any insight, advice, pointers! >> >> Madi >> > > If you want to use a Linux host as a iscsi 'server' (a target in iscsi > terminiology), you can use IET, the iSCSI Enterprise Target: > http://iscsitarget.sourceforge.net/. I've used it and it works well, but > it is a little CPU hungry. Obviously, you don't get the benefits of a > hardware SAN, but you don't get the cost either. > > -Andrew Thanks, Andrew! I'll go look at that now. I was planning on building my SAN server on an core2duo-based system with 2GB of RAM. I figured that the server will do nothing but host/handle the SAN/iSCSI stuff, so the CPU consumption should be fine. Is there a way to quantify the "CPU/Memory hungry"-ness of running a SAN box? Ie: what does a given read/write/etc call "cost"? As an aside, beyond hot-swap/bandwidth/quality, what generally is the "advantage" of dedicated SAN/iSCSI hardware vs. white box roll-your-own? Thanks again! Madi From andrew at ntsg.umt.edu Sat Oct 10 19:55:33 2009 From: andrew at ntsg.umt.edu (Andrew A. Neuschwander) Date: Sat, 10 Oct 2009 13:55:33 -0600 Subject: [Linux-cluster] Home-brew SAN/iSCSI In-Reply-To: <4AD0E36D.6090408@alteeve.com> References: <4AD0DA1A.8020804@alteeve.com> <4AD0E152.6040205@ntsg.umt.edu> <4AD0E36D.6090408@alteeve.com> Message-ID: <4AD0E6B5.50407@ntsg.umt.edu> Madison Kelly wrote: > Andrew A. Neuschwander wrote: >> Madison Kelly wrote: >>> Hi all, >>> >>> Until now, I've been building 2-node clusters using DRBD+LVM for >>> the shared storage. I've been teaching myself clustering, so I don't >>> have a world of capital to sink into hardware at the moment. I would >>> like to start getting some experience with 3+ nodes using a central >>> SAN disk. >>> >>> So I've been pricing out the minimal hardware for a four-node >>> cluster and have something to start with. My current hiccup though is >>> the SAN side. I've searched around, but have not been able to get a >>> clear answer. 
>>> >>> Is it possible to build a host machine (CentOS/Debian) to have a >>> simple MD device and make it available to the cluster nodes as an >>> iSCSI/SAN device? Being a learning exercise, I am not too worried >>> about speed or redundancy (beyond testing failure types and recovery). >>> >>> Thanks for any insight, advice, pointers! >>> >>> Madi >>> >> >> If you want to use a Linux host as a iscsi 'server' (a target in iscsi >> terminiology), you can use IET, the iSCSI Enterprise Target: >> http://iscsitarget.sourceforge.net/. I've used it and it works well, >> but it is a little CPU hungry. Obviously, you don't get the benefits >> of a hardware SAN, but you don't get the cost either. >> >> -Andrew > > Thanks, Andrew! I'll go look at that now. > > I was planning on building my SAN server on an core2duo-based system > with 2GB of RAM. I figured that the server will do nothing but > host/handle the SAN/iSCSI stuff, so the CPU consumption should be fine. > Is there a way to quantify the "CPU/Memory hungry"-ness of running a SAN > box? Ie: what does a given read/write/etc call "cost"? > > As an aside, beyond hot-swap/bandwidth/quality, what generally is the > "advantage" of dedicated SAN/iSCSI hardware vs. white box roll-your-own? > > Thanks again! > > Madi > I think what makes being an iSCSI target CPU hungry is that it is handling a block layer protocol in user space. So while what it does is fairly simple (i.e. no filesystem), it has to do a lot of it. Storage performance is usually discussed in IOPS (I/Os Per Second), but when rolling my own, I just throw enough spindles/raid/cpu/memory at it saturate a GigE link and call it a day. I've not used a hardware iSCSI SAN, just FC. The biggest benefits, in my mind, of something like an EMC Clariion are the fully redundant hardware path and the fast fabric. Hmm, I may be getting off-topic here. Sorry about that. -Andrew > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From corey.kovacs at gmail.com Sat Oct 10 20:05:32 2009 From: corey.kovacs at gmail.com (Corey Kovacs) Date: Sat, 10 Oct 2009 21:05:32 +0100 Subject: [Linux-cluster] Home-brew SAN/iSCSI In-Reply-To: <4AD0E36D.6090408@alteeve.com> References: <4AD0DA1A.8020804@alteeve.com> <4AD0E152.6040205@ntsg.umt.edu> <4AD0E36D.6090408@alteeve.com> Message-ID: <7d6e8da40910101305p60f0e824r86d1928fd2c17edc@mail.gmail.com> One thing to keep in mind is the fact that iscsi is a tcp based protocol. So even though your machine might be doing nothing but acting as an iscsi target, it's going to take the brunt of the load in handling the tcp stack. If you can get a network card that handles iscsi on the card itself, that will help loads. Otherwise your cpu might dig a hole for itself to crawl into. Of course if your just messing about, or only using the iscsi targets locally, then your probably ok. Benefits of a dedicated device are management capabilities, throughput, flexible location, etc. Fibre channel is 8Gb standard now and SAN's are starting to use it instead of 4Gb, but the entry point in terms of cost is high. A fully loaded EVA8100 can cost 250k, the FC infrastructure can go to 60-80k easily. iscsi really needs to have a seperate back end storage network to be useful and it should be 10Gb. I hear people say it's useful on slower hardware but everyone has an opinion. I guess if your just using it for system volumes and low IO then 1G might be fine. 
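As a concrete (if rough) sketch of the IET route Andrew mentioned earlier: export the MD device from the storage box and log in from a node with the stock iscsi-initiator-utils. The IQN, portal address and /dev/md0 backing device are placeholders, and the config path can vary between IET versions (usually /etc/ietd.conf), so this is illustrative rather than a tuned setup.

  # /etc/ietd.conf on the storage box (iSCSI Enterprise Target)
  Target iqn.2009-10.com.example:storage.md0
      Lun 0 Path=/dev/md0,Type=fileio
  # then start the target daemon via the init script shipped with the IET package

  # on each cluster node (open-iscsi / iscsi-initiator-utils)
  iscsiadm -m discovery -t sendtargets -p 192.168.10.1
  iscsiadm -m node -T iqn.2009-10.com.example:storage.md0 -p 192.168.10.1 --login

For a learning setup, a dedicated GigE segment for this traffic, as Corey suggests, should be plenty.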
Anyway hope this help and if it doesnt' at least it might give you more to think about. Best of luck Corey On Sat, Oct 10, 2009 at 8:41 PM, Madison Kelly wrote: > Andrew A. Neuschwander wrote: > >> Madison Kelly wrote: >> >>> Hi all, >>> >>> Until now, I've been building 2-node clusters using DRBD+LVM for the >>> shared storage. I've been teaching myself clustering, so I don't have a >>> world of capital to sink into hardware at the moment. I would like to start >>> getting some experience with 3+ nodes using a central SAN disk. >>> >>> So I've been pricing out the minimal hardware for a four-node cluster >>> and have something to start with. My current hiccup though is the SAN side. >>> I've searched around, but have not been able to get a clear answer. >>> >>> Is it possible to build a host machine (CentOS/Debian) to have a simple >>> MD device and make it available to the cluster nodes as an iSCSI/SAN device? >>> Being a learning exercise, I am not too worried about speed or redundancy >>> (beyond testing failure types and recovery). >>> >>> Thanks for any insight, advice, pointers! >>> >>> Madi >>> >>> >> If you want to use a Linux host as a iscsi 'server' (a target in iscsi >> terminiology), you can use IET, the iSCSI Enterprise Target: >> http://iscsitarget.sourceforge.net/. I've used it and it works well, but >> it is a little CPU hungry. Obviously, you don't get the benefits of a >> hardware SAN, but you don't get the cost either. >> >> -Andrew >> > > Thanks, Andrew! I'll go look at that now. > > I was planning on building my SAN server on an core2duo-based system with > 2GB of RAM. I figured that the server will do nothing but host/handle the > SAN/iSCSI stuff, so the CPU consumption should be fine. Is there a way to > quantify the "CPU/Memory hungry"-ness of running a SAN box? Ie: what does a > given read/write/etc call "cost"? > > As an aside, beyond hot-swap/bandwidth/quality, what generally is the > "advantage" of dedicated SAN/iSCSI hardware vs. white box roll-your-own? > > Thanks again! > > Madi > > -- > > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From pasik at iki.fi Sun Oct 11 11:47:58 2009 From: pasik at iki.fi (Pasi =?iso-8859-1?Q?K=E4rkk=E4inen?=) Date: Sun, 11 Oct 2009 14:47:58 +0300 Subject: [Linux-cluster] Home-brew SAN/iSCSI In-Reply-To: <7d6e8da40910101305p60f0e824r86d1928fd2c17edc@mail.gmail.com> References: <4AD0DA1A.8020804@alteeve.com> <4AD0E152.6040205@ntsg.umt.edu> <4AD0E36D.6090408@alteeve.com> <7d6e8da40910101305p60f0e824r86d1928fd2c17edc@mail.gmail.com> Message-ID: <20091011114758.GV1434@reaktio.net> On Sat, Oct 10, 2009 at 09:05:32PM +0100, Corey Kovacs wrote: > One thing to keep in mind is the fact that iscsi is a tcp based protocol. > So even though your machine might be doing nothing but acting as an iscsi > target, it's going to take the brunt of the load in handling the tcp > stack. If you can get a network card that handles iscsi on the card > itself, that will help loads. Otherwise your cpu might dig a hole for > itself to crawl into. > I don't think iSCSI HBA drivers to use in the _target_ are publicly available. iSCSI HBA's in the initiator (client) are supported of course. Then again most NICs nowadays offload TCP/IP, and most can also offload iSCSI.. HBAs are getting legacy stuff. > Of course if your just messing about, or only using the iscsi targets > locally, then your probably ok. 
> > Benefits of a dedicated device are management capabilities, throughput, > flexible location, etc. Fibre channel is 8Gb standard now and SAN's are > starting to use it instead of 4Gb, but the entry point in terms of cost is > high. A fully loaded EVA8100 can cost 250k, the FC infrastructure can go > to 60-80k easily. iscsi really needs to have a seperate back end storage > network to be useful and it should be 10Gb. I hear people say it's useful > on slower hardware but everyone has an opinion. I guess if your just using > it for system volumes and low IO then 1G might be fine. > 1G iSCSI works very well for many workloads, depending mostly on your storage/target setup. 1G link can handle a lot of random IOs.. you're most probably limited by the amount of disk spindles anyway. FC is getting legacy aswell.. IMHO :) -- Pasi > Anyway hope this help and if it doesnt' at least it might give you more to > think about. > > Best of luck > > Corey > > On Sat, Oct 10, 2009 at 8:41 PM, Madison Kelly <[1]linux at alteeve.com> > wrote: > > Andrew A. Neuschwander wrote: > > Madison Kelly wrote: > > Hi all, > > Until now, I've been building 2-node clusters using DRBD+LVM for > the shared storage. I've been teaching myself clustering, so I don't > have a world of capital to sink into hardware at the moment. I would > like to start getting some experience with 3+ nodes using a central > SAN disk. > > So I've been pricing out the minimal hardware for a four-node > cluster and have something to start with. My current hiccup though > is the SAN side. I've searched around, but have not been able to get > a clear answer. > > Is it possible to build a host machine (CentOS/Debian) to have a > simple MD device and make it available to the cluster nodes as an > iSCSI/SAN device? Being a learning exercise, I am not too worried > about speed or redundancy (beyond testing failure types and > recovery). > > Thanks for any insight, advice, pointers! > > Madi > > If you want to use a Linux host as a iscsi 'server' (a target in iscsi > terminiology), you can use IET, the iSCSI Enterprise Target: > [2]http://iscsitarget.sourceforge.net/. I've used it and it works > well, but it is a little CPU hungry. Obviously, you don't get the > benefits of a hardware SAN, but you don't get the cost either. > > -Andrew > > Thanks, Andrew! I'll go look at that now. > > I was planning on building my SAN server on an core2duo-based system > with 2GB of RAM. I figured that the server will do nothing but > host/handle the SAN/iSCSI stuff, so the CPU consumption should be fine. > Is there a way to quantify the "CPU/Memory hungry"-ness of running a SAN > box? Ie: what does a given read/write/etc call "cost"? > > As an aside, beyond hot-swap/bandwidth/quality, what generally is the > "advantage" of dedicated SAN/iSCSI hardware vs. white box roll-your-own? > > Thanks again! > > Madi > -- > Linux-cluster mailing list > [3]Linux-cluster at redhat.com > [4]https://www.redhat.com/mailman/listinfo/linux-cluster > > References > > Visible links > 1. mailto:linux at alteeve.com > 2. http://iscsitarget.sourceforge.net/ > 3. mailto:Linux-cluster at redhat.com > 4. 
https://www.redhat.com/mailman/listinfo/linux-cluster > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From rvandolson at esri.com Sun Oct 11 13:22:56 2009 From: rvandolson at esri.com (Ray Van Dolson) Date: Sun, 11 Oct 2009 06:22:56 -0700 Subject: [Linux-cluster] Home-brew SAN/iSCSI In-Reply-To: <4AD0DEB9.9010605@alteeve.com> References: <4AD0DA1A.8020804@alteeve.com> <4AD0DB56.4080303@lists.grepular.com> <4AD0DEB9.9010605@alteeve.com> Message-ID: <20091011132256.GA6461@esri.com> On Sat, Oct 10, 2009 at 12:21:29PM -0700, Madison Kelly wrote: > Mike Cardwell wrote: > > Madison Kelly wrote: > > > >> Until now, I've been building 2-node clusters using DRBD+LVM for the > >> shared storage. I've been teaching myself clustering, so I don't have > >> a world of capital to sink into hardware at the moment. I would like > >> to start getting some experience with 3+ nodes using a central SAN disk. > >> > >> So I've been pricing out the minimal hardware for a four-node > >> cluster and have something to start with. My current hiccup though is > >> the SAN side. I've searched around, but have not been able to get a > >> clear answer. > >> > >> Is it possible to build a host machine (CentOS/Debian) to have a > >> simple MD device and make it available to the cluster nodes as an > >> iSCSI/SAN device? Being a learning exercise, I am not too worried > >> about speed or redundancy (beyond testing failure types and recovery). > > > > Yeah, that's possible. Just use iscsid to export the device. If this is > > just for testing/learning purposes have you considered using virtual > > machines to minimise the hardware footprint? You could have a single > > host machine that acts as the SAN, exporting a device using iscsid and > > three vm's running on top of VMWare server on the same machine which > > make up the cluster... > > Thanks! I was thinking that was what I could do, but I wanted to ask > before sinking a lot of time/money just to find out I was wrong. :) > > I thought about Xen VMs. I'll have to see if I can simulate things like > fence devices and such. Though, as good as virtualization is, I wonder > how close I could get to simulating real world? When I run into > problems, it would be another layer to wonder about. However, there is > no denying the cost savings! I will look into that more. Another option, at least with VMware, would be to create a shared disk that can be seen by all your VM's. A bit simpler than setting up iSCSI, though that would be a good thing to learn in it of itself... Ray From kkovachev at varna.net Mon Oct 12 10:07:13 2009 From: kkovachev at varna.net (Kaloyan Kovachev) Date: Mon, 12 Oct 2009 13:07:13 +0300 Subject: [Linux-cluster] gfs2_tool settune demote_secs In-Reply-To: <1255107696.6052.564.camel@localhost.localdomain> References: <4ACF6AEA.1000901@cgl.ucsf.edu> <1255107696.6052.564.camel@localhost.localdomain> Message-ID: <20091012094530.M26441@varna.net> Hi, On Fri, 09 Oct 2009 18:01:36 +0100, Steven Whitehouse wrote > Hi, > > The idea is that is should be self-tuning now, adjusting itself to the > conditions prevailing at the time. If there are any remaining > performance issues though, we'd like to know so that they can be > addressed, > I have noticed a possible performance issue while experimenting with ping_pong, but the test is representing normal operation. The setup: 3 node cluster (Node1, Node2 and Node3) with shared GFS2 partition 1. 
Starting ping_pong on one of the nodes (Node1) i get several thousands (30k+) of locks per second 2. Stopping it after a while and immediately starting (moving) it to Node2 (just like a shared service resource after failover) the number of locks goes below 2000 Probably because the locks are held on Node1, but then even after hours it does not go back to 30k+ locks per second and stays at <2000 3. Stopping ping_pong on Node2 and starting it again on the same or another node (Node1 or Node3) after 10-20min there are again 30k+ locks per second Not sure if demote_secs would help, because i can't test, but it would be great to have the locks released from Node1 to Node2 after some time at step 2. not 3. > Steve. > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From swhiteho at redhat.com Mon Oct 12 10:07:44 2009 From: swhiteho at redhat.com (Steven Whitehouse) Date: Mon, 12 Oct 2009 11:07:44 +0100 Subject: [Linux-cluster] gfs2_tool settune demote_secs In-Reply-To: <4ACF797A.8070609@cgl.ucsf.edu> References: <4ACF6AEA.1000901@cgl.ucsf.edu> <1255107696.6052.564.camel@localhost.localdomain> <4ACF797A.8070609@cgl.ucsf.edu> Message-ID: <1255342064.2675.43.camel@localhost.localdomain> Hi, On Fri, 2009-10-09 at 10:57 -0700, Scooter Morris wrote: > Steve, > Thanks for the prompt reply. Like Kaerka, I'm running on > large-memory servers and decreasing demote_secs from 300 to 20 > resulted in significant performance improvements because locks get > freed much more quickly (I assume), resulting in much better response. > It could certainly be that changing demote_secs was a workaround for a > different bug that has now been fixed, which would be great. I'll try > some tests today and see how "rm -rf" on a large directory behaves. > > -- scooter > The question though, is why that should result in a better response. It doesn't really make sense, since the caching of the "locks" (really caching of data and metadata controlled by a lock) should improve the performance due to more time to write out the dirty data. Doing an "rm -fr" is also a very different workload to that of reading all the files in the filesystem once (for backup purposes for example) since the "rm -fr" requires writing to the fs and the backup process doesn't do any writing. How long it takes to remove a file also depends to a large extent on its size. In both cases, however it would improve performance if you could arrange to remove, or read inodes in inode number order. Both GFS and GFS2 return inodes from getdents64 (readdir) in a pseudo-random order based on the hash of the filename. You can gain a lot of performance if these results are sorted before they are scanned. Ideally we'd return them from the fs in sorted order. Unfortunately a design decision which was made a long time ago which, in combination with the design of the Linux VFS prevents us from doing that. If there is a problem with a node caching the whole filesystem after it has been scanned, then it is still possible to solve this issue: echo 3 > /proc/sys/vm/drop_caches I guess I should also point out that it is a good idea to mount with the noatime mount option if there is going to be a read-only scan of the complete filesystem on a regular basis, since that will prevent that becoming a "write to every inode" scan. That will also make a big performance difference. 
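To make the two suggestions above concrete, a minimal sketch (the volume and mount point names are made up):

    # /etc/fstab entry mounting the GFS2 filesystem without atime updates
    /dev/vg_cluster/lv_gfs2   /data   gfs2   defaults,noatime,nodiratime   0 0

    # after a full read-only scan (e.g. a backup run), release cached dentries,
    # inodes and page cache so one node does not keep the whole fs cached
    sync
    echo 3 > /proc/sys/vm/drop_caches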
Note that its ok (in recent kernels) to mount a GFS2 filesystem more than once with different atime flags (using bind mounts) in case you have an application which requires atime, but you want to avoid it when running a back up. There is also /proc/sys/vm/vfs_cache_pressure as well, which may help optimise your workload. ... and if all that fails, then the next thing to do is to use blktrace/seekwatcher to find out whats really going on, on the disk and send the results so that we can have a look and see if we can improve the disk I/O. Better still if you can combine that with a trace from the gfs2 tracepoints so we can see the locking at the same time, Steve. > Kaerka Phillips wrote: > > If in gfs2 glocks are purged based upon memory constraints, what > > happens if it is run on a box with large amounts of memory? i.e. > > RHEL5.x with 128gb ram? We ended up having to move away from GFS2 > > due to serious performance issues with this exact setup, and our > > performance issues were largely centered around commands like ls or > > rm against gfs2 filesystems with large directory structures and > > millions of files in them. > > > > In our case, something as simple as copying a whole filesystem to > > another filesystem would cause a load avg of 50 or more, and would > > take 8+ hours to complete. The same thing on NFS or ext3 would take > > usually 1 to 2 hours. Netbackup of 10 of those filesystems took ~40 > > hours to complete, so we were getting maybe 1 good backup per week, > > and in some cases the backup itself caused cluster crash. > > > > We are still using our GFS1 clusters, since as long as their network > > is stable, their performance is very good, but we are phasing out > > most of our GFS2 clusters to NFS instead. > > > > On Fri, Oct 9, 2009 at 1:01 PM, Steven Whitehouse > > wrote: > > Hi, > > > > On Fri, 2009-10-09 at 09:55 -0700, Scooter Morris wrote: > > > Hi all, > > > On RHEL 5.3/5.4(?) we had changed the value of > > demote_secs to > > > significantly improve the performance of our gfs2 > > filesystem for certain > > > tasks (notably rm -r on large directories). I recently > > noticed that > > > that tuning value is no longer available (part of a recent > > update, or > > > part of 5.4?). Can someone tell me what, if anything > > replaces this? Is > > > it now a mount option, or is there some other way to tune > > this value? > > > > > > Thanks in advance. > > > > > > -- scooter > > > > > > > > -- > > > Linux-cluster mailing list > > > Linux-cluster at redhat.com > > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > Nothing replaces it. The glocks are disposed of > > automatically on an LRU > > basis when there is enough memory pressure to require it. > > You can alter > > the amount of memory pressure on the VFS caches (including > > the glocks) > > but not specifically the glocks themselves. > > > > The idea is that is should be self-tuning now, adjusting > > itself to the > > conditions prevailing at the time. If there are any > > remaining > > performance issues though, we'd like to know so that they > > can be > > addressed, > > > > Steve. 
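For completeness, the vfs_cache_pressure knob mentioned above is just a sysctl; 100 is the default, and higher values make the kernel reclaim the VFS caches (which, as noted, includes the glocks) more aggressively. The value 200 here is only an example:

    sysctl -w vm.vfs_cache_pressure=200
    # to persist it across reboots, add the following line to /etc/sysctl.conf:
    #   vm.vfs_cache_pressure = 200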
> > > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > > > ____________________________________________________________________ > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From swhiteho at redhat.com Mon Oct 12 10:14:08 2009 From: swhiteho at redhat.com (Steven Whitehouse) Date: Mon, 12 Oct 2009 11:14:08 +0100 Subject: [Linux-cluster] gfs2_tool settune demote_secs In-Reply-To: <20091012094530.M26441@varna.net> References: <4ACF6AEA.1000901@cgl.ucsf.edu> <1255107696.6052.564.camel@localhost.localdomain> <20091012094530.M26441@varna.net> Message-ID: <1255342448.2675.47.camel@localhost.localdomain> Hi, On Mon, 2009-10-12 at 13:07 +0300, Kaloyan Kovachev wrote: > Hi, > > On Fri, 09 Oct 2009 18:01:36 +0100, Steven Whitehouse wrote > > Hi, > > > > The idea is that is should be self-tuning now, adjusting itself to the > > conditions prevailing at the time. If there are any remaining > > performance issues though, we'd like to know so that they can be > > addressed, > > > > I have noticed a possible performance issue while experimenting with > ping_pong, but the test is representing normal operation. > The ping_pong test uses fcntl() locks. These go through dlm_controld and are independent of the filesystem, whether you are using GFS/GFS2 (or maybe even OCFS2 now as well). So these are not the same as the glocks that the last message was referring to. > The setup: > 3 node cluster (Node1, Node2 and Node3) with shared GFS2 partition > > 1. Starting ping_pong on one of the nodes (Node1) i get several thousands > (30k+) of locks per second > > 2. Stopping it after a while and immediately starting (moving) it to Node2 > (just like a shared service resource after failover) the number of locks goes > below 2000 > Probably because the locks are held on Node1, but then even after hours it > does not go back to 30k+ locks per second and stays at <2000 > > 3. Stopping ping_pong on Node2 and starting it again on the same or another > node (Node1 or Node3) after 10-20min there are again 30k+ locks per second > > Not sure if demote_secs would help, because i can't test, but it would be > great to have the locks released from Node1 to Node2 after some time at step > 2. not 3. > What options have you got in your cluster.conf relating to plocks? What kernel are you using? Steve. From kkovachev at varna.net Mon Oct 12 11:06:23 2009 From: kkovachev at varna.net (Kaloyan Kovachev) Date: Mon, 12 Oct 2009 14:06:23 +0300 Subject: [Linux-cluster] gfs2_tool settune demote_secs In-Reply-To: <1255342448.2675.47.camel@localhost.localdomain> References: <4ACF6AEA.1000901@cgl.ucsf.edu> <1255107696.6052.564.camel@localhost.localdomain> <20091012094530.M26441@varna.net> <1255342448.2675.47.camel@localhost.localdomain> Message-ID: <20091012105805.M52875@varna.net> On Mon, 12 Oct 2009 11:14:08 +0100, Steven Whitehouse wrote > Hi, > > On Mon, 2009-10-12 at 13:07 +0300, Kaloyan Kovachev wrote: > > Hi, > > > > On Fri, 09 Oct 2009 18:01:36 +0100, Steven Whitehouse wrote > > > Hi, > > > > > > The idea is that is should be self-tuning now, adjusting itself to the > > > conditions prevailing at the time. 
If there are any remaining > > > performance issues though, we'd like to know so that they can be > > > addressed, > > > > > > > I have noticed a possible performance issue while experimenting with > > ping_pong, but the test is representing normal operation. > > > The ping_pong test uses fcntl() locks. These go through dlm_controld and > are independent of the filesystem, whether you are using GFS/GFS2 (or > maybe even OCFS2 now as well). So these are not the same as the glocks > that the last message was referring to. > > > The setup: > > 3 node cluster (Node1, Node2 and Node3) with shared GFS2 partition > > > > 1. Starting ping_pong on one of the nodes (Node1) i get several thousands > > (30k+) of locks per second > > > > 2. Stopping it after a while and immediately starting (moving) it to Node2 > > (just like a shared service resource after failover) the number of locks goes > > below 2000 > > Probably because the locks are held on Node1, but then even after hours it > > does not go back to 30k+ locks per second and stays at <2000 > > > > 3. Stopping ping_pong on Node2 and starting it again on the same or another > > node (Node1 or Node3) after 10-20min there are again 30k+ locks per second > > > > Not sure if demote_secs would help, because i can't test, but it would be > > great to have the locks released from Node1 to Node2 after some time at step > > 2. not 3. > > > What options have you got in your cluster.conf relating to plocks? What > kernel are you using? > As i am just testing there is still nothing in cluster.conf except the nodes definition, so they should be the defaults. The kernel is 2.6.31.1 > Steve. > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From kkovachev at varna.net Mon Oct 12 11:16:31 2009 From: kkovachev at varna.net (Kaloyan Kovachev) Date: Mon, 12 Oct 2009 14:16:31 +0300 Subject: [Linux-cluster] Home-brew SAN/iSCSI In-Reply-To: <4AD0E36D.6090408@alteeve.com> References: <4AD0DA1A.8020804@alteeve.com> <4AD0E152.6040205@ntsg.umt.edu> <4AD0E36D.6090408@alteeve.com> Message-ID: <20091012105211.M54720@varna.net> Hi, On Sat, 10 Oct 2009 15:41:33 -0400, Madison Kelly wrote > Andrew A. Neuschwander wrote: > > Madison Kelly wrote: > >> Hi all, > >> > >> Until now, I've been building 2-node clusters using DRBD+LVM for the > >> shared storage. I've been teaching myself clustering, so I don't have > >> a world of capital to sink into hardware at the moment. I would like > >> to start getting some experience with 3+ nodes using a central SAN disk. > >> > >> So I've been pricing out the minimal hardware for a four-node > >> cluster and have something to start with. My current hiccup though is > >> the SAN side. I've searched around, but have not been able to get a > >> clear answer. > >> > >> Is it possible to build a host machine (CentOS/Debian) to have a > >> simple MD device and make it available to the cluster nodes as an > >> iSCSI/SAN device? Being a learning exercise, I am not too worried > >> about speed or redundancy (beyond testing failure types and recovery). > >> > >> Thanks for any insight, advice, pointers! > >> > >> Madi > >> > > > > If you want to use a Linux host as a iscsi 'server' (a target in iscsi > > terminiology), you can use IET, the iSCSI Enterprise Target: > > http://iscsitarget.sourceforge.net/. I've used it and it works well, but > > it is a little CPU hungry. Obviously, you don't get the benefits of a > > hardware SAN, but you don't get the cost either. 
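As a concrete starting point for exporting a local MD device with IET, a couple of lines in /etc/ietd.conf are usually all that is needed -- the IQN and device path below are placeholders:

    Target iqn.2009-10.lab.example:storage.md0
        # export /dev/md0 as LUN 0; fileio goes through the page cache,
        # Type=blockio would do direct I/O to the device instead
        Lun 0 Path=/dev/md0,Type=fileio

After editing the file, restart the target (the init script is normally called iscsi-target) and the LUN should show up in a discovery from the initiator side.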
> > In addition to IET and iSCSI client you may want to take a look at multipath too. I am also testing a home-brew SAN setup with two storage machines and replicated via DRBD GFS2 partition (Primary/Primary) which in turn is exported from both via iSCSI and then imported on the nodes with multipath to both storages - the idea is to have the performance from both and no single point of failure for the storage too. > > -Andrew > > Thanks, Andrew! I'll go look at that now. > > I was planning on building my SAN server on an core2duo-based system > with 2GB of RAM. I figured that the server will do nothing but > host/handle the SAN/iSCSI stuff, so the CPU consumption should be fine. > Is there a way to quantify the "CPU/Memory hungry"-ness of running a SAN > box? Ie: what does a given read/write/etc call "cost"? > > As an aside, beyond hot-swap/bandwidth/quality, what generally is the > "advantage" of dedicated SAN/iSCSI hardware vs. white box roll-your-own? > > Thanks again! > > Madi > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From scooter at cgl.ucsf.edu Mon Oct 12 12:57:54 2009 From: scooter at cgl.ucsf.edu (Scooter Morris) Date: Mon, 12 Oct 2009 05:57:54 -0700 Subject: [Linux-cluster] gfs2_tool settune demote_secs In-Reply-To: <1255342064.2675.43.camel@localhost.localdomain> References: <4ACF6AEA.1000901@cgl.ucsf.edu> <1255107696.6052.564.camel@localhost.localdomain> <4ACF797A.8070609@cgl.ucsf.edu> <1255342064.2675.43.camel@localhost.localdomain> Message-ID: <4AD327D2.8010401@cgl.ucsf.edu> Steve, Thanks for the informative, and detailed response -- it really helps to understand what might be happening. We're not mounting with noatime, and it sounds like that would be a good first step. Thanks! -- scooter Steven Whitehouse wrote: > Hi, > > On Fri, 2009-10-09 at 10:57 -0700, Scooter Morris wrote: > >> Steve, >> Thanks for the prompt reply. Like Kaerka, I'm running on >> large-memory servers and decreasing demote_secs from 300 to 20 >> resulted in significant performance improvements because locks get >> freed much more quickly (I assume), resulting in much better response. >> It could certainly be that changing demote_secs was a workaround for a >> different bug that has now been fixed, which would be great. I'll try >> some tests today and see how "rm -rf" on a large directory behaves. >> >> -- scooter >> >> > The question though, is why that should result in a better response. It > doesn't really make sense, since the caching of the "locks" (really > caching of data and metadata controlled by a lock) should improve the > performance due to more time to write out the dirty data. > > Doing an "rm -fr" is also a very different workload to that of reading > all the files in the filesystem once (for backup purposes for example) > since the "rm -fr" requires writing to the fs and the backup process > doesn't do any writing. > > How long it takes to remove a file also depends to a large extent on its > size. > > In both cases, however it would improve performance if you could arrange > to remove, or read inodes in inode number order. Both GFS and GFS2 > return inodes from getdents64 (readdir) in a pseudo-random order based > on the hash of the filename. You can gain a lot of performance if these > results are sorted before they are scanned. > > Ideally we'd return them from the fs in sorted order. 
Unfortunately a > design decision which was made a long time ago which, in combination > with the design of the Linux VFS prevents us from doing that. > > If there is a problem with a node caching the whole filesystem after it > has been scanned, then it is still possible to solve this issue: > > echo 3 > /proc/sys/vm/drop_caches > > I guess I should also point out that it is a good idea to mount with the > noatime mount option if there is going to be a read-only scan of the > complete filesystem on a regular basis, since that will prevent that > becoming a "write to every inode" scan. That will also make a big > performance difference. Note that its ok (in recent kernels) to mount a > GFS2 filesystem more than once with different atime flags (using bind > mounts) in case you have an application which requires atime, but you > want to avoid it when running a back up. > > There is also /proc/sys/vm/vfs_cache_pressure as well, which may help > optimise your workload. > > ... and if all that fails, then the next thing to do is to use > blktrace/seekwatcher to find out whats really going on, on the disk and > send the results so that we can have a look and see if we can improve > the disk I/O. Better still if you can combine that with a trace from the > gfs2 tracepoints so we can see the locking at the same time, > > Steve. > > >> Kaerka Phillips wrote: >> >>> If in gfs2 glocks are purged based upon memory constraints, what >>> happens if it is run on a box with large amounts of memory? i.e. >>> RHEL5.x with 128gb ram? We ended up having to move away from GFS2 >>> due to serious performance issues with this exact setup, and our >>> performance issues were largely centered around commands like ls or >>> rm against gfs2 filesystems with large directory structures and >>> millions of files in them. >>> >>> In our case, something as simple as copying a whole filesystem to >>> another filesystem would cause a load avg of 50 or more, and would >>> take 8+ hours to complete. The same thing on NFS or ext3 would take >>> usually 1 to 2 hours. Netbackup of 10 of those filesystems took ~40 >>> hours to complete, so we were getting maybe 1 good backup per week, >>> and in some cases the backup itself caused cluster crash. >>> >>> We are still using our GFS1 clusters, since as long as their network >>> is stable, their performance is very good, but we are phasing out >>> most of our GFS2 clusters to NFS instead. >>> >>> On Fri, Oct 9, 2009 at 1:01 PM, Steven Whitehouse >>> wrote: >>> Hi, >>> >>> On Fri, 2009-10-09 at 09:55 -0700, Scooter Morris wrote: >>> > Hi all, >>> > On RHEL 5.3/5.4(?) we had changed the value of >>> demote_secs to >>> > significantly improve the performance of our gfs2 >>> filesystem for certain >>> > tasks (notably rm -r on large directories). I recently >>> noticed that >>> > that tuning value is no longer available (part of a recent >>> update, or >>> > part of 5.4?). Can someone tell me what, if anything >>> replaces this? Is >>> > it now a mount option, or is there some other way to tune >>> this value? >>> > >>> > Thanks in advance. >>> > >>> > -- scooter >>> > >>> >>> > -- >>> > Linux-cluster mailing list >>> > Linux-cluster at redhat.com >>> > https://www.redhat.com/mailman/listinfo/linux-cluster >>> >>> Nothing replaces it. The glocks are disposed of >>> automatically on an LRU >>> basis when there is enough memory pressure to require it. 
>>> You can alter >>> the amount of memory pressure on the VFS caches (including >>> the glocks) >>> but not specifically the glocks themselves. >>> >>> The idea is that is should be self-tuning now, adjusting >>> itself to the >>> conditions prevailing at the time. If there are any >>> remaining >>> performance issues though, we'd like to know so that they >>> can be >>> addressed, >>> >>> Steve. >>> >>> >>> -- >>> Linux-cluster mailing list >>> Linux-cluster at redhat.com >>> https://www.redhat.com/mailman/listinfo/linux-cluster >>> >>> >>> ____________________________________________________________________ >>> >>> -- >>> Linux-cluster mailing list >>> Linux-cluster at redhat.com >>> https://www.redhat.com/mailman/listinfo/linux-cluster >>> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From lhh at redhat.com Mon Oct 12 15:30:25 2009 From: lhh at redhat.com (Lon Hohberger) Date: Mon, 12 Oct 2009 11:30:25 -0400 Subject: [Linux-cluster] Home-brew SAN/iSCSI In-Reply-To: <20091011132256.GA6461@esri.com> References: <4AD0DA1A.8020804@alteeve.com> <4AD0DB56.4080303@lists.grepular.com> <4AD0DEB9.9010605@alteeve.com> <20091011132256.GA6461@esri.com> Message-ID: <1255361425.29242.60.camel@localhost.localdomain> On Sun, 2009-10-11 at 06:22 -0700, Ray Van Dolson wrote: > Another option, at least with VMware, would be to create a shared disk > that can be seen by all your VM's. > > A bit simpler than setting up iSCSI, though that would be a good thing > to learn in it of itself... ... or to just do the same with KVM or Xen, which doesn't require spending money on VMWare and is part of RHEL... -- Lon From lhh at redhat.com Mon Oct 12 15:31:19 2009 From: lhh at redhat.com (Lon Hohberger) Date: Mon, 12 Oct 2009 11:31:19 -0400 Subject: [Linux-cluster] Home-brew SAN/iSCSI In-Reply-To: <1255361425.29242.60.camel@localhost.localdomain> References: <4AD0DA1A.8020804@alteeve.com> <4AD0DB56.4080303@lists.grepular.com> <4AD0DEB9.9010605@alteeve.com> <20091011132256.GA6461@esri.com> <1255361425.29242.60.camel@localhost.localdomain> Message-ID: <1255361479.29242.62.camel@localhost.localdomain> On Mon, 2009-10-12 at 11:30 -0400, Lon Hohberger wrote: > On Sun, 2009-10-11 at 06:22 -0700, Ray Van Dolson wrote: > > > Another option, at least with VMware, would be to create a shared disk > > that can be seen by all your VM's. > > > > A bit simpler than setting up iSCSI, though that would be a good thing > > to learn in it of itself... > > ... or to just do the same with KVM or Xen, which doesn't require > spending money on VMWare and is part of RHEL... (and Fedora... d'oh!) 
-- Lon From rvandolson at esri.com Mon Oct 12 15:47:25 2009 From: rvandolson at esri.com (Ray Van Dolson) Date: Mon, 12 Oct 2009 08:47:25 -0700 Subject: [Linux-cluster] Home-brew SAN/iSCSI In-Reply-To: <1255361425.29242.60.camel@localhost.localdomain> References: <4AD0DA1A.8020804@alteeve.com> <4AD0DB56.4080303@lists.grepular.com> <4AD0DEB9.9010605@alteeve.com> <20091011132256.GA6461@esri.com> <1255361425.29242.60.camel@localhost.localdomain> Message-ID: <20091012154725.GA30269@esri.com> On Mon, Oct 12, 2009 at 08:30:25AM -0700, Lon Hohberger wrote: > On Sun, 2009-10-11 at 06:22 -0700, Ray Van Dolson wrote: > > > Another option, at least with VMware, would be to create a shared disk > > that can be seen by all your VM's. > > > > A bit simpler than setting up iSCSI, though that would be a good thing > > to learn in it of itself... > > ... or to just do the same with KVM or Xen, which doesn't require > spending money on VMWare and is part of RHEL... Of course. And ESXi is free as well -- use whichever virtualization tech you prefer.. Ray From linux-cluster at lists.grepular.com Mon Oct 12 15:57:45 2009 From: linux-cluster at lists.grepular.com (Mike Cardwell) Date: Mon, 12 Oct 2009 16:57:45 +0100 Subject: [Linux-cluster] Home-brew SAN/iSCSI In-Reply-To: <20091012154725.GA30269@esri.com> References: <4AD0DA1A.8020804@alteeve.com> <4AD0DB56.4080303@lists.grepular.com> <4AD0DEB9.9010605@alteeve.com> <20091011132256.GA6461@esri.com> <1255361425.29242.60.camel@localhost.localdomain> <20091012154725.GA30269@esri.com> Message-ID: <4AD351F9.5050401@lists.grepular.com> Ray Van Dolson wrote: >>> Another option, at least with VMware, would be to create a shared disk >>> that can be seen by all your VM's. >>> >>> A bit simpler than setting up iSCSI, though that would be a good thing >>> to learn in it of itself... >> ... or to just do the same with KVM or Xen, which doesn't require >> spending money on VMWare and is part of RHEL... > > Of course. And ESXi is free as well -- use whichever virtualization > tech you prefer.. As well as ESXi, VMWare server is also completely free. -- Mike Cardwell - IT Consultant and LAMP developer Cardwell IT Ltd. (UK Reg'd Company #06920226) http://cardwellit.com/ Technical Blog: https://secure.grepular.com/blog/ From raju.rajsand at gmail.com Mon Oct 12 16:14:44 2009 From: raju.rajsand at gmail.com (Rajagopal Swaminathan) Date: Mon, 12 Oct 2009 21:44:44 +0530 Subject: [Linux-cluster] Home-brew SAN/iSCSI In-Reply-To: <4AD351F9.5050401@lists.grepular.com> References: <4AD0DA1A.8020804@alteeve.com> <4AD0DB56.4080303@lists.grepular.com> <4AD0DEB9.9010605@alteeve.com> <20091011132256.GA6461@esri.com> <1255361425.29242.60.camel@localhost.localdomain> <20091012154725.GA30269@esri.com> <4AD351F9.5050401@lists.grepular.com> Message-ID: <8786b91c0910120914t5794a8c5n7cfe28fe6d5cf84a@mail.gmail.com> Greetings, On Mon, Oct 12, 2009 at 9:27 PM, Mike Cardwell wrote: > Ray Van Dolson wrote: > >>>> Another option, at least with VMware, would be to create a shared disk >>>> that can be seen by all your VM's. >>>> >>>> A bit simpler than setting up iSCSI, though that would be a good thing >>>> to learn in it of itself... >>> >>> ... or to just do the same with KVM or Xen, which doesn't require >>> spending money on VMWare and is part of RHEL... >> >> Of course. ?And ESXi is free as well -- use whichever virtualization >> tech you prefer.. > > As well as ESXi, VMWare server is also completely free. > Watch out for VMFS .... Would prefer GFS though and on a shared storage of course... 
Incidentally, has anybody worked on a _live_ cluster controlled KVM image (_not_ xen -- there is a how to for that in redhat magazine) which automagically shifts as any cluster controlled resource would do -- like VIP, apace et. al.??? Regards, From brem.belguebli at gmail.com Mon Oct 12 16:49:00 2009 From: brem.belguebli at gmail.com (brem belguebli) Date: Mon, 12 Oct 2009 18:49:00 +0200 Subject: [Linux-cluster] Home-brew SAN/iSCSI In-Reply-To: <8786b91c0910120914t5794a8c5n7cfe28fe6d5cf84a@mail.gmail.com> References: <4AD0DA1A.8020804@alteeve.com> <4AD0DB56.4080303@lists.grepular.com> <4AD0DEB9.9010605@alteeve.com> <20091011132256.GA6461@esri.com> <1255361425.29242.60.camel@localhost.localdomain> <20091012154725.GA30269@esri.com> <4AD351F9.5050401@lists.grepular.com> <8786b91c0910120914t5794a8c5n7cfe28fe6d5cf84a@mail.gmail.com> Message-ID: <29ae894c0910120949s45cd12atbeaac27372f5914a@mail.gmail.com> Check this (https://www.redhat.com/archives/linux-cluster/2009-October/msg00037.html) out, my comments. No underlying GFS or other clustered FS This is a test cluster but works fine. The only missing thing, in comparison to an ESX/HyperV thing is the management GUI (VCENTER or SCVMM), forcing you to command line pilot your VM's. Virt-manager or any libvirt tool (convirture, ovirt...) not being RH cluster aware, and I got allergic to conga ..... ;-) 2009/10/12, Rajagopal Swaminathan : > Greetings, > > On Mon, Oct 12, 2009 at 9:27 PM, Mike Cardwell > wrote: > > Ray Van Dolson wrote: > > > >>>> Another option, at least with VMware, would be to create a shared disk > >>>> that can be seen by all your VM's. > >>>> > >>>> A bit simpler than setting up iSCSI, though that would be a good thing > >>>> to learn in it of itself... > >>> > >>> ... or to just do the same with KVM or Xen, which doesn't require > >>> spending money on VMWare and is part of RHEL... > >> > >> Of course. And ESXi is free as well -- use whichever virtualization > >> tech you prefer.. > > > > As well as ESXi, VMWare server is also completely free. > > > > > Watch out for VMFS .... > > Would prefer GFS though and on a shared storage of course... > > Incidentally, has anybody worked on a _live_ cluster controlled KVM > image (_not_ xen -- there is a how to for that in redhat magazine) > which automagically shifts as any cluster controlled resource would do > -- like VIP, apace et. al.??? > > Regards, > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From linux at alteeve.com Mon Oct 12 19:32:14 2009 From: linux at alteeve.com (Madison Kelly) Date: Mon, 12 Oct 2009 15:32:14 -0400 Subject: [Linux-cluster] Home-brew SAN/iSCSI In-Reply-To: <1255361425.29242.60.camel@localhost.localdomain> References: <4AD0DA1A.8020804@alteeve.com> <4AD0DB56.4080303@lists.grepular.com> <4AD0DEB9.9010605@alteeve.com> <20091011132256.GA6461@esri.com> <1255361425.29242.60.camel@localhost.localdomain> Message-ID: <4AD3843E.70306@alteeve.com> Lon Hohberger wrote: > On Sun, 2009-10-11 at 06:22 -0700, Ray Van Dolson wrote: > >> Another option, at least with VMware, would be to create a shared disk >> that can be seen by all your VM's. >> >> A bit simpler than setting up iSCSI, though that would be a good thing >> to learn in it of itself... > > ... or to just do the same with KVM or Xen, which doesn't require > spending money on VMWare and is part of RHEL... > > -- Lon lol, thanks! I've actually switched away from VMWare to Xen as of last year. 
I can see why VMWare was great before a few years ago, but these days I find Xen to be better with the right hardware. Specifically, a multi-core CPU that supports virtualization. As for the other types of data stores, I've been using DRBD + LVM in Primary/Primary mode already. It was tricky to get going in the first place, but I got it going and now feel that iSCSI/SAN is the next step. :) Madi From linux at alteeve.com Mon Oct 12 19:33:42 2009 From: linux at alteeve.com (Madison Kelly) Date: Mon, 12 Oct 2009 15:33:42 -0400 Subject: [Linux-cluster] Home-brew SAN/iSCSI In-Reply-To: <1255361479.29242.62.camel@localhost.localdomain> References: <4AD0DA1A.8020804@alteeve.com> <4AD0DB56.4080303@lists.grepular.com> <4AD0DEB9.9010605@alteeve.com> <20091011132256.GA6461@esri.com> <1255361425.29242.60.camel@localhost.localdomain> <1255361479.29242.62.camel@localhost.localdomain> Message-ID: <4AD38496.8090605@alteeve.com> Lon Hohberger wrote: > On Mon, 2009-10-12 at 11:30 -0400, Lon Hohberger wrote: >> On Sun, 2009-10-11 at 06:22 -0700, Ray Van Dolson wrote: >> >>> Another option, at least with VMware, would be to create a shared disk >>> that can be seen by all your VM's. >>> >>> A bit simpler than setting up iSCSI, though that would be a good thing >>> to learn in it of itself... >> ... or to just do the same with KVM or Xen, which doesn't require >> spending money on VMWare and is part of RHEL... > > (and Fedora... d'oh!) > > -- Lon Heh, CentOS I find to be pretty good. As a Debian/Ubuntu user, I've got to say that it hurts to say that, though. ;) If only I could get apt working on CentOS... Then I'd be very happy. Madi From jpalmae at gmail.com Tue Oct 13 01:00:10 2009 From: jpalmae at gmail.com (Jorge Palma) Date: Mon, 12 Oct 2009 22:00:10 -0300 Subject: [Linux-cluster] Home-brew SAN/iSCSI In-Reply-To: <4AD0DA1A.8020804@alteeve.com> References: <4AD0DA1A.8020804@alteeve.com> Message-ID: <5b65f1b10910121800j2648d701ifed75c12ab3642e6@mail.gmail.com> Yeah, try with opensolaris + comstar, The another implementation of iscsi target and more! Regards 2009/10/10, Madison Kelly : > Hi all, > > Until now, I've been building 2-node clusters using DRBD+LVM for the > shared storage. I've been teaching myself clustering, so I don't have a > world of capital to sink into hardware at the moment. I would like to > start getting some experience with 3+ nodes using a central SAN disk. > > So I've been pricing out the minimal hardware for a four-node cluster > and have something to start with. My current hiccup though is the SAN > side. I've searched around, but have not been able to get a clear answer. > > Is it possible to build a host machine (CentOS/Debian) to have a > simple MD device and make it available to the cluster nodes as an > iSCSI/SAN device? Being a learning exercise, I am not too worried about > speed or redundancy (beyond testing failure types and recovery). > > Thanks for any insight, advice, pointers! > > Madi > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- Jorge Palma Escobar Ingeniero de Sistemas Red Hat Linux Certified Engineer Certificate N? 804005089418233 From jakov.sosic at srce.hr Tue Oct 13 09:37:23 2009 From: jakov.sosic at srce.hr (Jakov Sosic) Date: Tue, 13 Oct 2009 11:37:23 +0200 Subject: [Linux-cluster] How to monitor secondary link? Message-ID: <20091013113723.51a8e0dc@pc-jsosic.srce.hr> Hi. I have the following problem (RHEL v5.3) with linux cluster, 3 node cluster configuration with qdisk. 
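For the record, the qdisk-heuristic approach weighed below (and, further down the thread, the one actually adopted) looks roughly like this in cluster.conf -- the address, score and timings are placeholders that would need tuning:

    <quorumd interval="1" tko="10" votes="1" label="qdisk">
        <!-- a node only stays qualified while it can reach the NAS via bond1 -->
        <heuristic program="ping -c1 -w1 192.168.100.1" score="1" interval="2"/>
    </quorumd>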
I have two interfaces - bond0 and bond1. Bond0 is used for cluster communication (heartbeat) and virtual ip addresses for services plus it's the interface that's bridged for Xen machines. Bond1 is used only for accessing iSCSI NAS storage. NAS is configured with CLVM. Problem arises when link to NAS is down (ifdown bond1). Cluster doesn't relocate Xen's to the other nodes, so it's obvious that cluster does not monitor bond1 link. I have three ideas how to solve this, but I'm not too happy about neither of them. 1. heuristic pings in qdisk configuration. I'm not too happy about this one because I would had to add all three nodes to heuristics, and tune the ping ratio... And maybe when there would be bursts and those links too occupied, some ping packets could get lost and get me in trouble. 2. move heartbeat over to bond1 That would get me in new trouble, because I'm doing the fencing through bond0, so... :-/ And also, again there's a potentially problem when services start to really push the storages and traffic to storage goes up. 3. adding 3 virtual IP addresses on bond1 network, and adding them to some services, each service running on one of the nodes. I don't like this because it's a "hack" and not a real solution :-/ But it seems to be the best choice for the moment... Problem is I also didn't test this one, and this might resolve in IP relocating, and Xen's staying in place :-/ Any ideas? -- | Jakov Sosic | ICQ: 28410271 | PGP: 0x965CAE2D | ================================================================= | start fighting cancer -> http://www.worldcommunitygrid.org/ | From baishuwei at gmail.com Tue Oct 13 09:57:42 2009 From: baishuwei at gmail.com (Bai Shuwei) Date: Tue, 13 Oct 2009 17:57:42 +0800 Subject: [Linux-cluster] hwo to build a virtual FC-san environment Message-ID: All: I want to build a viratual FC-SAN environment for learning. But I don't know whether there are some useful documents and tools for it. Hoping get you help on it. Best Regards Bai Shuwei -- Love other people, as same as love yourself! Don't think all the time, do it by your hands! Personal URL: http://dslab.lzu.edu.cn:8080/members/baishw/ E-Mail: baishuwei at gmail.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From carlopmart at gmail.com Tue Oct 13 10:22:34 2009 From: carlopmart at gmail.com (carlopmart) Date: Tue, 13 Oct 2009 12:22:34 +0200 Subject: [Linux-cluster] using NFS as a shared storage for RHCS Message-ID: <4AD454EA.2090802@gmail.com> Hi all, Due to limitations and performance problems that contanins GFS and GFS2, I think to use a OpenSolaris NFS server (with ZFS) to serve shared storage for three cluster nodes using RHEL5.4. Somebody have tried this type of configruation?? any special recommendations?? Thanks. -- CL Martinez carlopmart {at} gmail {d0t} com From jakov.sosic at srce.hr Tue Oct 13 12:42:45 2009 From: jakov.sosic at srce.hr (Jakov Sosic) Date: Tue, 13 Oct 2009 14:42:45 +0200 Subject: [Linux-cluster] How to monitor secondary link? In-Reply-To: <20091013113723.51a8e0dc@pc-jsosic.srce.hr> References: <20091013113723.51a8e0dc@pc-jsosic.srce.hr> Message-ID: <20091013144245.78531cad@pc-jsosic.srce.hr> On Tue, 13 Oct 2009 11:37:23 +0200 Jakov Sosic wrote: > I don't like this because it's a "hack" and not a real solution :-/ > But it seems to be the best choice for the moment... 
Problem is I also > didn't test this one, and this might resolve in IP relocating, and > Xen's staying in place :-/ Currently I've solved it with qdisk heuristics, but I wonder, how would you do this if you don't have a qdisk? Quorumd without label specified, that only leans on heuristics? -- | Jakov Sosic | ICQ: 28410271 | PGP: 0x965CAE2D | ================================================================= | start fighting cancer -> http://www.worldcommunitygrid.org/ | From brem.belguebli at gmail.com Tue Oct 13 12:45:59 2009 From: brem.belguebli at gmail.com (brem belguebli) Date: Tue, 13 Oct 2009 14:45:59 +0200 Subject: [Linux-cluster] fencing question, self stonith/suicide as last resort fencing Message-ID: <29ae894c0910130545t41306a0ds8b9dc8ace458e90@mail.gmail.com> Hi, This discussion may have occured many times here, but I'd like to reintroduce it. I know a lot of you Guru guys won't agree with the principle but wouldn't that be something that could be implemented and reserved for specific cases (multisite clusters for instance). The idea being, in theses specific multisite cluster cases, if a network partition occurs, no legacy fencing method will success, preventing the services to failover to the sane cluster partition. One solution (it seems to be implemented with different known cluster stacks) is to trigger a hard reboot or a panic (this would be prefered) after legacy fencing times out on the nodes who transitionned from quorate to inquorate. This would allow the sane partition to recover the services from the lost partition after the legacy fencing methods have timed out + the (configurable) self fencing timer (a few minutes). Regards From schlegel at riege.com Tue Oct 13 13:33:35 2009 From: schlegel at riege.com (Gunther Schlegel) Date: Tue, 13 Oct 2009 15:33:35 +0200 Subject: [Linux-cluster] OT: how to monitor VLAN membership Message-ID: <4AD481AF.1000204@riege.com> Hi, this is a bit OT, but definitely cluster-related. I am looking for a way to test the switch VLAN configuration from my RHEL5.4 cluster nodes. The cluster is running Virtual Machines as services, there is a network bridge for each VLAN configured on the each of the nodes. The cluster nodes itself have no need to access the VM's VLANs, so the bridges do not have any IP assigned to it. As the VMs may be migrated between the nodes, it is crucial that every node may use any required VLAN. While we do monitor the switch configuration, we came accross a broken switch, which seemed to work but did not forward some VLANs although the configuration was correct. I would like the nodes to test whether they have access to a given VLAN themselves to recognize that situation. As there are no IPs assigned to the bridges, simple ping-style tests are not an option. I tried to find something based on arp, but have not been successful yet. Any hint is highly appreciated. best regards, Gunther -- Gunther Schlegel Manager IT Infrastructure ............................................................. Riege Software International GmbH Fon: +49 (2159) 9148 0 Mollsfeld 10 Fax: +49 (2159) 9148 11 40670 Meerbusch Web: www.riege.com Germany E-Mail: schlegel at riege.com --- --- Handelsregister: Managing Directors: Amtsgericht Neuss HRB-NR 4207 Christian Riege USt-ID-Nr.: DE120585842 Gabriele Riege Johannes Riege ............................................................. YOU CARE FOR FREIGHT, WE CARE FOR YOU -------------- next part -------------- A non-text attachment was scrubbed... 
Name: schlegel.vcf Type: text/x-vcard Size: 346 bytes Desc: not available URL: From raju.rajsand at gmail.com Tue Oct 13 14:53:48 2009 From: raju.rajsand at gmail.com (Rajagopal Swaminathan) Date: Tue, 13 Oct 2009 20:23:48 +0530 Subject: [Linux-cluster] OT: how to monitor VLAN membership In-Reply-To: <4AD481AF.1000204@riege.com> References: <4AD481AF.1000204@riege.com> Message-ID: <8786b91c0910130753v25953b06kd7983625ea22d0a4@mail.gmail.com> Greetings, In addition what suggestions others give, just make sure that the cluster hearbeat ports on the switch is configured to multicast. Regards, Rajagopal On 10/13/09, Gunther Schlegel wrote: > Hi, > > this is a bit OT, but definitely cluster-related. > > > I am looking for a way to test the switch VLAN configuration from my > RHEL5.4 cluster nodes. > > The cluster is running Virtual Machines as services, there is a network > bridge for each VLAN configured on the each of the nodes. The cluster > nodes itself have no need to access the VM's VLANs, so the bridges do > not have any IP assigned to it. As the VMs may be migrated between the > nodes, it is crucial that every node may use any required VLAN. > > While we do monitor the switch configuration, we came accross a broken > switch, which seemed to work but did not forward some VLANs although the > configuration was correct. I would like the nodes to test whether they > have access to a given VLAN themselves to recognize that situation. > > As there are no IPs assigned to the bridges, simple ping-style tests are > not an option. I tried to find something based on arp, but have not been > successful yet. > > Any hint is highly appreciated. > > > best regards, Gunther > > -- > Gunther Schlegel > Manager IT Infrastructure > > > ............................................................. > Riege Software International GmbH Fon: +49 (2159) 9148 0 > Mollsfeld 10 Fax: +49 (2159) 9148 11 > 40670 Meerbusch Web: www.riege.com > Germany E-Mail: schlegel at riege.com > --- --- > Handelsregister: Managing Directors: > Amtsgericht Neuss HRB-NR 4207 Christian Riege > USt-ID-Nr.: DE120585842 Gabriele Riege > Johannes Riege > ............................................................. > YOU CARE FOR FREIGHT, WE CARE FOR YOU > > > > From brem.belguebli at gmail.com Tue Oct 13 15:19:30 2009 From: brem.belguebli at gmail.com (brem belguebli) Date: Tue, 13 Oct 2009 17:19:30 +0200 Subject: [Linux-cluster] OT: how to monitor VLAN membership In-Reply-To: <4AD481AF.1000204@riege.com> References: <4AD481AF.1000204@riege.com> Message-ID: <29ae894c0910130819l1321ab3ao3495eaf65394c0fc@mail.gmail.com> Hi, I can't see how your problem could be solved. Your wrong switch, as long as it has ports UP in a given Vlan (your physical hosts) , wether or not this Vlan is well propagated to its adjacent switches, will consider the Vlan being UP (not pruned). Every host (physical) connected to this switch will see each other (at MAC level and IP for your VM's) but not the ones connected to other switches on the same Vlan. >From what I can guess, your switch is connected to other ones thru trunks (802.1q Vlan tagged links) but this specific VlanID was not propagated between all the switches. This is generally due to a misconfiguration, ie on IOS cisco based switches switchport trunk allowed vlan 1, x, y except your specific VM Vlan. 
The only way to address this would be to have a L3 (IP) interface configured on your core switches (not on the leaf ones where your hosts are physically connected) on which the Vlan is present and test it thru ping. Brem 2009/10/13, Gunther Schlegel : > Hi, > > this is a bit OT, but definitely cluster-related. > > > I am looking for a way to test the switch VLAN configuration from my RHEL5.4 > cluster nodes. > > The cluster is running Virtual Machines as services, there is a network > bridge for each VLAN configured on the each of the nodes. The cluster nodes > itself have no need to access the VM's VLANs, so the bridges do not have any > IP assigned to it. As the VMs may be migrated between the nodes, it is > crucial that every node may use any required VLAN. > > While we do monitor the switch configuration, we came accross a broken > switch, which seemed to work but did not forward some VLANs although the > configuration was correct. I would like the nodes to test whether they have > access to a given VLAN themselves to recognize that situation. > > As there are no IPs assigned to the bridges, simple ping-style tests are not > an option. I tried to find something based on arp, but have not been > successful yet. > > Any hint is highly appreciated. > > > best regards, Gunther > > -- > Gunther Schlegel > Manager IT Infrastructure > > > ............................................................. > Riege Software International GmbH Fon: +49 (2159) 9148 0 > Mollsfeld 10 Fax: +49 (2159) 9148 11 > 40670 Meerbusch Web: www.riege.com > Germany E-Mail: schlegel at riege.com > --- --- > Handelsregister: Managing Directors: > Amtsgericht Neuss HRB-NR 4207 Christian Riege > USt-ID-Nr.: DE120585842 Gabriele Riege > Johannes Riege > ............................................................. > YOU CARE FOR FREIGHT, WE CARE FOR YOU > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > From jeff.sturm at eprize.com Tue Oct 13 15:18:31 2009 From: jeff.sturm at eprize.com (Jeff Sturm) Date: Tue, 13 Oct 2009 11:18:31 -0400 Subject: [Linux-cluster] using NFS as a shared storage for RHCS In-Reply-To: <4AD454EA.2090802@gmail.com> References: <4AD454EA.2090802@gmail.com> Message-ID: <64D0546C5EBBD147B75DE133D798665F03F3ECF7@hugo.eprize.local> > -----Original Message----- > Due to limitations and performance problems that contanins GFS and GFS2, I think > to use a OpenSolaris NFS server (with ZFS) to serve shared storage for three cluster > nodes using RHEL5.4. > > Somebody have tried this type of configruation?? any special recommendations?? You're probably not going to get any good recommendations unless you can tell us something about your application. GFS and GFS2 (like OCFS) are clustered filesystems that work at a layer above shared block storage (iSCSI, FC, etc.) SAN. NFS (like CIFS) is a distributed filesystem that does not require any shared block storage or SAN appliance. GFS and NFS are entirely different. NFS works well for some purposes, for others a clustered filesystem is superior. Note that you can export a GFS filesystem to mount it remotely with NFS or CIFS, so the two are not mutually exclusive at all. That can be a good approach to build a fault-tolerant NFS service as well. 
-Jeff From carlopmart at gmail.com Tue Oct 13 15:35:34 2009 From: carlopmart at gmail.com (carlopmart) Date: Tue, 13 Oct 2009 17:35:34 +0200 Subject: [Linux-cluster] using NFS as a shared storage for RHCS In-Reply-To: <64D0546C5EBBD147B75DE133D798665F03F3ECF7@hugo.eprize.local> References: <4AD454EA.2090802@gmail.com> <64D0546C5EBBD147B75DE133D798665F03F3ECF7@hugo.eprize.local> Message-ID: <4AD49E46.8050803@gmail.com> Jeff Sturm wrote: >> -----Original Message----- >> Due to limitations and performance problems that contanins GFS and > GFS2, I think >> to use a OpenSolaris NFS server (with ZFS) to serve shared storage for > three cluster >> nodes using RHEL5.4. >> >> Somebody have tried this type of configruation?? any special > recommendations?? > > You're probably not going to get any good recommendations unless you can > tell us something about your application. > > GFS and GFS2 (like OCFS) are clustered filesystems that work at a layer > above shared block storage (iSCSI, FC, etc.) SAN. > > NFS (like CIFS) is a distributed filesystem that does not require any > shared block storage or SAN appliance. GFS and NFS are entirely > different. NFS works well for some purposes, for others a clustered > filesystem is superior. > > Note that you can export a GFS filesystem to mount it remotely with NFS > or CIFS, so the two are not mutually exclusive at all. That can be a > good approach to build a fault-tolerant NFS service as well. > > -Jeff > > I need to install three basic services on this cluster: a corporative proxy (squid), MTA outbound server (postfix) and a dns slave service. My problem is that I can't use noatime,nodiratime flags if i use GFS/GFS2 to deploy these services because all needs this flags activated ... and I don't want to use external software like ocfs2 ... -- CL Martinez carlopmart {at} gmail {d0t} com From dwight.hubbard at efausol.com Tue Oct 13 17:41:49 2009 From: dwight.hubbard at efausol.com (dwight.hubbard at efausol.com) Date: Tue, 13 Oct 2009 10:41:49 -0700 (PDT) Subject: [Linux-cluster] using NFS as a shared storage for RHCS In-Reply-To: <1434676433.5451255455579975.JavaMail.root@pdxsv0.efausol.com> Message-ID: <1848274437.5471255455709313.JavaMail.root@pdxsv0.efausol.com> NFS performance on top of ZFS can be very poor unless you have NVram or battery backed cache. Dwight Hubbard, RHCE/VCP Systems Architect, Effective Automation Solutions Inc Email: dwight.hubbard at efausol.coom Phone: 503.951.3617 ----- Original Message ----- From: "carlopmart" To: "linux clustering" Sent: Tuesday, October 13, 2009 3:22:34 AM GMT -08:00 US/Canada Pacific Subject: [Linux-cluster] using NFS as a shared storage for RHCS Hi all, Due to limitations and performance problems that contanins GFS and GFS2, I think to use a OpenSolaris NFS server (with ZFS) to serve shared storage for three cluster nodes using RHEL5.4. Somebody have tried this type of configruation?? any special recommendations?? Thanks. -- CL Martinez carlopmart {at} gmail {d0t} com -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From linux-cluster at lists.grepular.com Tue Oct 13 18:19:57 2009 From: linux-cluster at lists.grepular.com (Mike Cardwell) Date: Tue, 13 Oct 2009 19:19:57 +0100 Subject: [Linux-cluster] using NFS as a shared storage for RHCS In-Reply-To: <4AD49E46.8050803@gmail.com> References: <4AD454EA.2090802@gmail.com> <64D0546C5EBBD147B75DE133D798665F03F3ECF7@hugo.eprize.local> <4AD49E46.8050803@gmail.com> Message-ID: <4AD4C4CD.4050809@lists.grepular.com> carlopmart wrote: > I need to install three basic services on this cluster: a corporative > proxy (squid), MTA outbound server (postfix) and a dns slave service. My > problem is that I can't use noatime,nodiratime flags if i use GFS/GFS2 > to deploy these services because all needs this flags activated ... and > I don't want to use external software like ocfs2 ... Why do you think you need a clustered filesystem then? None of those services require one... I wouldn't recommend adding the complexity of a clustered filesystem unless you really have to; all it will do is reduce the reliability of the system. Each of those applications can work with their own standalone local filesystems... -- Mike Cardwell - IT Consultant and LAMP developer Cardwell IT Ltd. (UK Reg'd Company #06920226) http://cardwellit.com/ Technical Blog: https://secure.grepular.com/blog/ From brem.belguebli at gmail.com Tue Oct 13 21:48:00 2009 From: brem.belguebli at gmail.com (brem belguebli) Date: Tue, 13 Oct 2009 23:48:00 +0200 Subject: [Linux-cluster] using NFS as a shared storage for RHCS In-Reply-To: <4AD4C4CD.4050809@lists.grepular.com> References: <4AD454EA.2090802@gmail.com> <64D0546C5EBBD147B75DE133D798665F03F3ECF7@hugo.eprize.local> <4AD49E46.8050803@gmail.com> <4AD4C4CD.4050809@lists.grepular.com> Message-ID: <29ae894c0910131448r1338e1e2g8270c4925156ddf6@mail.gmail.com> Definitely right in saying that all those apps do definitely not need a clustered FS. If one of your concerns is high availability, local FS cannot be the option unless some DRBD setup is done (I have no idea on how complex or not it can be). The classical way, would be to use shared storage (block device based--> cannot be NFS) FC, ISCSI, on which LVM (CLVM or HA LVM) volumes are built holding themselves legacy FS (ext3). Ideally, CLVM is my prefered option, the only problem with it being that the Redhat cluster shipped resource script doesn't handle exclusive activation "yet" (something needed when runnning active/passive clustered services with CLVM) and you'll need to use instead Rafael Mico Miranda resource script (in this post https://www.redhat.com/archives/cluster-devel/2009-June/msg00020.html) 2009/10/13 Mike Cardwell : > carlopmart wrote: > >> I need to install three basic services on this cluster: a corporative >> proxy (squid), MTA outbound server (postfix) and a dns slave service. My >> problem is that I can't use noatime,nodiratime flags if i use GFS/GFS2 to >> deploy these services because all needs this flags activated ... and I don't >> want to use external software like ocfs2 ... > > Why do you think you need a clustered filesystem then? None of those > services require one... I wouldn't recommend adding the complexity of a > clustered filesystem unless you really have to; all it will do is reduce the > reliability of the system. Each of those applications can work with their > own standalone local filesystems... > > -- > Mike Cardwell - IT Consultant and LAMP developer > Cardwell IT Ltd. 
(UK Reg'd Company #06920226) http://cardwellit.com/ > Technical Blog: https://secure.grepular.com/blog/ > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From carlopmart at gmail.com Wed Oct 14 11:22:45 2009 From: carlopmart at gmail.com (carlopmart) Date: Wed, 14 Oct 2009 13:22:45 +0200 Subject: [Linux-cluster] Some problems using fence_vmware_ng under ESXi 4 Message-ID: <4AD5B485.3000103@gmail.com> Hi all, I have installed two rhel5.4 nodes virtual guests, el5prodnode01 and el5prodnode02, under esxi 4 host and I need to use fence_vmware_ng as a fence device. All works ok except when ESXi starts or is rebooted. I have configured under ESXi host to start automatically el5prodnode01 only when host is rebooted or starts, but when el5prodnode01 guest automatically starts tries to launch el5prodnode02 every time. Why? Is this the normal procedure for fence_vmware_ng device?? How can I stop this feature or malfunction?? My cluster.conf is: -- Lon From gigi.mathew-1 at nasa.gov Fri Oct 23 05:34:09 2009 From: gigi.mathew-1 at nasa.gov (Mathew, Gigi (JSC-EG)[Jacobs Technology]) Date: Fri, 23 Oct 2009 00:34:09 -0500 Subject: [Linux-cluster] GFS File System Show 0 Byte Message-ID: Hi: I recently installed GFS2 files systems in a clustered environment and noticed that some of the user directories are showing zero byte in size including the . and .. directories but if you do a ls -al you will see that it has subdirectories and files. If you go to the subdirectory, I do see the same pattern. But this is NOT for all users, but on certain directories only. I did some search and so forth no luck. Has any seen this behavior before? Has anyone has any suggestions? Thanks Gigi Mathew -------------- next part -------------- An HTML attachment was scrubbed... URL: From Martin.Waite at datacash.com Fri Oct 23 16:26:51 2009 From: Martin.Waite at datacash.com (Martin Waite) Date: Fri, 23 Oct 2009 17:26:51 +0100 Subject: [Linux-cluster] some questions about rgmanager Message-ID: Hi, Are there any guidelines about how to write resource scripts that will be run by rgmanager /clurgmgrd ? I have been tracing execution through rg_test, but I don't know how representative this is. For example, performing a service check through rg_test calls just about every script in /usr/share/cluster with the "meta-data" command, then calling service.sh with command "status", and finally the resource script with the command "status". Is this what will happen when clurgmgrd starts or stops a service ? Is there a specification covering the environment variables supplied to the resource scripts - eg. OCF_RESOURCE_INSTANCE ? Are the actions of the various scripts documented or specified somewhere ? Do they tend to change across releases ? Is there a standard way of extending the monitoring performed by the scripts, or do I just edit the supplied scripts to suit ? During experiments in configuring a service, the cluster often reached a state where clustat reports a service as "failed". What is the best way of recovering from this state ? I cannot see that clusvcadm can be used to recover from this state, and so far the only path to recovery appears to be to restart rgmanager on all cluster nodes. Thanks in advance for any pointers on this. 
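In case it helps frame the question, this is the shape of the wrapper script I have been experimenting with (the daemon name, paths and checks are just placeholders, and I am not claiming this is a complete or correct agent):

#!/bin/bash
#
# Minimal sketch of a script-style resource for rgmanager.
# rgmanager (and rg_test) call it with a single argument:
# start, stop, status or meta-data.

DAEMON=/usr/local/bin/myapp        # placeholder
PIDFILE=/var/run/myapp.pid         # placeholder, written by the daemon itself

case "$1" in
  start)
      "$DAEMON"
      exit $?
      ;;
  stop)
      [ -f "$PIDFILE" ] && kill "$(cat "$PIDFILE")"
      exit 0
      ;;
  status)
      # this is where I would like to hang the extra health checks
      [ -f "$PIDFILE" ] && kill -0 "$(cat "$PIDFILE")" 2>/dev/null
      exit $?
      ;;
  meta-data)
      # OCF-style XML description of the resource would be emitted here
      exit 0
      ;;
  *)
      exit 1
      ;;
esac

My assumption is that "status" is the intended place to hang additional monitoring - or is there a cleaner hook ?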
-- Martin From brem.belguebli at gmail.com Fri Oct 23 18:20:41 2009 From: brem.belguebli at gmail.com (brem belguebli) Date: Fri, 23 Oct 2009 20:20:41 +0200 Subject: [Linux-cluster] some questions about rgmanager In-Reply-To: References: Message-ID: <29ae894c0910231120y21cb6ed1q2a106e359b0dc399@mail.gmail.com> 2009/10/23 Martin Waite : > Hi, > > Are there any guidelines about how to write resource scripts that will > be run by rgmanager /clurgmgrd ? > > I have been tracing execution through rg_test, but I don't know how > representative this is. ?For example, performing a service check through > rg_test calls just about every script in /usr/share/cluster with the > "meta-data" command, then calling service.sh with command "status", and > finally the resource script with the command "status". ? Is this what > will happen when clurgmgrd starts or stops a service ? > > Is there a specification covering the environment variables supplied to > the resource scripts - eg. OCF_RESOURCE_INSTANCE ? Usefull info can be found at http://sources.redhat.com/cluster/wiki/RGManager > > Are the actions of the various scripts documented or specified somewhere > ? ? Do they tend to change across releases ? > > Is there a standard way of extending the monitoring performed by the > scripts, or do I just edit the supplied scripts to suit ? > > During experiments in configuring a service, the cluster often reached a > state where clustat reports a service as "failed". ?What is the best way > of recovering from this state ? ?I cannot see that clusvcadm can be used > to recover from this state, and so far the only path to recovery appears > to be to restart rgmanager on all cluster nodes. > >From my experience, no need from restarting rgmanager, just disable the failed service (clusvcadm -D myfailedservice,), find out/fix what caused the service to fail (in general scripting errors), restart the service (clusvcadm -e myfailedservice) > Thanks in advance for any pointers on this. > > -- Martin > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From jakov.sosic at srce.hr Sat Oct 24 18:56:15 2009 From: jakov.sosic at srce.hr (Jakov Sosic) Date: Sat, 24 Oct 2009 20:56:15 +0200 Subject: [Linux-cluster] Virsh instead of XM? In-Reply-To: <1256224381.14284.36.camel@localhost.localdomain> References: <20091015234840.5b91e2ab@nb-jsosic> <1256224381.14284.36.camel@localhost.localdomain> Message-ID: <20091024205615.0fd912a3@nb-jsosic> On Thu, 22 Oct 2009 11:13:01 -0400 Lon Hohberger wrote: > Put your domains in /etc/xen and remove the path attribute. 'virsh' > does not have a way to specify alternative search paths, consequently, > using 'path' requires use of 'xm'. 
Here is line from cluster.conf: And here is my /etc/xen/myvm1.xml: import os, re arch = os.uname()[4] if re.search('64', arch): arch_libdir = 'lib64' else: arch_libdir = 'lib' kernel = "/usr/lib/xen/boot/hvmloader" builder='hvm' memory = 512 shadow_memory = 8 name = "myvm1" uuid = "06ef0bfe-1162-4fc4-15d8-11b92ee4a000" vif = [ 'type=ioemu, bridge=xenbr0' ] disk = [ 'phy:/dev/VolGroup0/myvm1,ioemu:hda,w', 'phy:/dev/VolGroup0/myvm1_data,ioemu:hdb,w', 'file:/xen/local/iso-images/winxp-sp2.iso,ioemu:hdc:cdrom,r' ] on_poweroff = 'destroy' on_reboot = 'restart' on_crash = 'restart' device_model = '/usr/lib64/xen/bin/qemu-dm' boot="dc" sdl=0 vnc=1 vnclisten="0.0.0.0" vncdisplay=13 vncunused=0 vncconsole=1 vncpasswd='' stdvga=0 serial='pty' usbdevice='tablet' And here's the part from /var/log/messages, when I try to enable machine through cluster management: Oct 24 20:45:49 lego01 clurgmgrd[5873]: Starting stopped service vm:myvm1 Oct 24 20:45:49 lego01 libvirtd: 20:45:49.910: error : /etc/xen/myvm1.xml:1: expecting an assignment Oct 24 20:45:49 lego01 libvirtd: 20:45:49.910: error : /etc/xen/xenscreenrc:1: expecting an assignment Oct 24 20:45:49 lego01 libvirtd: 20:45:49.950: error : /etc/xen/myvm1.xml:1: expecting an assignment Oct 24 20:45:49 lego01 libvirtd: 20:45:49.950: error : /etc/xen/xenscreenrc:1: expecting an assignment Oct 24 20:45:49 lego01 clurgmgrd[5873]: start on vm "myvm1" returned 1 (generic error) Oct 24 20:45:49 lego01 clurgmgrd[5873]: #68: Failed to start vm:myvm1; return value: 1 Oct 24 20:45:49 lego01 clurgmgrd[5873]: Stopping service vm:myvm1 Oct 24 20:45:50 lego01 libvirtd: 20:45:50.258: error : /etc/xen/myvm1.xml:1: expecting an assignment Oct 24 20:45:50 lego01 libvirtd: 20:45:50.258: error : /etc/xen/xenscreenrc:1: expecting an assignment Oct 24 20:45:50 lego01 libvirtd: 20:45:50.482: error : /etc/xen/myvm1.xml:1: expecting an assignment Oct 24 20:45:50 lego01 libvirtd: 20:45:50.482: error : /etc/xen/xenscreenrc:1: expecting an assignment Oct 24 20:45:50 lego01 clurgmgrd[5873]: Service vm:myvm1 is recovering Oct 24 20:45:50 lego01 clurgmgrd[5873]: #71: Relocating failed service vm:myvm1 Oct 24 20:45:53 lego01 clurgmgrd[5873]: Service vm:xp-mgmt is stopped Where could be the problem? > Federico's 'xmlfile' patch allows vm.sh to take a name + a full path > to a libvirt .xml file for using virsh, but it's not in RHEL. And where can I find that patch? In case I don't succeed in getting Virsh to run like it should :-/ -- | Jakov Sosic | ICQ: 28410271 | PGP: 0x965CAE2D | ================================================================= | start fighting cancer -> http://www.worldcommunitygrid.org/ | From samer.murad at its.ws Sun Oct 25 08:40:16 2009 From: samer.murad at its.ws (Samer ITS) Date: Sun, 25 Oct 2009 11:40:16 +0300 Subject: [Linux-cluster] High availability mail server Message-ID: <000d01ca554e$c7e4a160$57ade420$%murad@its.ws> Dears I have 4 servers. And I need to install high availability mail system (Postfix, Dovect, Spam, mailscanner, ..) Best regards -------------- next part -------------- An HTML attachment was scrubbed... URL: From jakov.sosic at srce.hr Sun Oct 25 13:34:14 2009 From: jakov.sosic at srce.hr (Jakov Sosic) Date: Sun, 25 Oct 2009 14:34:14 +0100 Subject: [Linux-cluster] Virsh instead of XM? 
In-Reply-To: <20091024205615.0fd912a3@nb-jsosic> References: <20091015234840.5b91e2ab@nb-jsosic> <1256224381.14284.36.camel@localhost.localdomain> <20091024205615.0fd912a3@nb-jsosic> Message-ID: <20091025143414.1b00f83f@nb-jsosic> On Sat, 24 Oct 2009 20:56:15 +0200 Jakov Sosic wrote: > import os, re > arch = os.uname()[4] > if re.search('64', arch): > arch_libdir = 'lib64' > else: > arch_libdir = 'lib' Apparently this is a problem for virsh. I changed that to simple: arch_libdir = 'lib64' and now everything works. Also I had to set up MAC addresses for interfaces for fully virtualized guests. -- | Jakov Sosic | ICQ: 28410271 | PGP: 0x965CAE2D | ================================================================= | start fighting cancer -> http://www.worldcommunitygrid.org/ | From finnzi at finnzi.com Sun Oct 25 18:54:06 2009 From: finnzi at finnzi.com (=?ISO-8859-1?Q?Finnur_=D6rn_Gu=F0mundsson?=) Date: Sun, 25 Oct 2009 18:54:06 +0000 Subject: [Linux-cluster] High availability mail server In-Reply-To: <000d01ca554e$c7e4a160$57ade420$%murad@its.ws> References: <000d01ca554e$c7e4a160$57ade420$%murad@its.ws> Message-ID: <4AE49ECE.6080000@finnzi.com> On 10/25/09 8:40 AM, Samer ITS wrote: > > Dears > > I have 4 servers. And I need to install high availability mail system > (Postfix, Dovect, Spam, mailscanner, ....) > > Best regards > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster Hi there, Now that's nice.....Is there a question in there somewhere? Coming here and asking people to spoonfeed you with configurations and helping before you show any sign of trying these things out your self is very unrealistic. Most of the people on this list are systems administrators, and most of them are very found of seeing people TRY things out them selves before asking. In your configuration i would say that you might need to read up on those programs and see if they can be clustered or even need to, and then read up on Linux HA or RHCS before trying to put this all into production. Running a cluster in production without even knowing the slightest thing about RHCS/Linux HA will bite you in the ass later on, and i am sure your boss would love to know why his mail was offline for 36 hours after your cluster goes wah wah and you try to debug what is wrong without knowing a single thing about it except how to restart the cluster daemons. I'm sorry if i offend you in any way, but you might want to consider hiring a consultant to help you out there.....this is not something you would want to toy with and run without any knowledge...... Hope this encourages you to run a quick Google search before sending a email of what you want to do to a sysadmin mailing list ;) Bgrds, Finnzi -------------- next part -------------- An HTML attachment was scrubbed... URL: From linux at alteeve.com Sun Oct 25 21:39:40 2009 From: linux at alteeve.com (Madison Kelly) Date: Sun, 25 Oct 2009 17:39:40 -0400 Subject: [Linux-cluster] High availability mail server In-Reply-To: <000d01ca554e$c7e4a160$57ade420$%murad@its.ws> References: <000d01ca554e$c7e4a160$57ade420$%murad@its.ws> Message-ID: <4AE4C59C.90908@alteeve.com> Samer ITS wrote: > Dears > > > > I have 4 servers. And I need to install high availability mail system > (Postfix, Dovect, Spam, mailscanner, ?.) > > > > Best regards Hi Damer, Hi have to agree with Finnur, if in idea and not tone. :) Clustering is not very easy. 
It is complex and, as I've learned myself over and over while learning it, it is very easy to break. You need to understand how all the parts tie together so that when you do need to change something, you know how the other components will react. There is no simple canned solution around this -- It's just too small a segment of users. Answer these questions; a) What if your goal with clustering? Performance? Uptime? Scalability? b) These four servers, do they have mechanisms for fencing like IPMI control boards? c) Do you understand terms like Heartbeat, Cluster-aware file systems, iSCSI/SAN/NAS, Fencing, Split-breain, Quorum? d) What operating system and what cluster tools do you plan to/want to use? I'm not trying to discourage you from clustering, honestly. I am, however, trying to keep you from getting in way over your head. I've been playing with clustering for a couple years now and I still constantly run into issues for no other reason than that I simply didn't understand the systems well enough. As for your specific application; Have you though about how you are going to handle user authentication? Will you need to support multiple domains? You have a *lot* of variables you need to sort out in the mail system alone before you can start worrying about the cluster component. If you have a short time line to launch, hire a consultant who is already familiar with clustering. If you have lots of time, start reading and then come back here with specific questions, including steps you've tried, specific errors you've run into and so on. You will find this list very helpful when you show that you've put some effort into your system already. Best of luck! Madi From samer.murad at its.ws Mon Oct 26 06:29:11 2009 From: samer.murad at its.ws (Samer ITS) Date: Mon, 26 Oct 2009 09:29:11 +0300 Subject: [Linux-cluster] High availability mail server In-Reply-To: <4AE4C59C.90908@alteeve.com> References: <000d01ca554e$c7e4a160$57ade420$%murad@its.ws> <4AE4C59C.90908@alteeve.com> Message-ID: <001801ca5605$a90fa340$fb2ee9c0$%murad@its.ws> Dear Madi, I have 2 years of working with sun cluster 3.x So the concept is there, so I want to know how linux clustering is work for mail system Because I still planning for the project. The cluster is needed for Performance & high availability. So I need your advice in the planning & the best requirement for best performance (such servers) Already I have SAN storage (EMC Clariion - CX300) But I didn't buy the server until now but may it will be HP servers. Many thanks in advance. -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Madison Kelly Sent: Monday, October 26, 2009 12:40 AM To: linux clustering Subject: Re: [Linux-cluster] High availability mail server Samer ITS wrote: > Dears > > > > I have 4 servers. And I need to install high availability mail system > (Postfix, Dovect, Spam, mailscanner, ..) > > > > Best regards Hi Damer, Hi have to agree with Finnur, if in idea and not tone. :) Clustering is not very easy. It is complex and, as I've learned myself over and over while learning it, it is very easy to break. You need to understand how all the parts tie together so that when you do need to change something, you know how the other components will react. There is no simple canned solution around this -- It's just too small a segment of users. Answer these questions; a) What if your goal with clustering? Performance? Uptime? Scalability? 
b) These four servers, do they have mechanisms for fencing like IPMI control boards? c) Do you understand terms like Heartbeat, Cluster-aware file systems, iSCSI/SAN/NAS, Fencing, Split-breain, Quorum? d) What operating system and what cluster tools do you plan to/want to use? I'm not trying to discourage you from clustering, honestly. I am, however, trying to keep you from getting in way over your head. I've been playing with clustering for a couple years now and I still constantly run into issues for no other reason than that I simply didn't understand the systems well enough. As for your specific application; Have you though about how you are going to handle user authentication? Will you need to support multiple domains? You have a *lot* of variables you need to sort out in the mail system alone before you can start worrying about the cluster component. If you have a short time line to launch, hire a consultant who is already familiar with clustering. If you have lots of time, start reading and then come back here with specific questions, including steps you've tried, specific errors you've run into and so on. You will find this list very helpful when you show that you've put some effort into your system already. Best of luck! Madi -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From gordan at bobich.net Mon Oct 26 09:05:23 2009 From: gordan at bobich.net (Gordan Bobic) Date: Mon, 26 Oct 2009 09:05:23 +0000 Subject: [Linux-cluster] High availability mail server In-Reply-To: <001801ca5605$a90fa340$fb2ee9c0$%murad@its.ws> References: <000d01ca554e$c7e4a160$57ade420$%murad@its.ws> <4AE4C59C.90908@alteeve.com> <001801ca5605$a90fa340$fb2ee9c0$%murad@its.ws> Message-ID: <4AE56653.4060508@bobich.net> Samer ITS wrote: > Dear Madi, > > I have 2 years of working with sun cluster 3.x > > So the concept is there, so I want to know how linux clustering is work for > mail system > Because I still planning for the project. > > The cluster is needed for Performance & high availability. > > So I need your advice in the planning & the best requirement for best > performance (such servers) > Already I have SAN storage (EMC Clariion - CX300) Unfortunately, the very nature of the way Maildir behaves (lots of files in few directories) makes it very at odds with your intention of using clustering for gaining performance through parallel operation using SAN backed storage with a cluster file system (being GFS1, GFS2 or OCFS2). You'll be better off with a NAS, but even so, you'll find that one machine with local storage will still perform a pair of clustered machines running in parallel under heavy load, because the clustered machines will spend most of their time bouncing locks around. You may find that configuring them as fail-over with an ext3 volume on the SAN that gets mounted _ONLY_ on the machine that's currently running the service works faster. The problem is that most people overlook why clustered file systems are so slow, given the apparently low ping times to the SAN and between machines on gigabit ethernet (or something faster). The generally erroneous assumption is that given that the ping time is typically < 0.1ms, this is negligible compared to the 4-8ms access time of mechanical disks. The problem is that 4-8ms is the wrong figure to be comparing to - if the machine is really hitting the disk for every data fetch, it is going to grind to a halt (think heavy swapping sort of performance). 
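To put rough orders of magnitude side by side (ballpark figures only, not measurements from any particular setup):

disk seek                      ~ 4-8 ms
lock request over the network  ~ 0.1-0.2 ms   (100,000-200,000 ns)
reference served from cache    ~ tens of ns

i.e. every lock that has to cross the wire costs on the order of a few thousand cached references.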
Most of the working data set is expected to be in caches most of the time, which is accessible in < 40ns (when all the latencies between the CPU, MCH and RAM are accounted for). The cluster file system takes this penalty for all accesses where a lock isn't cached (and if both machines are accessing the same data set randomly, the locks aren't going to be locally held most of the time). This may well be fine when you are dealing with large-ish files and your workload is arranged in such a way that accesses to particular data subtrees is typically executed on only one node at a time, but for cases such as a large Maildir being randomly accessed, from multiple nodes, you'll find the performance will tend to fall off a cliff pretty quickly as the number of users and concurrent accesses starts to increase. The only way you are likely to overcome this is by logically partitioning your data sets. Gordan From Martin.Waite at datacash.com Mon Oct 26 11:05:18 2009 From: Martin.Waite at datacash.com (Martin Waite) Date: Mon, 26 Oct 2009 11:05:18 -0000 Subject: [Linux-cluster] some questions about rgmanager In-Reply-To: <29ae894c0910231120y21cb6ed1q2a106e359b0dc399@mail.gmail.com> References: <29ae894c0910231120y21cb6ed1q2a106e359b0dc399@mail.gmail.com> Message-ID: Hi Brem, Thanks for the pointers. The link to "OCF RA API Draft" appears to answer my questions. It will take a while to digest all that. I think you had a typo - "clusvcadm -D myfailedservice" should be "clusvcadm -d myfailedservice". My service (mysql) was failing because "shutdown_wait" was too low, causing stops and restarts to fail. Sure enough, your suggestion works: sudo /usr/sbin/clusvcadm -d mysql_service (fix config) sudo /usr/sbin/clusvcadm -e mysql_service And I suppose that if the service is in a mess on its current node - eg. software error prevents shutdown - then I would disable and then relocate the service: sudo /usr/sbin/clusvcadm -d mysql_service sudo /usr/sbin/clusvcadm -e mysql_service -m othernode regards, Martin -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of brem belguebli Sent: 23 October 2009 19:21 To: linux clustering Subject: Re: [Linux-cluster] some questions about rgmanager 2009/10/23 Martin Waite : > Hi, > > Are there any guidelines about how to write resource scripts that will > be run by rgmanager /clurgmgrd ? > > I have been tracing execution through rg_test, but I don't know how > representative this is. ?For example, performing a service check through > rg_test calls just about every script in /usr/share/cluster with the > "meta-data" command, then calling service.sh with command "status", and > finally the resource script with the command "status". ? Is this what > will happen when clurgmgrd starts or stops a service ? > > Is there a specification covering the environment variables supplied to > the resource scripts - eg. OCF_RESOURCE_INSTANCE ? Usefull info can be found at http://sources.redhat.com/cluster/wiki/RGManager > > Are the actions of the various scripts documented or specified somewhere > ? ? Do they tend to change across releases ? > > Is there a standard way of extending the monitoring performed by the > scripts, or do I just edit the supplied scripts to suit ? > > During experiments in configuring a service, the cluster often reached a > state where clustat reports a service as "failed". ?What is the best way > of recovering from this state ? 
?I cannot see that clusvcadm can be used > to recover from this state, and so far the only path to recovery appears > to be to restart rgmanager on all cluster nodes. > >From my experience, no need from restarting rgmanager, just disable the failed service (clusvcadm -D myfailedservice,), find out/fix what caused the service to fail (in general scripting errors), restart the service (clusvcadm -e myfailedservice) > Thanks in advance for any pointers on this. > > -- Martin > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From linux at alteeve.com Mon Oct 26 12:52:38 2009 From: linux at alteeve.com (Madison Kelly) Date: Mon, 26 Oct 2009 08:52:38 -0400 Subject: [Linux-cluster] High availability mail server In-Reply-To: <4AE56653.4060508@bobich.net> References: <000d01ca554e$c7e4a160$57ade420$%murad@its.ws> <4AE4C59C.90908@alteeve.com> <001801ca5605$a90fa340$fb2ee9c0$%murad@its.ws> <4AE56653.4060508@bobich.net> Message-ID: <4AE59B96.4070409@alteeve.com> Gordan Bobic wrote: > Samer ITS wrote: >> Dear Madi, >> >> I have 2 years of working with sun cluster 3.x >> >> So the concept is there, so I want to know how linux clustering is >> work for >> mail system >> Because I still planning for the project. >> >> The cluster is needed for Performance & high availability. >> >> So I need your advice in the planning & the best requirement for best >> performance (such servers) >> Already I have SAN storage (EMC Clariion - CX300) > > Unfortunately, the very nature of the way Maildir behaves (lots of files > in few directories) makes it very at odds with your intention of using > clustering for gaining performance through parallel operation using SAN > backed storage with a cluster file system (being GFS1, GFS2 or OCFS2). > You'll be better off with a NAS, but even so, you'll find that one > machine with local storage will still perform a pair of clustered > machines running in parallel under heavy load, because the clustered > machines will spend most of their time bouncing locks around. You may > find that configuring them as fail-over with an ext3 volume on the SAN > that gets mounted _ONLY_ on the machine that's currently running the > service works faster. > > The problem is that most people overlook why clustered file systems are > so slow, given the apparently low ping times to the SAN and between > machines on gigabit ethernet (or something faster). The generally > erroneous assumption is that given that the ping time is typically < > 0.1ms, this is negligible compared to the 4-8ms access time of > mechanical disks. The problem is that 4-8ms is the wrong figure to be > comparing to - if the machine is really hitting the disk for every data > fetch, it is going to grind to a halt (think heavy swapping sort of > performance). Most of the working data set is expected to be in caches > most of the time, which is accessible in < 40ns (when all the latencies > between the CPU, MCH and RAM are accounted for). > > The cluster file system takes this penalty for all accesses where a lock > isn't cached (and if both machines are accessing the same data set > randomly, the locks aren't going to be locally held most of the time). 
> > This may well be fine when you are dealing with large-ish files and your > workload is arranged in such a way that accesses to particular data > subtrees is typically executed on only one node at a time, but for cases > such as a large Maildir being randomly accessed, from multiple nodes, > you'll find the performance will tend to fall off a cliff pretty quickly > as the number of users and concurrent accesses starts to increase. > > The only way you are likely to overcome this is by logically > partitioning your data sets. > > Gordan Expanding further on Gordon's post; If you really want to have performance and high availability, you might be better off with a simple two-node cluster using a shared, Primary/Secondary DRBD setup. This will make sure that you data is always duplicated on both nodes without taking the huge locking hit that Gordon is talking about. Then you can use HA Linux/Heartbeat to handle fail-over in the case of the primary node dieing. This setup should be fairly straight forward to setup. If you want more help, be sure to ask more specific questions. Madi From gordon.k.miller at boeing.com Mon Oct 26 14:34:06 2009 From: gordon.k.miller at boeing.com (Miller, Gordon K) Date: Mon, 26 Oct 2009 07:34:06 -0700 Subject: [Linux-cluster] GS2 try_rgrp_unlink consuming lots of CPU Message-ID: Occasionally, we encounter a condition where the CPU system time increases dramatically (30% to 100% of total CPU time) for a period of several seconds to 10's of minutes. Using oprofile we observed that the majority of CPU time was being spent in gfs2_bitfit with rgblk_search and try_rgrp_unlink in the backtrace. Further instrumentation using SystemTap has shown try_rgrp_unlink being called repeatedly during the period of high system usage with durations averaging 400 milliseconds on each call. Often , try_rgrp_unlink will return the same inode as in previous calls. Attached is output from oprofile and a SystemTap probe on the return from try_rgrp_unlink with the number of times rgblk_search (rgblk_search_count) and gfs2_bitfit (bitfit_count) were called during this invocation of try_rgrp_unlink, the duration in seconds of the try_rgrp_unlink function, selected elements of the rgd structure and the returned inode (return->i_ino). In this case, the behavior persisted for 15 minutes beyond the capture listed here. The SystemTap scripts used in this capture follow the output. Our kernel version is 2.6.18-128.7.1 plus the patch to gfs2_bitfit contained in linux-2.6-gfs2-unaligned-access-in-gfs2_bitfit.patch. Has anyone experienced this behavior? Oprofile output: CPU: Core Solo / Duo, speed 2000.21 MHz (estimated) Counted CPU_CLK_UNHALTED events (Unhalted clock cycles) with a unit mask of 0x00 (Unhalted core cycles) count 100000 samples cum. samples % cum. % linenr info image name app name symbol name 742159 742159 47.2355 47.2355 rgrp.c:181 gfs2.ko gfs2 gfs2_bitfit 208285 950444 13.2565 60.4920 (no location information) stap_ff8ca210ca2a6219e30b9c0725d9186c_27191 stap_ff8ca210ca2a6219e30b9c0725d9186c_27191 (no symbols) 202589 1153033 12.8940 73.3860 process.c:0 vmlinux vmlinux mwait_idle ... 11610 1304239 0.7389 83.0096 rgrp.c:1336 gfs2.ko gfs2 rgblk_search ... 
2430 1394001 0.1547 88.7226 rgrp.c:912 gfs2.ko gfs2 try_rgrp_unlink SystemTap output: Time Host 19:48:23.224 l-ce2 try_rgrp_unlink return=4141184592 rgblk_search_count=56870 bitfit_count=153350 duration=0.406866 cpu=2 ppid=8791 pid=9195 tid=10562 java 19:48:23.224 l-ce2 last_unlinked=56897 rgd->rd_addr=17, rgd->rd_data0=22, rgd->rd_data=65548, rgd->rd_length=5 rgd->rd_last_alloc=58763 rgd->rd_flags=0x5 19:48:23.224 l-ce2 return->i_ino=56897 -- 19:48:23.644 l-ce2 try_rgrp_unlink return=4141184592 rgblk_search_count=56870 bitfit_count=153350 duration=0.419159 cpu=1 ppid=8791 pid=10173 tid=10423 java 19:48:23.644 l-ce2 last_unlinked=56897 rgd->rd_addr=17, rgd->rd_data0=22, rgd->rd_data=65548, rgd->rd_length=5 rgd->rd_last_alloc=58763 rgd->rd_flags=0x5 19:48:23.644 l-ce2 return->i_ino=56897 -- 19:48:24.056 l-ce2 try_rgrp_unlink return=4141184592 rgblk_search_count=56870 bitfit_count=153350 duration=0.412728 cpu=0 ppid=8791 pid=9195 tid=10536 java 19:48:24.056 l-ce2 last_unlinked=56897 rgd->rd_addr=17, rgd->rd_data0=22, rgd->rd_data=65548, rgd->rd_length=5 rgd->rd_last_alloc=58763 rgd->rd_flags=0x5 19:48:24.056 l-ce2 return->i_ino=56897 -- 19:48:24.467 l-ce2 try_rgrp_unlink return=4141184592 rgblk_search_count=56870 bitfit_count=153350 duration=0.409639 cpu=3 ppid=8791 pid=9195 tid=9282 java 19:48:24.467 l-ce2 last_unlinked=56897 rgd->rd_addr=17, rgd->rd_data0=22, rgd->rd_data=65548, rgd->rd_length=5 rgd->rd_last_alloc=58763 rgd->rd_flags=0x5 19:48:24.467 l-ce2 return->i_ino=56897 -- 19:48:24.876 l-ce2 try_rgrp_unlink return=4141184592 rgblk_search_count=56870 bitfit_count=153350 duration=0.409134 cpu=3 ppid=8791 pid=9195 tid=10540 java 19:48:24.876 l-ce2 last_unlinked=56897 rgd->rd_addr=17, rgd->rd_data0=22, rgd->rd_data=65548, rgd->rd_length=5 rgd->rd_last_alloc=58763 rgd->rd_flags=0x5 19:48:24.876 l-ce2 return->i_ino=56897 -- 19:48:25.276 l-ce2 try_rgrp_unlink return=4141184592 rgblk_search_count=56870 bitfit_count=153350 duration=0.398694 cpu=2 ppid=8791 pid=9195 tid=10535 java 19:48:25.276 l-ce2 last_unlinked=56897 rgd->rd_addr=17, rgd->rd_data0=22, rgd->rd_data=65548, rgd->rd_length=5 rgd->rd_last_alloc=58763 rgd->rd_flags=0x5 19:48:25.276 l-ce2 return->i_ino=56897 -- 19:48:25.693 l-ce2 try_rgrp_unlink return=4141184592 rgblk_search_count=56870 bitfit_count=153350 duration=0.417088 cpu=0 ppid=8769 pid=8791 tid=8892 java 19:48:25.693 l-ce2 last_unlinked=56897 rgd->rd_addr=17, rgd->rd_data0=22, rgd->rd_data=65548, rgd->rd_length=5 rgd->rd_last_alloc=58763 rgd->rd_flags=0x5 19:48:25.693 l-ce2 return->i_ino=56897 -- 19:48:26.107 l-ce2 try_rgrp_unlink return=4147791632 rgblk_search_count=56844 bitfit_count=153324 duration=0.414038 cpu=0 ppid=8790 pid=20555 tid=20555 df 19:48:26.107 l-ce2 last_unlinked=56869 rgd->rd_addr=17, rgd->rd_data0=22, rgd->rd_data=65548, rgd->rd_length=5 rgd->rd_last_alloc=58763 rgd->rd_flags=0x5 19:48:26.107 l-ce2 return->i_ino=56869 -- 19:48:26.509 l-ce2 try_rgrp_unlink return=4141184592 rgblk_search_count=56870 bitfit_count=153350 duration=0.402198 cpu=3 ppid=8791 pid=9068 tid=9418 java 19:48:26.509 l-ce2 last_unlinked=56897 rgd->rd_addr=17, rgd->rd_data0=22, rgd->rd_data=65548, rgd->rd_length=5 rgd->rd_last_alloc=58763 rgd->rd_flags=0x5 19:48:26.509 l-ce2 return->i_ino=56897 -- 19:48:26.919 l-ce2 try_rgrp_unlink return=4147791632 rgblk_search_count=56844 bitfit_count=153324 duration=0.409572 cpu=3 ppid=8791 pid=9195 tid=9266 java 19:48:26.919 l-ce2 last_unlinked=56869 rgd->rd_addr=17, rgd->rd_data0=22, rgd->rd_data=65548, rgd->rd_length=5 rgd->rd_last_alloc=58763 
rgd->rd_flags=0x5 19:48:26.919 l-ce2 return->i_ino=56869 SystemTap probe: global rgblk_search_count global bitfit_count function tod:string() { msec = gettimeofday_ns() /1000000 sec = msec / 1000 msec = msec - (sec * 1000) return sprintf("%s.%03d", substr(ctime(sec), 11, 8), msec) } probe module("gfs2").function("try_rgrp_unlink") { tid=tid() tids[tid]=tid rgblk_search_count[tid]=0 bitfit_count[tid]=0 try_rgrp_unlink_start[tid]=gettimeofday_us() } probe module("gfs2").function("try_rgrp_unlink").return { tid=tid() duration=gettimeofday_us() - try_rgrp_unlink_start[tid] dur_sec=duration/1000000 dur_usec=duration-(dur_sec * 1000000) time_host=sprintf("%s %s", tod(), hostname) printf("%s try_rgrp_unlink return=%d rgblk_search_count=%d bitfit_count=%d duration=%d.%06d cpu=%d ppid=%d pid=%d tid=%d %s\n", time_host, $return, rgblk_search_count[tid], bitfit_count[tid], dur_sec, dur_usec, cpu(), ppid(), pid(), tid, execname()) printf("%s last_unlinked=%d rgd->rd_addr=%d, rgd->rd_data0=%d, rgd->rd_data=%d, rgd->rd_length=%d rgd->rd_last_alloc=%d rgd->rd_flags=0x%x \n", time_host, kernel_long($last_unlinked), $rgd->rd_addr, $rgd->rd_data0, $rgd->rd_data, $rgd->rd_length, $rgd->rd_last_alloc, $rgd->rd_flags ) if ($return != 0) printf("%s return->i_ino=%d\n", time_host, $return->i_ino) print("\n") delete try_rgrp_unlink_start[tid] delete bitfit_count[tid] delete rgblk_search_count[tid] tids[tid]=0 } probe module("gfs2").function("rgblk_search") { tid=tid() if (tid == tids[tid]) { rgblk_search_count[tid]++ } } probe module("gfs2").function("gfs2_bitfit") { tid=tid() if (tid == tids[tid]) { bitfit_count[tid]+=1 } } From swhiteho at redhat.com Mon Oct 26 14:46:29 2009 From: swhiteho at redhat.com (Steven Whitehouse) Date: Mon, 26 Oct 2009 14:46:29 +0000 Subject: [Linux-cluster] GS2 try_rgrp_unlink consuming lots of CPU In-Reply-To: References: Message-ID: <1256568389.6052.676.camel@localhost.localdomain> Hi, On Mon, 2009-10-26 at 07:34 -0700, Miller, Gordon K wrote: > Occasionally, we encounter a condition where the CPU system time increases dramatically (30% to 100% of total CPU time) for a period of several seconds to 10's of minutes. Using oprofile we observed that the majority of CPU time was being spent in gfs2_bitfit with rgblk_search and try_rgrp_unlink in the backtrace. Further instrumentation using SystemTap has shown try_rgrp_unlink being called repeatedly during the period of high system usage with durations averaging 400 milliseconds on each call. Often , try_rgrp_unlink will return the same inode as in previous calls. Attached is output from oprofile and a SystemTap probe on the return from try_rgrp_unlink with the number of times rgblk_search (rgblk_search_count) and gfs2_bitfit (bitfit_count) were called during this invocation of try_rgrp_unlink, the duration in seconds of the try_rgrp_unlink function, selected elements of the rgd structure and the returned inode (return->i_ino). In this case, the behavior persisted fo! r 1 > 5 minutes beyond the capture listed here. The SystemTap scripts used in this capture follow the output. Our kernel version is 2.6.18-128.7.1 plus the patch to gfs2_bitfit contained in linux-2.6-gfs2-unaligned-access-in-gfs2_bitfit.patch. > > Has anyone experienced this behavior? > There are a couple of things which occur to me at this point. Firstly, I wonder what size the rgrp bitmaps are on your filesystem. You can change the sizes of them at mkfs time and that allows a trade off between the size of each rgrp and the number of rgrps. 
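If memory serves, the knob for that is the -r option to mkfs.gfs2 (resource group size in megabytes) - check the man page on your release for the exact limits. Something along the lines of:

# lock table, journal count/size and device below are only placeholders
mkfs.gfs2 -p lock_dlm -t mycluster:myfs -j 3 -J 16 -r 1024 /dev/myvg/mylv

That is a mkfs-time decision though, so it means re-making the filesystem and restoring the data rather than tuning it in place.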
Sometimes altering this can make a difference to the time taken in bitmap searching. Secondly, are you doing anything along the lines of holding inodes open on one node and unlinking them on another node? There was also a bug which was fixed in RHEL 5.4 (and improved again in 5.5) which meant that the dcache was sometimes holding onto inodes longer than it should have. That can also make the situation worse. You should certainly get better results than you appear to be getting here, Steve. From Martin.Waite at datacash.com Mon Oct 26 15:38:16 2009 From: Martin.Waite at datacash.com (Martin Waite) Date: Mon, 26 Oct 2009 15:38:16 -0000 Subject: [Linux-cluster] distinguishing managed relocation from failover Message-ID: Hi, Is it possible for the rgmanager scripts to distinguish between a managed relocation of a service from a relocation caused by the failure of the current service ? In the event of a relocation due to failure, I want to perform some reconfiguration of the replacement - in this case, updating some transaction counters in a MySQL server - before it takes over. I can do this using a script resource. However, if the relocation is a managed administration action, I don't want to do this reconfiguration. One approach I am considering is to create some sort of sentinel service, which would be allowed to just fail if the node it is currently bound to fails. Under a managed relocation, the sentinel could be relocated first, and then the script service relocated. The script could then check whether the sentinel was bound to the same node, and if so, not perform the reconfiguration. It sounds a bit convoluted though: is there an easier way ? regards, Martin -------------- next part -------------- An HTML attachment was scrubbed... URL: From Martin.Waite at datacash.com Mon Oct 26 17:40:24 2009 From: Martin.Waite at datacash.com (Martin Waite) Date: Mon, 26 Oct 2009 17:40:24 -0000 Subject: [Linux-cluster] service state unchanged when host crashes Message-ID: Hi, I have 3 VMs running in a cluster. 4 services are defined, one of which ("SENTINEL") is running on clusternode30. I then suspended clusternode30 in the VM console. Cman notices the disappearance within a few seconds. However, the SENTINEL service that was running is still flagged as "started". martin at clusternode28:/usr/share/cluster$ sudo /usr/sbin/clustat Cluster Status for testcluster @ Mon Oct 26 18:33:48 2009 Member Status: Quorate Member Name ID Status ------ ---- ---- ------ clusternode27 27 Online, rgmanager clusternode28 28 Online, Local, rgmanager clusternode30 30 Offline Service Name Owner (Last) State ------- ---- ----- ------ ----- service:SCRIPT clusternode28 started service:SENTINEL clusternode30 started service:VIP clusternode28 started service:mysql_authdb_service clusternode27 started I expected the state of SENTINEL to go to disabled or failed or something - but not remain unchanged. Does anyone have any ideas what I have done wrong please ? regards, Martin From linux at alteeve.com Mon Oct 26 18:33:27 2009 From: linux at alteeve.com (Madison Kelly) Date: Mon, 26 Oct 2009 14:33:27 -0400 Subject: [Linux-cluster] OT? Odd Xen network issue Message-ID: <4AE5EB77.9010602@alteeve.com> Hi all, I hope this isn't too off-topic being more a Xen than cluster experience, but I am having trouble with the Xen mailing list and I gather many here use Xen. If it is too off-topic, the please have a mod delete it and kick my butt. :) I've got a simple two-node cluster, each with three NICs running CentOS 5.3 x86_64. 
Each node has this configuration: eth0 - Internal network and VM back channel (IPMI on this NIC) eth1 - DRBD link eth2 - Internet-facing NIC, dom0 has no IP I've built a simple VM that uses 'xenbr0' and 'xenbr2' as the VM's 'eth0' and 'eth1', respectively. I've set a static, public IP to domU's eth1 (xenbr2) but I can't ping it from a workstation nor can the domU ping out. I know that the network itself (outside the server) is setup right as I tested the static IP from that cable earlier on a test machine. Could someone hit me with a clue-stick as to what I might be doing wrong? Let me know if any logs or config files are helpful. Thanks! Madi From ricks at nerd.com Mon Oct 26 21:01:30 2009 From: ricks at nerd.com (Rick Stevens) Date: Mon, 26 Oct 2009 14:01:30 -0700 Subject: [Linux-cluster] OT? Odd Xen network issue In-Reply-To: <4AE5EB77.9010602@alteeve.com> References: <4AE5EB77.9010602@alteeve.com> Message-ID: <4AE60E2A.1060705@nerd.com> Madison Kelly wrote: > Hi all, > > I hope this isn't too off-topic being more a Xen than cluster > experience, but I am having trouble with the Xen mailing list and I > gather many here use Xen. If it is too off-topic, the please have a mod > delete it and kick my butt. :) > > I've got a simple two-node cluster, each with three NICs running CentOS > 5.3 x86_64. Each node has this configuration: > > eth0 - Internal network and VM back channel (IPMI on this NIC) > eth1 - DRBD link > eth2 - Internet-facing NIC, dom0 has no IP > > I've built a simple VM that uses 'xenbr0' and 'xenbr2' as the VM's > 'eth0' and 'eth1', respectively. I've set a static, public IP to domU's > eth1 (xenbr2) but I can't ping it from a workstation nor can the domU > ping out. I know that the network itself (outside the server) is setup > right as I tested the static IP from that cable earlier on a test machine. > > Could someone hit me with a clue-stick as to what I might be doing > wrong? Let me know if any logs or config files are helpful. First, see if you can ping dom0's NICs? If not, that's the first thing to address. Next, are the domUs NICs up? ("ifconfig -a" will tell you). If not, bring them up. Remember, you might have a fight if the domUs use Gnome's NetworkManager. Make sure it's disabled and that you have the classic configuration set up in the domUs correctly. Verify that first, and we'll move on from there. You might also want to have a look at this: http://wiki.xensource.com/xenwiki/XenNetworking It might help explain things a bit or give you some other ideas. ---------------------------------------------------------------------- - Rick Stevens, Systems Engineer ricks at nerd.com - - AIM/Skype: therps2 ICQ: 22643734 Yahoo: origrps2 - - - - Have you noticed that "human readable" configuration file - - directives are beginning to resemble COBOL code? - ---------------------------------------------------------------------- From arwin.tugade at csun.edu Mon Oct 26 22:32:58 2009 From: arwin.tugade at csun.edu (Arwin L Tugade) Date: Mon, 26 Oct 2009 15:32:58 -0700 Subject: [Linux-cluster] High availability mail server In-Reply-To: <001801ca5605$a90fa340$fb2ee9c0$%murad@its.ws> References: <000d01ca554e$c7e4a160$57ade420$%murad@its.ws> <4AE4C59C.90908@alteeve.com> <001801ca5605$a90fa340$fb2ee9c0$%murad@its.ws> Message-ID: <6708F96BBF31F846BFA56EC0AE37D62284C57D7DA6@CSUN-EX-V01.csun.edu> High avail. Mail? That's what MX records are for. Performance, would be a side effect of multiple MXs. Having it "clustered" wouldn't make mail deliver any quicker. 
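For inbound delivery that is literally just a pair of records along these lines (example.com is a placeholder, obviously):

example.com.    IN  MX  10  mx1.example.com.
example.com.    IN  MX  20  mx2.example.com.

and any sane sending MTA will fall back to the next MX if the preferred one is down.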
Why make something so simple into something complex? Sorry but wrong mailing list. -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Samer ITS Sent: Sunday, October 25, 2009 11:29 PM To: 'linux clustering' Subject: RE: [Linux-cluster] High availability mail server Dear Madi, I have 2 years of working with sun cluster 3.x So the concept is there, so I want to know how linux clustering is work for mail system Because I still planning for the project. The cluster is needed for Performance & high availability. So I need your advice in the planning & the best requirement for best performance (such servers) Already I have SAN storage (EMC Clariion - CX300) But I didn't buy the server until now but may it will be HP servers. Many thanks in advance. -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Madison Kelly Sent: Monday, October 26, 2009 12:40 AM To: linux clustering Subject: Re: [Linux-cluster] High availability mail server Samer ITS wrote: > Dears > > > > I have 4 servers. And I need to install high availability mail system > (Postfix, Dovect, Spam, mailscanner, ..) > > > > Best regards Hi Damer, Hi have to agree with Finnur, if in idea and not tone. :) Clustering is not very easy. It is complex and, as I've learned myself over and over while learning it, it is very easy to break. You need to understand how all the parts tie together so that when you do need to change something, you know how the other components will react. There is no simple canned solution around this -- It's just too small a segment of users. Answer these questions; a) What if your goal with clustering? Performance? Uptime? Scalability? b) These four servers, do they have mechanisms for fencing like IPMI control boards? c) Do you understand terms like Heartbeat, Cluster-aware file systems, iSCSI/SAN/NAS, Fencing, Split-breain, Quorum? d) What operating system and what cluster tools do you plan to/want to use? I'm not trying to discourage you from clustering, honestly. I am, however, trying to keep you from getting in way over your head. I've been playing with clustering for a couple years now and I still constantly run into issues for no other reason than that I simply didn't understand the systems well enough. As for your specific application; Have you though about how you are going to handle user authentication? Will you need to support multiple domains? You have a *lot* of variables you need to sort out in the mail system alone before you can start worrying about the cluster component. If you have a short time line to launch, hire a consultant who is already familiar with clustering. If you have lots of time, start reading and then come back here with specific questions, including steps you've tried, specific errors you've run into and so on. You will find this list very helpful when you show that you've put some effort into your system already. Best of luck! 
Madi -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From gordon.k.miller at boeing.com Mon Oct 26 22:47:55 2009 From: gordon.k.miller at boeing.com (Miller, Gordon K) Date: Mon, 26 Oct 2009 15:47:55 -0700 Subject: [Linux-cluster] GS2 try_rgrp_unlink consuming lots of CPU In-Reply-To: <1256568389.6052.676.camel@localhost.localdomain> References: <1256568389.6052.676.camel@localhost.localdomain> Message-ID: When making our GFS2 filesystems we are using default values with the exception of the journal size which we have set to 16MB. Our resource groups are 443 MB in size for this filesystem. I do not believe that we have the case of unlinking inodes from one node while it is still open on another. Under what conditions would try_rgrp_unlink return the same inode when called repeatedly in a short time frame as seen in the original problem description? I am unable to correlate any call to gfs2_unlink on any node in the cluster with the inodes that try_rgrp_unlink is returning. Gordon From ray at oneunified.net Mon Oct 26 23:54:51 2009 From: ray at oneunified.net (Ray Burkholder) Date: Mon, 26 Oct 2009 20:54:51 -0300 Subject: [Linux-cluster] High availability mail server In-Reply-To: <6708F96BBF31F846BFA56EC0AE37D62284C57D7DA6@CSUN-EX-V01.csun.edu> References: <000d01ca554e$c7e4a160$57ade420$%murad@its.ws> <4AE4C59C.90908@alteeve.com> <001801ca5605$a90fa340$fb2ee9c0$%murad@its.ws> <6708F96BBF31F846BFA56EC0AE37D62284C57D7DA6@CSUN-EX-V01.csun.edu> Message-ID: <0e5401ca5697$b62e0480$228a0d80$@net> > > High avail. Mail? That's what MX records are for. Performance, would > be a side effect of multiple MXs. Having it "clustered" wouldn't make > mail deliver any quicker. Why make something so simple into something > complex? > Mail delivery and MX records are easy. But once mail is received, you have to get it to user's mail boxes, and users have to gain access to the repository. The repository should be 'highly available' in some fashion: partitioned storage units, redundant storage, replicated storage, backup storage, or whatever. I believe that is the hard bit: making the repository 'highly available'. How do people do it? -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From ricks at nerd.com Tue Oct 27 00:25:00 2009 From: ricks at nerd.com (Rick Stevens) Date: Mon, 26 Oct 2009 17:25:00 -0700 Subject: [Linux-cluster] High availability mail server In-Reply-To: <0e5401ca5697$b62e0480$228a0d80$@net> References: <000d01ca554e$c7e4a160$57ade420$%murad@its.ws> <4AE4C59C.90908@alteeve.com> <001801ca5605$a90fa340$fb2ee9c0$%murad@its.ws> <6708F96BBF31F846BFA56EC0AE37D62284C57D7DA6@CSUN-EX-V01.csun.edu> <0e5401ca5697$b62e0480$228a0d80$@net> Message-ID: <4AE63DDC.6050400@nerd.com> Ray Burkholder wrote: >> High avail. Mail? That's what MX records are for. Performance, would >> be a side effect of multiple MXs. Having it "clustered" wouldn't make >> mail deliver any quicker. Why make something so simple into something >> complex? >> > > Mail delivery and MX records are easy. But once mail is received, you have > to get it to user's mail boxes, and users have to gain access to the > repository. The repository should be 'highly available' in some fashion: > partitioned storage units, redundant storage, replicated storage, backup > storage, or whatever. 
I believe that is the hard bit: making the > repository 'highly available'. > > How do people do it? It rather depends on which services you're going to offer. An NFS volume as a mailstore will work, but doesn't play nicely with multiple POP3 servers accessing it, and it's slow. Lots of POP servers rewrite the entire mailbox when a user logs in (copy it to a working file, futz with that, then rewrite the original when they log out). With IMAP it's OK but still not ideal. The primary problem is the lack of file locking. GFS can address this, but only if the servers recognize locks on the file system (many POP servers have to be recompiled to use this, qpopper being one). If the only locking they do is purely in-kernel and won't deal with a lock manager or filesystem-level locks, then multiple machines will collide on the mailstore and there really isn't any way around it. My recommendation...get and install Cyrus IMAP and POP services, use GFS and a good lock manager (e.g. DLM). That should work. (And yes, I used to manage a system with about 75,000 active users and over 2M mail accounts--I feel your pain). ---------------------------------------------------------------------- - Rick Stevens, Systems Engineer ricks at nerd.com - - AIM/Skype: therps2 ICQ: 22643734 Yahoo: origrps2 - - - - "Very funny, Scotty. Now beam down my clothes." - ---------------------------------------------------------------------- From gordan at bobich.net Tue Oct 27 00:50:46 2009 From: gordan at bobich.net (Gordan Bobic) Date: Tue, 27 Oct 2009 00:50:46 +0000 Subject: [Linux-cluster] High availability mail server In-Reply-To: <0e5401ca5697$b62e0480$228a0d80$@net> References: <000d01ca554e$c7e4a160$57ade420$%murad@its.ws> <4AE4C59C.90908@alteeve.com> <001801ca5605$a90fa340$fb2ee9c0$%murad@its.ws> <6708F96BBF31F846BFA56EC0AE37D62284C57D7DA6@CSUN-EX-V01.csun.edu> <0e5401ca5697$b62e0480$228a0d80$@net> Message-ID: <4AE643E6.7060101@bobich.net> On 26/10/2009 23:54, Ray Burkholder wrote: >> >> High avail. Mail? That's what MX records are for. Performance, would >> be a side effect of multiple MXs. Having it "clustered" wouldn't make >> mail deliver any quicker. Why make something so simple into something >> complex? >> > > Mail delivery and MX records are easy. But once mail is received, you have > to get it to user's mail boxes, and users have to gain access to the > repository. The repository should be 'highly available' in some fashion: > partitioned storage units, redundant storage, replicated storage, backup > storage, or whatever. I believe that is the hard bit: making the > repository 'highly available'. > > How do people do it? Here are some options you have: 1) Use a NAS/NFS box for shared storage - not really a solution for high availability per se, as this becomes a SPOF unless you mirror it somehow in realtime. Performance over NFS will not be great even in a high state of tune due to latency overheads. 2) Use a SAN with a clustered file system for shared storage. Again, not really a solution for high availability unless the SAN itself is mirrored, plus the performance will not be great especially with a lot of concurrent users due to locking latencies. 3) Use a SAN with exclusively mounted non-shared file system (e.g. ext3). Performance should be reasonably good in this case because there is no locking latency overheads or lack of efficient caching. Note, however, that you will have to ensure in your cluster configuration that this ext3 volume is a service that can only be active on one machine at a time. 
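Roughly this shape in cluster.conf (a sketch from memory, not a drop-in config - the service/fs names, device and mount point are placeholders, and in practice you would hang the service IP and the mail daemons off the same service):

<service name="mailstore" autostart="1" recovery="relocate">
    <fs name="mailfs" device="/dev/mapper/san_mail" mountpoint="/var/spool/mail"
        fstype="ext3" force_unmount="1" self_fence="1"/>
</service>

The intent of force_unmount/self_fence there is to make sure the volume never stays mounted on a node the service is being taken away from.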
If it ends up accidentally multi-mounted, your data will be gone in a matter of seconds. 2b) Split your user data up in such a way that a particular user will always hit a particular server (unless that server fails), and all the data for users on that server goes to a particular volume, or subtree of a cluster file system (e.g. GFS). This will ensure that all locks for that subtree can be cached on that server, to overcome the locking latency overheads. In options 2 and 3 you could use DRBD instead of a SAN, which would give you advantages of mirroring data between servers and not needing a SAN (this ought to reduce your budget requirements to a small fraction of what it would be with a SAN). Two birds with one stone. You could also use GlusterFS for your mirrored data storage (fuse based, backed by a normal file system, doesn't live on a raw block device). Performance is similar to NFS, but be advised, you'll need to test it for your use case as it is till a bit buggy. There is also another option, that doesn't involve block level or file system level mirroring - DBMail. You can back your mail storage in an SQL database rather Maildir. Point it at MySQL, set up MySQL replication, and you're good to go. At this point you may be thinking about master-master replication and sharing load between the servers. This would be unreliable due to the race conditions inherent in MySQL's master-master replication. You won't lose data, but mail clients assume that the message IDs always go up. That means of two messages get delivered in quick succession, the app might see the later message delivered to the local server, but not the earlier message that got delivered to the other server that hasn't replicated yet. Next time it checks for updates in the inbox, it'll not spot the other message with a lower message ID! The client would have to purge local caches and resync data to see the missing message. This means that with this solution you would still have to run it in fail-over mode (even if both MySQL instances would run at the same time to achieve real-time data mirroring). The only way you could overcome this with MySQL is to use NDB tables, but that brings you back to clustered storage performance issues (performance on NDB tables is pretty attrocious compared to the likes of MyISAM and InnoDB). Anyway, that should be enough to get you started. Gordan From gordan at bobich.net Tue Oct 27 01:00:36 2009 From: gordan at bobich.net (Gordan Bobic) Date: Tue, 27 Oct 2009 01:00:36 +0000 Subject: [Linux-cluster] High availability mail server In-Reply-To: <4AE63DDC.6050400@nerd.com> References: <000d01ca554e$c7e4a160$57ade420$%murad@its.ws> <4AE4C59C.90908@alteeve.com> <001801ca5605$a90fa340$fb2ee9c0$%murad@its.ws> <6708F96BBF31F846BFA56EC0AE37D62284C57D7DA6@CSUN-EX-V01.csun.edu> <0e5401ca5697$b62e0480$228a0d80$@net> <4AE63DDC.6050400@nerd.com> Message-ID: <4AE64634.4000209@bobich.net> On 27/10/2009 00:25, Rick Stevens wrote: > Ray Burkholder wrote: >>> High avail. Mail? That's what MX records are for. Performance, would >>> be a side effect of multiple MXs. Having it "clustered" wouldn't make >>> mail deliver any quicker. Why make something so simple into something >>> complex? >>> >> >> Mail delivery and MX records are easy. But once mail is received, you >> have >> to get it to user's mail boxes, and users have to gain access to the >> repository. 
The repository should be 'highly available' in some fashion: >> partitioned storage units, redundant storage, replicated storage, backup >> storage, or whatever. I believe that is the hard bit: making the >> repository 'highly available'. >> >> How do people do it? > > It rather depends on which services you're going to offer. An NFS > volume as a mailstore will work, but doesn't play nicely with multiple > POP3 servers accessing it, and it's slow. Lots of POP servers rewrite > the entire mailbox when a user logs in (copy it to a working file, futz > with that, then rewrite the original when they log out). With IMAP it's > OK but still not ideal. POP3 can be Maildir backed, just like IMAP, so the performance would be similar. Ironically, if you're using mbox format rather than maildir, the performance may well be better on GFS with concurrent access due to fewer files, up to the point where the bandwidth becomes dominant to latency in the performance equation. > The primary problem is the lack of file locking. GFS can address this, > but only if the servers recognize locks on the file system (many POP > servers have to be recompiled to use this, qpopper being one). Mail servers like Dovecot (both POP3 and IMAP) use Maildir, and this was originally specifically designed to play nice with NFS and locking. > If the only locking they do is purely in-kernel and won't deal with a > lock manager or filesystem-level locks, then multiple machines will > collide on the mailstore and there really isn't any way around it. Maildir uses flock which plays nice with NFS provided you have locking enabled. > My recommendation...get and install Cyrus IMAP and POP services, use > GFS and a good lock manager (e.g. DLM). That should work. You don't get a choice of lock manager on GFS any more. DLM has been the only production option for a long time. The problem is that heavy use (lots of concurrent users) with Maildir on GFS will perform rather poorly unless you logically partition your users in some way (e.g. have a custom proxy in front of it that does load balancing between server nodes based on the hash of the username). This would ensure that the locks can be sanely cached on the server nodes rather than be bounced around all the time. Gordan From edsonmarquezani at gmail.com Tue Oct 27 01:53:38 2009 From: edsonmarquezani at gmail.com (Edson Marquezani Filho) Date: Mon, 26 Oct 2009 23:53:38 -0200 Subject: [Linux-cluster] OT? Odd Xen network issue In-Reply-To: <4AE5EB77.9010602@alteeve.com> References: <4AE5EB77.9010602@alteeve.com> Message-ID: <2fc5f090910261853x45304dbemfe7fc78fb689a35d@mail.gmail.com> On Mon, Oct 26, 2009 at 16:33, Madison Kelly wrote: > Hi all, > > ?I hope this isn't too off-topic being more a Xen than cluster experience, > but I am having trouble with the Xen mailing list and I gather many here use > Xen. If it is too off-topic, the please have a mod delete it and kick my > butt. :) > > I've got a simple two-node cluster, each with three NICs running CentOS 5.3 > x86_64. Each node has this configuration: > > eth0 - Internal network and VM back channel (IPMI on this NIC) > eth1 - DRBD link > eth2 - Internet-facing NIC, dom0 has no IP I have a very similar setup here. > ?I've built a simple VM that uses 'xenbr0' and 'xenbr2' as the VM's 'eth0' > and 'eth1', respectively. I've set a static, public IP to domU's eth1 > (xenbr2) but I can't ping it from a workstation nor can the domU ping out. 
I > know that the network itself (outside the server) is setup right as I tested > the static IP from that cable earlier on a test machine.

This sounds to me like it's related only to routing stuff, rather than Xen. I don't know if what I'm going to say is already implicit in what you said, or very obvious to you, but to reach that public IP address in question, the workstation from which you are trying to do this must have a route that leads to this IP, usually a gateway address. Try thinking about the path the packet follows across the network, and whether there are enough routing rules for it to reach its destination. Even from dom0, you are subject to the same rules.

For example: to ping some public IP address assigned to a domU's eth1, you should have either an IP of the same network assigned to your physical eth2/peth2 (or a virtual ethX interface attached to the xenbr2 bridge), or routes that lead your packet from your dom0 to that domU's IP.

> Could someone hit me with a clue-stick as to what I might be doing wrong? > Let me know if any logs or config files are helpful.

I'm not sure if I understood your scenario correctly, and I don't know if I could be clear enough with this bad English, but at least, I tried. =)

> Thanks!

I hope this can help a bit. Good luck. =)

From orkcu at yahoo.com  Tue Oct 27 03:06:39 2009
From: orkcu at yahoo.com (Roger Pena Escobio)
Date: Mon, 26 Oct 2009 20:06:39 -0700 (PDT)
Subject: [Linux-cluster] High availability mail server
In-Reply-To: <4AE643E6.7060101@bobich.net>
Message-ID: <459616.84400.qm@web88301.mail.re4.yahoo.com>

--- On Mon, 10/26/09, Gordan Bobic wrote: > From: Gordan Bobic > Subject: Re: [Linux-cluster] High availability mail server > To: "linux clustering" > Received: Monday, October 26, 2009, 8:50 PM > On 26/10/2009 23:54, Ray Burkholder > wrote: > >> > >> High avail. Mail? That's what MX records are > for. Performance, would > >> be a side effect of multiple MXs. Having it > "clustered" wouldn't make > >> mail deliver any quicker. Why make something > so simple into something > >> complex? > >> > > > > Mail delivery and MX records are easy. But once > mail is received, you have > > to get it to user's mail boxes, and users have to gain > access to the > > repository. The repository should be 'highly > available' in some fashion: > > partitioned storage units, redundant storage, > replicated storage, backup > > storage, or whatever. I believe that is the hard > bit: making the > > repository 'highly available'. > > > > How do people do it? > > Here are some options you have: > > 1) Use a NAS/NFS box for shared storage - not really a > solution for high availability per se, as this becomes a > SPOF unless you mirror it somehow in realtime. Performance > over NFS will not be great even in a high state of tune due > to latency overheads. > > 2) Use a SAN with a clustered file system for shared > storage. Again, not really a solution for high availability > unless the SAN itself is mirrored, plus the performance will > not be great especially with a lot of concurrent users due > to locking latencies. > > 3) Use a SAN with exclusively mounted non-shared file > system (e.g. ext3). Performance should be reasonably good in > this case because there is no locking latency overheads or > lack of efficient caching. Note, however, that you will have > to ensure in your cluster configuration that this ext3 > volume is a service that can only be active on one machine > at a time.
If it ends up accidentally multi-mounted, your > data will be gone in a matter of seconds. > > 2b) Split your user data up in such a way that a particular > user will always hit a particular server (unless that server > fails), and all the data for users on that server goes to a > particular volume, or subtree of a cluster file system (e.g. > GFS). This will ensure that all locks for that subtree can > be cached on that server, to overcome the locking latency > overheads. what about using a combination of 3 and 2b: 3b- split your users in a set of servers which use ext3 FS but are part of a cluster, the servers are really services of a cluster (IP and FS are resources of a CLuster Service) so, if a server fail its service can be migrated to another node of the cluster let say for example users starting with letter 'a' to users starting with letter 'h' will "assigned" to MailA , users from 'i' to 'z' will be assigned to MailB. MAilA --> ipA and filesystemA MAilB --> ipB and filesystemB Cluster ServiceA will have resource ipA and filesystemA Cluster ServiceB will have resource ipB and filesystemB and ServiceA will be configured to run in nodeA, while ServiceB will be set to run in nodeB of the cluster, but will be set to failover to nodeC (standby server) the hard part of this is how to balance the users between MailA and MailB (and MailC , D, E). Changing the value of "mail_host" in user attr (if using a Directory Service) and moving user's email from one filesystem to another this is just food for the brain, the scenario could be as complex as you like :-), but definitely is no good idea to have GFS for mail servers if the clients can connected from multiple sources and dont have a "proxy" to tunnel all request for same user to same backend server. thanks roger From gordan at bobich.net Tue Oct 27 09:01:26 2009 From: gordan at bobich.net (Gordan Bobic) Date: Tue, 27 Oct 2009 09:01:26 +0000 Subject: [Linux-cluster] High availability mail server In-Reply-To: <459616.84400.qm@web88301.mail.re4.yahoo.com> References: <459616.84400.qm@web88301.mail.re4.yahoo.com> Message-ID: <4AE6B6E6.5030708@bobich.net> Roger Pena Escobio wrote: > --- On Mon, 10/26/09, Gordan Bobic wrote: > >> From: Gordan Bobic >> Subject: Re: [Linux-cluster] High availability mail server >> To: "linux clustering" >> Received: Monday, October 26, 2009, 8:50 PM >> On 26/10/2009 23:54, Ray Burkholder >> wrote: >>>> High avail. Mail? That's what MX records are >> for. Performance, would >>>> be a side effect of multiple MXs. Having it >> "clustered" wouldn't make >>>> mail deliver any quicker. Why make something >> so simple into something >>>> complex? >>>> >>> Mail delivery and MX records are easy. But once >> mail is received, you have >>> to get it to user's mail boxes, and users have to gain >> access to the >>> repository. The repository should be 'highly >> available' in some fashion: >>> partitioned storage units, redundant storage, >> replicated storage, backup >>> storage, or whatever. I believe that is the hard >> bit: making the >>> repository 'highly available'. >>> >>> How do people do it? >> Here are some options you have: >> >> 1) Use a NAS/NFS box for shared storage - not really a >> solution for high availability per se, as this becomes a >> SPOF unless you mirror it somehow in realtime. Performance >> over NFS will not be great even in a high state of tune due >> to latency overheads. >> >> 2) Use a SAN with a clustered file system for shared >> storage. 
Again, not really a solution for high availability >> unless the SAN itself is mirrored, plus the performance will >> not be great especially with a lot of concurrent users due >> to locking latencies. >> >> 3) Use a SAN with exclusively mounted non-shared file >> system (e.g. ext3). Performance should be reasonably good in >> this case because there is no locking latency overheads or >> lack of efficient caching. Note, however, that you will have >> to ensure in your cluster configuration that this ext3 >> volume is a service that can only be active on one machine >> at a time. If it ends up accidentally multi-mounted, your >> data will be gone in a matter of seconds. >> >> 2b) Split your user data up in such a way that a particular >> user will always hit a particular server (unless that server >> fails), and all the data for users on that server goes to a >> particular volume, or subtree of a cluster file system (e.g. >> GFS). This will ensure that all locks for that subtree can >> be cached on that server, to overcome the locking latency >> overheads. > > what about using a combination of 3 and 2b: > 3b- split your users in a set of servers which use ext3 FS but are part of a cluster, the servers are really services of a cluster (IP and FS are resources of a CLuster Service) so, if a server fail its service can be migrated to another node of the cluster There's no problem with that, but 2b avoids the extra care having to be taken that the ext3 volume is only ever mounted on one node (i.e. the scope for total data loss through such an error condition is eliminated), while it will still give you nearly the same performance because DLM caches locks (i.e. it avoids the 40ns->100us latency penalty on cached locks). Gordan From jakov.sosic at srce.hr Tue Oct 27 09:36:37 2009 From: jakov.sosic at srce.hr (Jakov Sosic) Date: Tue, 27 Oct 2009 10:36:37 +0100 Subject: [Linux-cluster] High availability mail server In-Reply-To: <4AE56653.4060508@bobich.net> References: <000d01ca554e$c7e4a160$57ade420$%murad@its.ws> <4AE4C59C.90908@alteeve.com> <001801ca5605$a90fa340$fb2ee9c0$%murad@its.ws> <4AE56653.4060508@bobich.net> Message-ID: <20091027103637.4c59521e@nb-jsosic> On Mon, 26 Oct 2009 09:05:23 +0000 Gordan Bobic wrote: > This may well be fine when you are dealing with large-ish files and > your workload is arranged in such a way that accesses to particular > data subtrees is typically executed on only one node at a time So If I organize my cluster nodes that every one of them accesses one specific subtree directory of GFS/GFS2/OCFS2 partition, I would't get so slow performance like if the access patterns were all mixed up? -- | Jakov Sosic | ICQ: 28410271 | PGP: 0x965CAE2D | ================================================================= | start fighting cancer -> http://www.worldcommunitygrid.org/ | From jakov.sosic at srce.hr Tue Oct 27 09:38:23 2009 From: jakov.sosic at srce.hr (Jakov Sosic) Date: Tue, 27 Oct 2009 10:38:23 +0100 Subject: [Linux-cluster] service state unchanged when host crashes In-Reply-To: References: Message-ID: <20091027103823.4f23993d@nb-jsosic> On Mon, 26 Oct 2009 17:40:24 -0000 "Martin Waite" wrote: > Hi, > > I have 3 VMs running in a cluster. 4 services are defined, one of > which ("SENTINEL") is running on clusternode30. > > I then suspended clusternode30 in the VM console. Cman notices the > disappearance within a few seconds. However, the SENTINEL service > that was running is still flagged as "started". 
Could you please post your /var/log/messages when one node is fenced? Also, are you using Debian/Ubuntu by any chance? -- | Jakov Sosic | ICQ: 28410271 | PGP: 0x965CAE2D | ================================================================= | start fighting cancer -> http://www.worldcommunitygrid.org/ | From swhiteho at redhat.com Tue Oct 27 09:39:25 2009 From: swhiteho at redhat.com (Steven Whitehouse) Date: Tue, 27 Oct 2009 09:39:25 +0000 Subject: [Linux-cluster] High availability mail server In-Reply-To: <20091027103637.4c59521e@nb-jsosic> References: <000d01ca554e$c7e4a160$57ade420$%murad@its.ws> <4AE4C59C.90908@alteeve.com> <001801ca5605$a90fa340$fb2ee9c0$%murad@its.ws> <4AE56653.4060508@bobich.net> <20091027103637.4c59521e@nb-jsosic> Message-ID: <1256636365.2669.9.camel@localhost.localdomain> Hi, On Tue, 2009-10-27 at 10:36 +0100, Jakov Sosic wrote: > On Mon, 26 Oct 2009 09:05:23 +0000 > Gordan Bobic wrote: > > > This may well be fine when you are dealing with large-ish files and > > your workload is arranged in such a way that accesses to particular > > data subtrees is typically executed on only one node at a time > > So If I organize my cluster nodes that every one of them accesses one > specific subtree directory of GFS/GFS2/OCFS2 partition, I would't get > so slow performance like if the access patterns were all mixed up? > > Yes, thats the correct solution to the caching issue, Steve. From swhiteho at redhat.com Tue Oct 27 09:57:06 2009 From: swhiteho at redhat.com (Steven Whitehouse) Date: Tue, 27 Oct 2009 09:57:06 +0000 Subject: [Linux-cluster] GS2 try_rgrp_unlink consuming lots of CPU In-Reply-To: References: <1256568389.6052.676.camel@localhost.localdomain> Message-ID: <1256637426.2669.26.camel@localhost.localdomain> Hi, On Mon, 2009-10-26 at 15:47 -0700, Miller, Gordon K wrote: > When making our GFS2 filesystems we are using default values with the exception of the journal size which we have set to 16MB. Our resource groups are 443 MB in size for this filesystem. > > I do not believe that we have the case of unlinking inodes from one node while it is still open on another. > > Under what conditions would try_rgrp_unlink return the same inode when called repeatedly in a short time frame as seen in the original problem description? I am unable to correlate any call to gfs2_unlink on any node in the cluster with the inodes that try_rgrp_unlink is returning. > > Gordon > It depends which kernel version you have. In earlier kernels it tried to deallocate inodes in an rgrp only once for each mount of the filesystem. That proved to cause a problem for some configurations where we were not aggressive enough in reclaiming free space. As a result, the algorithm was updated to scan more often. However in both cases, it was designed to always make progress and not continue to rescan the same inode, so something very odd is going on. The only reason that an inode would be repeatedly scanned is that it has been unlinked somewhere (since the scanning is looking only for unlinked inodes) and cannot be deallocated for some reason (i.e. still in use) and thus is still there when the next scan comes along. Steve. From Martin.Waite at datacash.com Tue Oct 27 09:57:50 2009 From: Martin.Waite at datacash.com (Martin Waite) Date: Tue, 27 Oct 2009 09:57:50 -0000 Subject: [Linux-cluster] service state unchanged when host crashes In-Reply-To: <20091027103823.4f23993d@nb-jsosic> References: <20091027103823.4f23993d@nb-jsosic> Message-ID: Hi Jakov I am running Debian Lenny 64-bit. 
Is that going to be a problem for me ? I think you have given me enough of a pointer - ie. I haven't configured fencing properly - to get me going again. Thanks. regards, Martin ==== Just out of interest, here are the logs: Here is the syslog from clusternode28 when I suspended clusternode30: Oct 26 18:29:51 clusternode28 clurgmgrd[3980]: Membership Change Event Oct 26 18:29:51 clusternode28 clurgmgrd[3980]: State change: clusternode30 DOWN Oct 26 18:29:51 clusternode28 clurgmgrd[3980]: Membership Change Event Oct 26 18:29:51 clusternode28 clurgmgrd[3980]: Membership Change Event Oct 26 18:29:51 clusternode28 fenced[16118]: fencing deferred to clusternode27 Then, on clusternode27: Oct 26 18:29:52 clusternode27 kernel: [438082.708458] dlm: closing connection to node 30 Oct 26 18:29:52 clusternode27 clurgmgrd[20955]: Membership Change Event Oct 26 18:29:52 clusternode27 clurgmgrd[20955]: State change: clusternode30 DOWN Oct 26 18:29:52 clusternode27 clurgmgrd[20955]: Membership Change Event Oct 26 18:29:52 clusternode27 clurgmgrd[20955]: Membership Change Event Oct 26 18:29:52 clusternode27 fenced[12749]: clusternode30 not a cluster member after 0 sec post_fail_delay Oct 26 18:29:52 clusternode27 fenced[12749]: fencing node "clusternode30" Oct 26 18:29:52 clusternode27 fenced[12749]: fence "clusternode30" failed Oct 26 18:29:57 clusternode27 fenced[12749]: fencing node "clusternode30" Oct 26 18:29:57 clusternode27 fenced[12749]: fence "clusternode30" failed Oct 26 18:30:02 clusternode27 fenced[12749]: fencing node "clusternode30" Oct 26 18:30:02 clusternode27 fenced[12749]: fence "clusternode30" failed ... and so on ... I haven't configured fencing properly, have I ? When I un-suspended clusternode30 (15 hours later), cman on clusternode27 throws an error and quits: Oct 27 10:50:01 clusternode27 fenced[12749]: fencing node "clusternode30" Oct 27 10:50:01 clusternode27 fenced[12749]: fence "clusternode30" failed Oct 27 10:50:05 clusternode27 clurgmgrd[20955]: Membership Change Event Oct 27 10:50:05 clusternode27 clurgmgrd[20955]: Membership Change Event Oct 27 10:50:06 clusternode27 fenced[12749]: fencing node "clusternode30" Oct 27 10:50:06 clusternode27 fenced[12749]: fence "clusternode30" failed Oct 27 10:50:11 clusternode27 fenced[12749]: fencing node "clusternode30" Oct 27 10:50:11 clusternode27 fenced[12749]: fence "clusternode30" failed Oct 27 10:50:16 clusternode27 fenced[12749]: fencing node "clusternode30" Oct 27 10:50:16 clusternode27 fenced[12749]: fence "clusternode30" failed Oct 27 10:50:20 clusternode27 openais[12741]: CMAN: Joined a cluster with disallowed nodes. must die Oct 27 10:50:20 clusternode27 kernel: [496910.220602] dlm: closing connection to node 28 Oct 27 10:50:20 clusternode27 kernel: [496910.220710] dlm: closing connection to node 27 Oct 27 10:50:20 clusternode27 dlm_controld[12751]: cluster is down, exiting Oct 27 10:50:20 clusternode27 gfs_controld[12753]: groupd_dispatch error -1 errno 11 Oct 27 10:50:20 clusternode27 gfs_controld[12753]: groupd connection died Oct 27 10:50:20 clusternode27 gfs_controld[12753]: cluster is down, exiting Oct 27 10:50:47 clusternode27 ccsd[12736]: Unable to connect to cluster infrastructure after 30 seconds. 
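One lab-only way past this stage - used later in this thread - is a stub fence agent that simply reports success, so that fenced stops looping and rgmanager can carry on with recovery. Never use it in production, since nothing actually cuts the dead node off. A rough sketch (the file name and install path are invented, this is not a shipped agent):

#!/bin/bash
# /sbin/fence_noop - lab-only stub fence agent that always "succeeds".
# fenced passes its parameters as key=value lines on stdin (agent, nodename, ...);
# log whatever arrives so each fencing call is visible in syslog.
while read line; do
    logger -t fence_noop "args: $line"
done
# Exit status 0 is how a fence agent reports success back to fenced.
exit 0

Hooked in through a fencedevice entry whose agent attribute points at it, this makes the "fencing node ..." step report success instead of failing every five seconds.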
-----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jakov Sosic Sent: 27 October 2009 09:38 To: linux-cluster at redhat.com Subject: Re: [Linux-cluster] service state unchanged when host crashes On Mon, 26 Oct 2009 17:40:24 -0000 "Martin Waite" wrote: > Hi, > > I have 3 VMs running in a cluster. 4 services are defined, one of > which ("SENTINEL") is running on clusternode30. > > I then suspended clusternode30 in the VM console. Cman notices the > disappearance within a few seconds. However, the SENTINEL service > that was running is still flagged as "started". Could you please post your /var/log/messages when one node is fenced? Also, are you using Debian/Ubuntu by any chance? -- | Jakov Sosic | ICQ: 28410271 | PGP: 0x965CAE2D | ================================================================= | start fighting cancer -> http://www.worldcommunitygrid.org/ | -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From johannes.russek at io-consulting.net Tue Oct 27 10:44:34 2009 From: johannes.russek at io-consulting.net (jr) Date: Tue, 27 Oct 2009 11:44:34 +0100 Subject: [Linux-cluster] High availability mail server In-Reply-To: <459616.84400.qm@web88301.mail.re4.yahoo.com> References: <459616.84400.qm@web88301.mail.re4.yahoo.com> Message-ID: <1256640274.2521.4.camel@dell-jr.intern.win-rar.com> The good news is that the Cyrus IMAP Server already has a solution for that, it's called "Murder" and "Aggregator": http://cyrusimap.web.cmu.edu/ag.html regards, Johannes > what about using a combination of 3 and 2b: > 3b- split your users in a set of servers which use ext3 FS but are part of a cluster, the servers are really services of a cluster (IP and FS are resources of a CLuster Service) so, if a server fail its service can be migrated to another node of the cluster > > let say for example > users starting with letter 'a' to users starting with letter 'h' will "assigned" to MailA , users from 'i' to 'z' will be assigned to MailB. > MAilA --> ipA and filesystemA > MAilB --> ipB and filesystemB > > Cluster ServiceA will have resource ipA and filesystemA > Cluster ServiceB will have resource ipB and filesystemB > > and ServiceA will be configured to run in nodeA, while ServiceB will be set to run in nodeB of the cluster, but will be set to failover to nodeC (standby server) > > the hard part of this is how to balance the users between MailA and MailB (and MailC , D, E). Changing the value of "mail_host" in user attr (if using a Directory Service) and moving user's email from one filesystem to another > > this is just food for the brain, the scenario could be as complex as you like :-), but definitely is no good idea to have GFS for mail servers if the clients can connected from multiple sources and dont have a "proxy" to tunnel all request for same user to same backend server. > > thanks > roger > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From gordan at bobich.net Tue Oct 27 11:06:45 2009 From: gordan at bobich.net (Gordan Bobic) Date: Tue, 27 Oct 2009 11:06:45 +0000 Subject: [Linux-cluster] High availability mail server Message-ID: <20091027110645.B1B53F87C4@sentinel1.shatteredsilicon.net> > the hard part of this is how to balance the users between MailA and MailB (and MailC , D, E). 
Changing the value of "mail_host" in user attr (if using a Directory Service) and moving user's email from one filesystem to another You don't have to if you have Maildirs on GFS. Each user will have a Maildir subtree. All you need to do is ensure that user X goes to a fixed server(X). This will ensure that only that server ever holds locks on Maildir(X), thus ensuring that locks will have local RAM latency instead of network ping time. If server(X) fails, another machine in the cluster will assume it's IP address and become it. On first access to the subtrees there'll be a latency penalty on the server that took over, but thereafter it'll run at local speeds, especially if you are running on DRBD rather than a SAN. Gordan From linux-cluster at lists.grepular.com Tue Oct 27 11:07:09 2009 From: linux-cluster at lists.grepular.com (Mike Cardwell) Date: Tue, 27 Oct 2009 11:07:09 +0000 Subject: [Linux-cluster] High availability mail server In-Reply-To: <459616.84400.qm@web88301.mail.re4.yahoo.com> References: <459616.84400.qm@web88301.mail.re4.yahoo.com> Message-ID: <4AE6D45D.4080902@lists.grepular.com> Roger Pena Escobio wrote: >>> How do people do it? >> Here are some options you have: >> >> 1) Use a NAS/NFS box for shared storage - not really a >> solution for high availability per se, as this becomes a >> SPOF unless you mirror it somehow in realtime. Performance >> over NFS will not be great even in a high state of tune due >> to latency overheads. >> >> 2) Use a SAN with a clustered file system for shared >> storage. Again, not really a solution for high availability >> unless the SAN itself is mirrored, plus the performance will >> not be great especially with a lot of concurrent users due >> to locking latencies. >> >> 3) Use a SAN with exclusively mounted non-shared file >> system (e.g. ext3). Performance should be reasonably good in >> this case because there is no locking latency overheads or >> lack of efficient caching. Note, however, that you will have >> to ensure in your cluster configuration that this ext3 >> volume is a service that can only be active on one machine >> at a time. If it ends up accidentally multi-mounted, your >> data will be gone in a matter of seconds. >> >> 2b) Split your user data up in such a way that a particular >> user will always hit a particular server (unless that server >> fails), and all the data for users on that server goes to a >> particular volume, or subtree of a cluster file system (e.g. >> GFS). This will ensure that all locks for that subtree can >> be cached on that server, to overcome the locking latency >> overheads. > > what about using a combination of 3 and 2b: > 3b- split your users in a set of servers which use ext3 FS but are part of a cluster, the servers are really services of a cluster (IP and FS are resources of a CLuster Service) so, if a server fail its service can be migrated to another node of the cluster > > let say for example > users starting with letter 'a' to users starting with letter 'h' will "assigned" to MailA , users from 'i' to 'z' will be assigned to MailB. > MAilA --> ipA and filesystemA > MAilB --> ipB and filesystemB > > Cluster ServiceA will have resource ipA and filesystemA > Cluster ServiceB will have resource ipB and filesystemB > > and ServiceA will be configured to run in nodeA, while ServiceB will be set to run in nodeB of the cluster, but will be set to failover to nodeC (standby server) > > the hard part of this is how to balance the users between MailA and MailB (and MailC , D, E). 
Changing the value of "mail_host" in user attr (if using a Directory Service) and moving user's email from one filesystem to another > > this is just food for the brain, the scenario could be as complex as you like :-), but definitely is no good idea to have GFS for mail servers if the clients can connected from multiple sources and dont have a "proxy" to tunnel all request for same user to same backend server. We use a mail system called CommuniGate Pro here in cluster mode. The storage is mounted onto all four of our servers via NFS. Any account can be opened on any of those four servers, but the cluster prevents that account from being opened on more than one server at a time. It forwards connections internally to the correct server. There are probably other mail servers out there that have internal clustering functionality that deals with these problems. -- Mike Cardwell - IT Consultant and LAMP developer Cardwell IT Ltd. (UK Reg'd Company #06920226) http://cardwellit.com/ Technical Blog: https://secure.grepular.com/blog/ From ccaulfie at redhat.com Tue Oct 27 11:07:28 2009 From: ccaulfie at redhat.com (Christine Caulfield) Date: Tue, 27 Oct 2009 11:07:28 +0000 Subject: [Linux-cluster] RFC: Configuration of Red Hat clusters in Cluster 3 Message-ID: <4AE6D470.9010700@redhat.com> Hi all, I've been thinking about configuring clusters, particularly in a virtualised environment. The solutions we have for this are not especially good I don't think. We currently have two loaders: xmlconfig is the default and reads from /etc/cluster/cluster.conf (by default) ldapconfig reads from an LDAP server Both of these have their uses of course. But, especially in a virtualised environment, they are not particularly handy. LDAP is cumbersome and complicated to configure and we have no real tools to update it directly. Reading from local files seems a bit primitive in a virtualised system where you have at least one host system that is guaranteed to be available because it is actually running the guests themselves. Yes, we have ccs_sync that will synchronise the configuration across nodes, and that is integrated into 'cman_tool version' now, but it still seems cumbersome to me and doesn't work properly if a cluster node is down when the configuration is changed. What I think would be nice would be to update the configuration on a host node (dom0 in Xen parlance) and have the other nodes automatically pick it up from that one place. One suggested way of doing this is to host the cluster.conf file on an HTTP server on the host. The VMs can then simply GET that file and run it through xmlconfig to load it into corosync. This seems simple and easy to me. Most people know how to set up an http server (even if they don't know how to secure one!) The intention is to make this as easy as possible, so that the admin staff simply supply a URL and corosync does the right thing to fetch it when needed (at boot up and re-configure time). This magic is brought to you courtesy of pluggable configuration modules in corosync :) Comments? Better ideas? Or am I just barking ? 
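To make the fetch side concrete, the guest end of this need be no more than an HTTP GET of the file the host already has - something along the lines of the sketch below, where the hostname and paths are invented and the real mechanism would sit inside a corosync configuration plugin rather than a shell script:

# on the host (dom0): publish the current config through the existing httpd
cp /etc/cluster/cluster.conf /var/www/html/cluster.conf

# on each guest: fetch it at boot or on a reconfigure
wget -q -O /tmp/cluster.conf.new http://dom0.example.com/cluster.conf &&
    mv /tmp/cluster.conf.new /etc/cluster/cluster.conf

Fetching into a temporary file first means a failed download never overwrites the last known-good copy.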
chrissie From swhiteho at redhat.com Tue Oct 27 11:17:10 2009 From: swhiteho at redhat.com (Steven Whitehouse) Date: Tue, 27 Oct 2009 11:17:10 +0000 Subject: [Linux-cluster] RFC: Configuration of Red Hat clusters in Cluster 3 In-Reply-To: <4AE6D470.9010700@redhat.com> References: <4AE6D470.9010700@redhat.com> Message-ID: <1256642230.2669.44.camel@localhost.localdomain> Hi, On Tue, 2009-10-27 at 11:07 +0000, Christine Caulfield wrote: > Hi all, > > I've been thinking about configuring clusters, particularly in a > virtualised environment. The solutions we have for this are not > especially good I don't think. > > We currently have two loaders: > > xmlconfig is the default and reads from /etc/cluster/cluster.conf (by > default) > > ldapconfig reads from an LDAP server > > Both of these have their uses of course. But, especially in a > virtualised environment, they are not particularly handy. LDAP is > cumbersome and complicated to configure and we have no real tools to > update it directly. Reading from local files seems a bit primitive in a > virtualised system where you have at least one host system that is > guaranteed to be available because it is actually running the guests > themselves. > > Yes, we have ccs_sync that will synchronise the configuration across > nodes, and that is integrated into 'cman_tool version' now, but it still > seems cumbersome to me and doesn't work properly if a cluster node is > down when the configuration is changed. > > What I think would be nice would be to update the configuration on a > host node (dom0 in Xen parlance) and have the other nodes automatically > pick it up from that one place. > > One suggested way of doing this is to host the cluster.conf file on an > HTTP server on the host. The VMs can then simply GET that file and run > it through xmlconfig to load it into corosync. This seems simple and > easy to me. Most people know how to set up an http server (even if they > don't know how to secure one!) > > The intention is to make this as easy as possible, so that the admin > staff simply supply a URL and corosync does the right thing to fetch it > when needed (at boot up and re-configure time). This magic is brought to > you courtesy of pluggable configuration modules in corosync :) > > Comments? Better ideas? Or am I just barking ? > > That sounds like a good idea to me. Better still if it can be extended in the future to pull other bits of config too, Steve. From arwin.tugade at csun.edu Tue Oct 27 17:18:07 2009 From: arwin.tugade at csun.edu (Arwin L Tugade) Date: Tue, 27 Oct 2009 10:18:07 -0700 Subject: [Linux-cluster] High availability mail server In-Reply-To: <1256640274.2521.4.camel@dell-jr.intern.win-rar.com> References: <459616.84400.qm@web88301.mail.re4.yahoo.com> <1256640274.2521.4.camel@dell-jr.intern.win-rar.com> Message-ID: <6708F96BBF31F846BFA56EC0AE37D62284C57D7DAA@CSUN-EX-V01.csun.edu> Sounds nice, but it also appears if you lose a backend server, then half your users can't access their mailbox. With the 2/2b, the vip would float to a healthy node and mail would still be available, expect during that moment in time where it floats. What's the end users experience at that moment? Do you failback/not failback immediately (say it got fenced)? I guess it boils down to what's acceptable. 
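On the "which backend owns which user" point raised earlier in the thread: the pinning does not have to be a static a-h / i-z table. A front end can derive the backend from a hash of the username, which keeps the mapping stable for as long as the number of backends is unchanged. A rough sketch of the idea (the backend names and the count of four are invented):

#!/bin/bash
# pick_backend.sh - map a username onto one of N mail backends via a hash
user="$1"
backends=(mail1 mail2 mail3 mail4)        # invented hostnames

# cksum gives a stable 32-bit checksum of the username
hash=$(printf '%s' "$user" | cksum | awk '{print $1}')
index=$(( hash % ${#backends[@]} ))
echo "${backends[$index]}"

Running "./pick_backend.sh alice" always prints the same backend for the same username; whatever proxy sits in front of POP/IMAP would then use that name to route the connection. Note that changing the number of backends remaps most users, so growing the pool still means migrating mailboxes.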
-----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of jr Sent: Tuesday, October 27, 2009 3:45 AM To: linux clustering Subject: Re: [Linux-cluster] High availability mail server The good news is that the Cyrus IMAP Server already has a solution for that, it's called "Murder" and "Aggregator": http://cyrusimap.web.cmu.edu/ag.html regards, Johannes > what about using a combination of 3 and 2b: > 3b- split your users in a set of servers which use ext3 FS but are part of a cluster, the servers are really services of a cluster (IP and FS are resources of a CLuster Service) so, if a server fail its service can be migrated to another node of the cluster > > let say for example > users starting with letter 'a' to users starting with letter 'h' will "assigned" to MailA , users from 'i' to 'z' will be assigned to MailB. > MAilA --> ipA and filesystemA > MAilB --> ipB and filesystemB > > Cluster ServiceA will have resource ipA and filesystemA > Cluster ServiceB will have resource ipB and filesystemB > > and ServiceA will be configured to run in nodeA, while ServiceB will be set to run in nodeB of the cluster, but will be set to failover to nodeC (standby server) > > the hard part of this is how to balance the users between MailA and MailB (and MailC , D, E). Changing the value of "mail_host" in user attr (if using a Directory Service) and moving user's email from one filesystem to another > > this is just food for the brain, the scenario could be as complex as you like :-), but definitely is no good idea to have GFS for mail servers if the clients can connected from multiple sources and dont have a "proxy" to tunnel all request for same user to same backend server. > > thanks > roger > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From teigland at redhat.com Tue Oct 27 18:59:33 2009 From: teigland at redhat.com (David Teigland) Date: Tue, 27 Oct 2009 13:59:33 -0500 Subject: [Linux-cluster] RFC: Configuration of Red Hat clusters in Cluster 3 In-Reply-To: <4AE6D470.9010700@redhat.com> References: <4AE6D470.9010700@redhat.com> Message-ID: <20091027185933.GA28696@redhat.com> On Tue, Oct 27, 2009 at 11:07:28AM +0000, Christine Caulfield wrote: > What I think would be nice would be to update the configuration on a > host node (dom0 in Xen parlance) and have the other nodes automatically > pick it up from that one place. > > One suggested way of doing this is to host the cluster.conf file on an > HTTP server on the host. The VMs can then simply GET that file and run > it through xmlconfig to load it into corosync. This seems simple and > easy to me. Most people know how to set up an http server (even if they > don't know how to secure one!) > > The intention is to make this as easy as possible, so that the admin > staff simply supply a URL and corosync does the right thing to fetch it > when needed (at boot up and re-configure time). This magic is brought to > you courtesy of pluggable configuration modules in corosync :) > > Comments? Better ideas? Or am I just barking ? 
I think it's great, and I think it should be extended to allow for a single http server on, say, a central management server, so there are no distribution issues (this was the original intention with ccsd eons ago, but no one else liked the central config server so it was abandoned, leaving shared devices and local files. Shared devices were abandoned next and then came the big wrong turn of trying to sync the local files.)

Dave

From fdinitto at redhat.com  Tue Oct 27 19:21:50 2009
From: fdinitto at redhat.com (Fabio M. Di Nitto)
Date: Tue, 27 Oct 2009 20:21:50 +0100
Subject: [Linux-cluster] RFC: Configuration of Red Hat clusters in Cluster 3
In-Reply-To: <4AE6D470.9010700@redhat.com>
References: <4AE6D470.9010700@redhat.com>
Message-ID: <4AE7484E.5040908@redhat.com>

Christine Caulfield wrote: > The intention is to make this as easy as possible, so that the admin > staff simply supply a URL and corosync does the right thing to fetch it > when needed (at boot up and re-configure time). This magic is brought to > you courtesy of pluggable configuration modules in corosync :) > > Comments? Better ideas? Or am I just barking ? >

It seems like there is some agreement in the area. I have 2 suggestions:

1) ricci/luci already rely on a running httpd. Maybe we should consider integrating this properly so we don't end up with 2 httpds simply to create a config and distribute it.

2) if we use httpd only to distribute cluster.conf, then I'd like to see "httploader" (or please find a better name) being a wrapper for wget rather than a brand new piece of code. It will allow us to automatically gain access to ftp, https, and it has all the bits required for username/password handling (in case somebody wants to secure httpd properly).

One note that needs investigation: in our original configloader design, we always placed ourselves in this position: xmlconfig:barmodule:cman-preconfig:bazmodule

I don't believe we ever tested: loaderfoo:xmlconfig:barmodule:cman-preconfig:bazmodule

I am pretty sure it will work, but it might need extra testing and caution in order to pass the right data from loader* to *config as it drastically increases the matrix of code paths.

Fabio

From teigland at redhat.com  Tue Oct 27 19:32:39 2009
From: teigland at redhat.com (David Teigland)
Date: Tue, 27 Oct 2009 14:32:39 -0500
Subject: [Linux-cluster] RFC: Configuration of Red Hat clusters in Cluster 3
In-Reply-To: <4AE7484E.5040908@redhat.com>
References: <4AE6D470.9010700@redhat.com> <4AE7484E.5040908@redhat.com>
Message-ID: <20091027193239.GC28696@redhat.com>

On Tue, Oct 27, 2009 at 08:21:50PM +0100, Fabio M. Di Nitto wrote: > 2) if we use httpd only to distribute cluster.conf, then I'd like to see > "httploader" (or please find a better name) being a wrapper for wget > rather than a brand new piece of code. It will allow us to automatically > gain access to ftp, https, and it has all the bits required for > username/password handling (in case somebody wants to secure httpd > properly).

Yes, I like wget.

Dave

From bohdan at harazd.net  Wed Oct 28 09:17:11 2009
From: bohdan at harazd.net (Bohdan Sydor)
Date: Wed, 28 Oct 2009 10:17:11 +0100
Subject: [Linux-cluster] Progress OpenEdge and RHCS
Message-ID: <4AE80C17.9090506@harazd.net>

Hi all,

My customer's goal is to run Progress OpenEdge 10.2A on RHEL 5.4. It should be a HA service with two nodes and FC-connected shared storage. I've reviewed the group archive and googled too, but didn't find anything suitable. Did anyone here configure Progress using RHCS?
Was there anything to be careful about? Red Hat does not provide an OCF file for Progress, so I'll also be glad for any hints how to monitor the db status for OpenEdge. -- Regards Bohdan From jakov.sosic at srce.hr Wed Oct 28 10:03:37 2009 From: jakov.sosic at srce.hr (Jakov Sosic) Date: Wed, 28 Oct 2009 11:03:37 +0100 Subject: [Linux-cluster] service state unchanged when host crashes In-Reply-To: References: <20091027103823.4f23993d@nb-jsosic> Message-ID: <20091028110337.2a6d23b5@nb-jsosic> On Tue, 27 Oct 2009 09:57:50 -0000 "Martin Waite" wrote: > I am running Debian Lenny 64-bit. Is that going to be a problem for > me ? Well maybe. Last time I tried RedHat Cluster Suite on Debian Lenny was two months ago, and then I had stumbled upon the following bug: http://www.mail-archive.com/linux-cluster at redhat.com/msg06018.html I don't know if they have fixed that bug... but it resembles totally to your problem... Node goes down, node gets fenced, service is seen as down by rgmanager, but there is no action to relocate it to a live cluster member. That was a start of a project for me, so after that I migrated to CentOS 5 (which is a free RHEL fork). > I think you have given me enough of a pointer - ie. I haven't > configured fencing properly - to get me going again. Thanks. I can see that from the logs now :) If you get to the point where bug that I explained earlier pops up, please share that information here so that we know the state of RHCS on Debian. -- | Jakov Sosic | ICQ: 28410271 | PGP: 0x965CAE2D | ================================================================= | start fighting cancer -> http://www.worldcommunitygrid.org/ | From g.bagnoli at asidev.com Wed Oct 28 10:20:00 2009 From: g.bagnoli at asidev.com (Giacomo Bagnoli) Date: Wed, 28 Oct 2009 11:20:00 +0100 Subject: [Linux-cluster] service state unchanged when host crashes In-Reply-To: <20091028110337.2a6d23b5@nb-jsosic> References: <20091027103823.4f23993d@nb-jsosic> <20091028110337.2a6d23b5@nb-jsosic> Message-ID: <1256725200.6265.14.camel@ubik> Il giorno mer, 28/10/2009 alle 11.03 +0100, Jakov Sosic ha scritto: > On Tue, 27 Oct 2009 09:57:50 -0000 > "Martin Waite" wrote: > > > I am running Debian Lenny 64-bit. Is that going to be a problem for > > me ? > > Well maybe. Last time I tried RedHat Cluster Suite on Debian Lenny was > two months ago, and then I had stumbled upon the following bug: > > http://www.mail-archive.com/linux-cluster at redhat.com/msg06018.html > > I don't know if they have fixed that bug... but it resembles totally to > your problem... Node goes down, node gets fenced, service is seen as > down by rgmanager, but there is no action to relocate it to a live > cluster member. That was a start of a project for me, so after that I > migrated to CentOS 5 (which is a free RHEL fork). I've had this bug too using Gentoo and RHCS stable-2. After a bit of investigation I've found the bug and the solution (a small, tiny, trivial 2-line patch to fenced) I've already posted on RH bugzilla but got no response. https://bugzilla.redhat.com/show_bug.cgi?id=512512 The patch worked out for me and cluster is in production from august. Hope it helps. -- Ing. Giacomo Bagnoli Asidev s.r.l. Via Osteria Bianca, 108/A 50053 Empoli (Firenze) E-mail: g.bagnoli at asidev.com Web: www.asidev.com -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 198 bytes Desc: This is a digitally signed message part URL: From Martin.Waite at datacash.com Wed Oct 28 10:21:02 2009 From: Martin.Waite at datacash.com (Martin Waite) Date: Wed, 28 Oct 2009 10:21:02 -0000 Subject: [Linux-cluster] service state unchanged when host crashes In-Reply-To: <20091028110337.2a6d23b5@nb-jsosic> References: <20091027103823.4f23993d@nb-jsosic> <20091028110337.2a6d23b5@nb-jsosic> Message-ID: Hi Jakov, I managed to get fencing working - at least enough for my experiments. Sure enough, I hit the same problem: clusternode30 is running service "SENTINEL" - and then is powered down at ~ 18:19 Oct 27 18:19:55 clusternode27 clurgmgrd[2785]: Membership Change Event Oct 27 18:19:55 clusternode27 clurgmgrd[2785]: State change: clusternode30 DOWN Oct 27 18:19:55 clusternode27 clurgmgrd[2785]: Membership Change Event Oct 27 18:19:55 clusternode27 clurgmgrd[2785]: Membership Change Event Oct 27 18:19:55 clusternode27 fenced[2760]: clusternode30 not a cluster member after 0 sec post_fail_delay Oct 27 18:19:55 clusternode27 fenced[2760]: fencing node "clusternode30" Oct 27 18:19:55 clusternode27 fenced[2760]: can't get node number for node p@#001 Oct 27 18:19:55 clusternode27 fenced[2760]: fence "clusternode30" success Oct 27 18:19:55 clusternode27 clurgmgrd[2785]: 22 rules loaded (The "can't get node number" looks suspicious, but fenced claims to succeed). Next morning - it still hasn't relocated the service: Cluster Status for testcluster @ Wed Oct 28 11:13:55 2009 Member Status: Quorate Member Name ID Status ------ ---- ---- ------ clusternode27 27 Online, Local, rgmanager clusternode28 28 Online, rgmanager clusternode30 30 Offline Service Name Owner (Last) State ------- ---- ----- ------ ----- service:SCRIPT clusternode28 started service:SENTINEL clusternode30 started service:VIP clusternode27 started service:mysql_authdb_service clusternode27 started I am going to strip my config down later on so that SENTINEL is the only running service. My fencing mechanism is pretty pathetic - I have added a new fence agent that does nothing but always succeeds (which I hope is enough for this stage in my education) - but my understanding is that the sequence of events should be something like this: 1. 2. cman notices (and groupd) 3. fencing is applied to the node 4. the service is relocated - or marked as failed regards, Martin -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jakov Sosic Sent: 28 October 2009 10:04 To: linux-cluster at redhat.com Subject: Re: [Linux-cluster] service state unchanged when host crashes On Tue, 27 Oct 2009 09:57:50 -0000 "Martin Waite" wrote: > I am running Debian Lenny 64-bit. Is that going to be a problem for > me ? Well maybe. Last time I tried RedHat Cluster Suite on Debian Lenny was two months ago, and then I had stumbled upon the following bug: http://www.mail-archive.com/linux-cluster at redhat.com/msg06018.html I don't know if they have fixed that bug... but it resembles totally to your problem... Node goes down, node gets fenced, service is seen as down by rgmanager, but there is no action to relocate it to a live cluster member. That was a start of a project for me, so after that I migrated to CentOS 5 (which is a free RHEL fork). > I think you have given me enough of a pointer - ie. I haven't > configured fencing properly - to get me going again. Thanks. 
I can see that from the logs now :) If you get to the point where bug that I explained earlier pops up, please share that information here so that we know the state of RHCS on Debian. -- | Jakov Sosic | ICQ: 28410271 | PGP: 0x965CAE2D | ================================================================= | start fighting cancer -> http://www.worldcommunitygrid.org/ | -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From Martin.Waite at datacash.com Wed Oct 28 10:24:10 2009 From: Martin.Waite at datacash.com (Martin Waite) Date: Wed, 28 Oct 2009 10:24:10 -0000 Subject: [Linux-cluster] service state unchanged when host crashes In-Reply-To: <1256725200.6265.14.camel@ubik> References: <20091027103823.4f23993d@nb-jsosic><20091028110337.2a6d23b5@nb-jsosic> <1256725200.6265.14.camel@ubik> Message-ID: Hi, This does look like the same problem: martin at clusternode27:~$ sudo /usr/sbin/cman_tool -f nodes Node Sts Inc Joined Name 27 M 44 2009-10-27 14:59:33 clusternode27 28 M 52 2009-10-27 14:59:49 clusternode28 30 X 64 clusternode30 Node has not been fenced since it went down Even though the message in syslog claims that the fencing succeeded. Thanks for the help. regards, Martin -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Giacomo Bagnoli Sent: 28 October 2009 10:20 To: linux clustering Subject: Re: [Linux-cluster] service state unchanged when host crashes Il giorno mer, 28/10/2009 alle 11.03 +0100, Jakov Sosic ha scritto: > On Tue, 27 Oct 2009 09:57:50 -0000 > "Martin Waite" wrote: > > > I am running Debian Lenny 64-bit. Is that going to be a problem for > > me ? > > Well maybe. Last time I tried RedHat Cluster Suite on Debian Lenny was > two months ago, and then I had stumbled upon the following bug: > > http://www.mail-archive.com/linux-cluster at redhat.com/msg06018.html > > I don't know if they have fixed that bug... but it resembles totally > to your problem... Node goes down, node gets fenced, service is seen > as down by rgmanager, but there is no action to relocate it to a live > cluster member. That was a start of a project for me, so after that I > migrated to CentOS 5 (which is a free RHEL fork). I've had this bug too using Gentoo and RHCS stable-2. After a bit of investigation I've found the bug and the solution (a small, tiny, trivial 2-line patch to fenced) I've already posted on RH bugzilla but got no response. https://bugzilla.redhat.com/show_bug.cgi?id=512512 The patch worked out for me and cluster is in production from august. Hope it helps. -- Ing. Giacomo Bagnoli Asidev s.r.l. Via Osteria Bianca, 108/A 50053 Empoli (Firenze) E-mail: g.bagnoli at asidev.com Web: www.asidev.com From fdinitto at redhat.com Wed Oct 28 10:36:30 2009 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Wed, 28 Oct 2009 11:36:30 +0100 Subject: [Linux-cluster] ccs_config_validate in cluster 3.0.X Message-ID: <4AE81EAE.3040604@redhat.com> Hi everybody, as briefly mentioned in 3.0.4 release note, a new system to validate the configuration has been enabled in the code. What it does ------------ The general idea is to be able to perform as many sanity checks on the configuration as possible. This check allows us to spot the most common mistakes, such as typos or possibly invalid values, in cluster.conf. Configuring the validation -------------------------- The validation system is integrated in several components. 
It supports one config option that can take 3 values. Via init script (or /etc/sysconfig/cman or distro equivalent): CONFIG_VALIDATION=value values can be: 1) FAIL - enables a very strict check. Even a simple typo will fail to load the configuration. 2) WARN - the check is relaxed. Warnings are printed on the screen, but the cluster will continue to load. (default) 3) NONE - disable the config validation system. (discouraged!) this is equivalent to: cman_tool join/version -D(FAIL|WARN|NONE) What a user sees ---------------- The output of the validation process is very cryptic. Yes we are absolutely aware of that and we are working on making it easy to understand (if anybody has relax-ng experience, please contact us). This is the typical output from a normal startup (configuration contains no errors or warnings): [root at fedora-rh-node1 ~]# /etc/init.d/cman start join Starting cluster: Global setup... [ OK ] Loading kernel modules... [ OK ] Mounting configfs... [ OK ] Setting network parameters... [ OK ] Starting cman... [ OK ] [root at fedora-rh-node1 ~]# This is the output with a typo in cluster.conf (running in WARN mode): [root at fedora-rh-node1 ~]# /etc/init.d/cman start join Starting cluster: Global setup... [ OK ] Loading kernel modules... [ OK ] Mounting configfs... [ OK ] Setting network parameters... [ OK ] Starting cman... tempfile:22: element quorum: Relax-NG validity error : Element cluster has extra content: quorum Configuration fails to validate [ OK ] The error in this specific case is that quorum element is wrong and should be quorumd.. (for qdisk). As you can see yourself, the output is not easy to understand without a good understanding of Relax-NG. The check also happens before configuration updates using via cman_tool version. Here are 3 examples (i use -S to disable configuration synchronization on my systems): [root at fedora-rh-node1 ~]# cman_tool version -r 2 -S [root at fedora-rh-node1 ~]# cman_tool defaults to strict check, the same typo as above will abort the configuration reload: [root at fedora-rh-node4 ~]# cman_tool version -r 3 -S tempfile:22: element quorum: Relax-NG validity error : Element cluster has extra content: quorum Configuration fails to validate cman_tool: Not reloading, configuration is not valid Disable the strict check and turn errors into warnings: [root at fedora-rh-node1 ~]# cman_tool version -r 3 -S -DWARN tempfile:22: element quorum: Relax-NG validity error : Element cluster has extra content: quorum Configuration fails to validate [root at fedora-rh-node1 ~]# What to do if there are errors ------------------------------ First of all do NOT panic. This check integration is new and there might be several reasons why you see a warning (including bugs in the validation schema). Users with XML and Relax-NG experience should be able to sort it out simply. For all the others we strongly recommend you to file a bug on bugzilla.redhat.com, including /etc/cluster/cluster.conf _AND_ /usr/share/cluster/cluster.rng. This will allow us to cross check bugs in our validation code/schema and help users fixing their configuration files. Using ccs_config_validate standalone command -------------------------------------------- Validation of a configuration is an important step. ccs_config_validate is a very powerful and flexible tool, but requires understanding of the config subsystem to be used correctly. The general/average user can simply invoke ccs_config_validate with no options and will see the same results as when invoked via cman_tool. 
This is achieved by loading the same environment variables as cman init script and respecting those selections, it will perform the required actions. There are advanced use cases and usage of the tool, for example to migrate from one config subsystem to another (cluster.conf to ldap for example), but, generally, anyone who needs to do changes of this magnitude is also expected to have a good understanding of the configuration subsystem (a new document will be available shortly for both developers and advanced users). Please do not hesitate to ask for clarifications or report bugs. Cheers Fabio From Martin.Waite at datacash.com Wed Oct 28 11:10:07 2009 From: Martin.Waite at datacash.com (Martin Waite) Date: Wed, 28 Oct 2009 11:10:07 -0000 Subject: [Linux-cluster] service state unchanged when host crashes In-Reply-To: <1256725200.6265.14.camel@ubik> References: <20091027103823.4f23993d@nb-jsosic><20091028110337.2a6d23b5@nb-jsosic> <1256725200.6265.14.camel@ubik> Message-ID: Hi Giacomo, Given the nature of the bug, does this mean that the unpatched cluster code is unable to relocate services in the event of node failure ? If so, how can anyone be using this ? regards, Martin -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Giacomo Bagnoli Sent: 28 October 2009 10:20 To: linux clustering Subject: Re: [Linux-cluster] service state unchanged when host crashes Il giorno mer, 28/10/2009 alle 11.03 +0100, Jakov Sosic ha scritto: > On Tue, 27 Oct 2009 09:57:50 -0000 > "Martin Waite" wrote: > > > I am running Debian Lenny 64-bit. Is that going to be a problem for > > me ? > > Well maybe. Last time I tried RedHat Cluster Suite on Debian Lenny was > two months ago, and then I had stumbled upon the following bug: > > http://www.mail-archive.com/linux-cluster at redhat.com/msg06018.html > > I don't know if they have fixed that bug... but it resembles totally > to your problem... Node goes down, node gets fenced, service is seen > as down by rgmanager, but there is no action to relocate it to a live > cluster member. That was a start of a project for me, so after that I > migrated to CentOS 5 (which is a free RHEL fork). I've had this bug too using Gentoo and RHCS stable-2. After a bit of investigation I've found the bug and the solution (a small, tiny, trivial 2-line patch to fenced) I've already posted on RH bugzilla but got no response. https://bugzilla.redhat.com/show_bug.cgi?id=512512 The patch worked out for me and cluster is in production from august. Hope it helps. -- Ing. Giacomo Bagnoli Asidev s.r.l. Via Osteria Bianca, 108/A 50053 Empoli (Firenze) E-mail: g.bagnoli at asidev.com Web: www.asidev.com From kkovachev at varna.net Wed Oct 28 11:36:12 2009 From: kkovachev at varna.net (Kaloyan Kovachev) Date: Wed, 28 Oct 2009 13:36:12 +0200 Subject: [Linux-cluster] RFC: Configuration of Red Hat clusters in Cluster 3 In-Reply-To: <20091027193239.GC28696@redhat.com> References: <4AE6D470.9010700@redhat.com> <4AE7484E.5040908@redhat.com> <20091027193239.GC28696@redhat.com> Message-ID: <20091028113526.M93566@varna.net> On Tue, 27 Oct 2009 14:32:39 -0500, David Teigland wrote > On Tue, Oct 27, 2009 at 08:21:50PM +0100, Fabio M. Di Nitto wrote: > > 2) if we use httpd only to distribute cluster.conf, then I?d like to see > > "httploader" (or please find a better a name) being a wrapper for wget > > rather than a brand new piece of code. 
It will allow us to automatically > > gain access to ftp, https, and it has all the bits required for > > username/password handling (in case somebody wants to secure httpd > > properly). > > Yes, I like wget. > > Dave > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From kkovachev at varna.net Wed Oct 28 11:51:08 2009 From: kkovachev at varna.net (Kaloyan Kovachev) Date: Wed, 28 Oct 2009 13:51:08 +0200 Subject: [Linux-cluster] RFC: Configuration of Red Hat clusters in Cluster 3 In-Reply-To: <20091027193239.GC28696@redhat.com> References: <4AE6D470.9010700@redhat.com> <4AE7484E.5040908@redhat.com> <20091027193239.GC28696@redhat.com> Message-ID: <20091028113723.M90906@varna.net> On Tue, 27 Oct 2009 14:32:39 -0500, David Teigland wrote > On Tue, Oct 27, 2009 at 08:21:50PM +0100, Fabio M. Di Nitto wrote: > > 2) if we use httpd only to distribute cluster.conf, then I?d like to see > > "httploader" (or please find a better a name) being a wrapper for wget > > rather than a brand new piece of code. It will allow us to automatically > > gain access to ftp, https, and it has all the bits required for > > username/password handling (in case somebody wants to secure httpd > > properly). > > Yes, I like wget. > I think it is very easy for an admin to include "wget -o /etc/cluster/cluster.conf" inside cman start or even add a hash check of the downloaded (in /tmp) file before moving it to /etc/cluster so there is no need for this to be a module for the initial boot. This will help in the case when a node was offline. For the update on a running node, multicast from/to the cluster members is much better i think. Is it possible on startup to read from the local file only the cluster name, version, multicast address and the key file, then to 'request' the current version number and new config if newer from the running nodes? > Dave > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From gianluca.cecchi at gmail.com Wed Oct 28 12:00:34 2009 From: gianluca.cecchi at gmail.com (Gianluca Cecchi) Date: Wed, 28 Oct 2009 13:00:34 +0100 Subject: [Linux-cluster] ccs_config_validate in cluster 3.0.X Message-ID: <561c252c0910280500y360c2ec5ud01f2ed02713c935@mail.gmail.com> On Wed, 28 Oct 2009 11:36:30 +0100 Fabio M. Di Nitto wrote: > Hi everybody, > > as briefly mentioned in 3.0.4 release note, a new system to validate the > configuration has been enabled in the code. Hello, updated my F11 today from cman-3.0.3-1.fc11.x86_64 to cman-3.0.4-1.fc11.x86_64 I noticed the messages you referred. See the attached image. But going to do a "man fenced" it seems my syntax is correct... or could I change anything? Here an excerpt of my cluster.conf The cluster then starts without any problem btw. The other node is currently powered off [root at virtfed ~]# clustat Cluster Status for kvm @ Wed Oct 28 12:41:43 2009 Member Status: Quorate Member Name ID Status ------ ---- ---- ------ kvm1 1 Online, Local, rgmanager kvm2 2 Offline Service Name Owner (Last) State ------- ---- ----- ------ ----- service:DRBDNODE1 kvm1 started service:DRBDNODE2 (none) stopped Two questions: 1) now with cluster 3.x what is the correct way to update the config and propagate? ccs_tool command + cman_tool version or what? 2) I continue to have modclusterd and ricci services that die... how can I debug? Are they necessary at all in cluster 3.x? 
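Returning to the wget-based cluster.conf distribution idea raised earlier in the thread: a minimal sketch of such a fetch-and-verify step, with the URL, checksum file and hash tool chosen purely for illustration (none of them come from the original mails):

wget -q -O /tmp/cluster.conf http://configserver/cluster.conf
wget -q -O /tmp/cluster.conf.sha1 http://configserver/cluster.conf.sha1
( cd /tmp && sha1sum -c cluster.conf.sha1 ) || exit 1    # refuse a corrupted or truncated download
mv /tmp/cluster.conf /etc/cluster/cluster.conf

Note that wget's lower-case -o names its log file; -O is the option that names the downloaded file.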
Thanks, Gianluca -------------- next part -------------- An HTML attachment was scrubbed... URL: From ccaulfie at redhat.com Wed Oct 28 12:39:55 2009 From: ccaulfie at redhat.com (Christine Caulfield) Date: Wed, 28 Oct 2009 12:39:55 +0000 Subject: [Linux-cluster] RFC: Configuration of Red Hat clusters in Cluster 3 In-Reply-To: <20091028113723.M90906@varna.net> References: <4AE6D470.9010700@redhat.com> <4AE7484E.5040908@redhat.com> <20091027193239.GC28696@redhat.com> <20091028113723.M90906@varna.net> Message-ID: <4AE83B9B.7030705@redhat.com> On 28/10/09 11:51, Kaloyan Kovachev wrote: > On Tue, 27 Oct 2009 14:32:39 -0500, David Teigland wrote >> On Tue, Oct 27, 2009 at 08:21:50PM +0100, Fabio M. Di Nitto wrote: >>> 2) if we use httpd only to distribute cluster.conf, then I?d like to see >>> "httploader" (or please find a better a name) being a wrapper for wget >>> rather than a brand new piece of code. It will allow us to automatically >>> gain access to ftp, https, and it has all the bits required for >>> username/password handling (in case somebody wants to secure httpd >>> properly). >> >> Yes, I like wget. >> > > I think it is very easy for an admin to include "wget -o > /etc/cluster/cluster.conf" inside cman start or even add a hash check of the > downloaded (in /tmp) file before moving it to /etc/cluster so there is no need > for this to be a module for the initial boot. This will help in the case when > a node was offline. > > For the update on a running node, multicast from/to the cluster members is > much better i think. > > Is it possible on startup to read from the local file only the cluster name, > version, multicast address and the key file, then to 'request' the current > version number and new config if newer from the running nodes? > Having a different protocol for initial join and updates is a non-starter, it's far too complicated and inconsistent. And if sysadmin adds the wget to the startup sequence then that only fixes things until the configuration needs updating ... then we're back to manually distributing the file, so it really doesn't solve anything. Multicasting a whole config file is also fraught with difficulties - some of which were exemplified by the now-abandoned CCSD. Chrissie From g.bagnoli at asidev.com Wed Oct 28 12:55:38 2009 From: g.bagnoli at asidev.com (Giacomo Bagnoli) Date: Wed, 28 Oct 2009 13:55:38 +0100 Subject: [Linux-cluster] service state unchanged when host crashes In-Reply-To: References: <20091027103823.4f23993d@nb-jsosic> <20091028110337.2a6d23b5@nb-jsosic> <1256725200.6265.14.camel@ubik> Message-ID: <1256734538.6265.61.camel@ubik> Il giorno mer, 28/10/2009 alle 11.10 +0000, Martin Waite ha scritto: > Hi Giacomo, > > Given the nature of the bug, does this mean that the unpatched cluster code is unable to relocate services in the event of node failure ? > > If so, how can anyone be using this ? > > regards, > Martin The RHEL5 branch is not affected by this bug as the code is slighty different for what I've seen, only STABLE2 branch is. I'm assuming is far more used than the STABLE2 (but, yes, I can be wrong, just guessing). I don't know if some other distributions are affected apart from gentoo and debian, or if debian and gentoo are _always_ affected... -- Ing. Giacomo Bagnoli Asidev s.r.l. Via Osteria Bianca, 108/A 50053 Empoli (Firenze) E-mail: g.bagnoli at asidev.com Web: www.asidev.com -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 198 bytes Desc: This is a digitally signed message part URL: From gianluca.cecchi at gmail.com Wed Oct 28 13:41:03 2009 From: gianluca.cecchi at gmail.com (Gianluca Cecchi) Date: Wed, 28 Oct 2009 14:41:03 +0100 Subject: [Linux-cluster] ccs_config_validate in cluster 3.0.X In-Reply-To: <561c252c0910280500y360c2ec5ud01f2ed02713c935@mail.gmail.com> References: <561c252c0910280500y360c2ec5ud01f2ed02713c935@mail.gmail.com> Message-ID: <561c252c0910280641l423a8dcapa3b4e7fadca60ed9@mail.gmail.com> On Wed, Oct 28, 2009 at 1:00 PM, Gianluca Cecchi wrote: > [snip] Hello, > updated my F11 today from cman-3.0.3-1.fc11.x86_64 to > cman-3.0.4-1.fc11.x86_64 > > I noticed the messages you referred. See the attached image. > Oops, here is the message image... -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: virtfed_error_cman.png Type: image/png Size: 8918 bytes Desc: not available URL: From jakov.sosic at srce.hr Wed Oct 28 13:54:04 2009 From: jakov.sosic at srce.hr (Jakov Sosic) Date: Wed, 28 Oct 2009 14:54:04 +0100 Subject: [Linux-cluster] service state unchanged when host crashes In-Reply-To: References: <20091027103823.4f23993d@nb-jsosic> <20091028110337.2a6d23b5@nb-jsosic> <1256725200.6265.14.camel@ubik> Message-ID: <20091028145404.0d48409c@nb-jsosic> On Wed, 28 Oct 2009 11:10:07 -0000 "Martin Waite" wrote: > Given the nature of the bug, does this mean that the unpatched > cluster code is unable to relocate services in the event of node > failure ? Yes, that means exactly that. > If so, how can anyone be using this ? Well, they can't :) As I've mentioned earlier, I've switched to CentOS after I found out this bug. You can rebuild your own debian package with provided patch, or you can try to convince debian package maintainers to include patch into distribution... -- | Jakov Sosic | ICQ: 28410271 | PGP: 0x965CAE2D | ================================================================= | start fighting cancer -> http://www.worldcommunitygrid.org/ | From fdinitto at redhat.com Wed Oct 28 14:04:13 2009 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Wed, 28 Oct 2009 15:04:13 +0100 Subject: [Linux-cluster] ccs_config_validate in cluster 3.0.X In-Reply-To: <561c252c0910280500y360c2ec5ud01f2ed02713c935@mail.gmail.com> References: <561c252c0910280500y360c2ec5ud01f2ed02713c935@mail.gmail.com> Message-ID: <4AE84F5D.6070504@redhat.com> Gianluca Cecchi wrote: > On Wed, 28 Oct 2009 11:36:30 +0100 Fabio M. Di Nitto wrote: > >> Hi everybody, >> >> as briefly mentioned in 3.0.4 release note, a new system to validate the >> configuration has been enabled in the code. > > Hello, > updated my F11 today from cman-3.0.3-1.fc11.x86_64 to > cman-3.0.4-1.fc11.x86_64 How did you upgrade? The update in F-11 can?t update unless you pull in manually an unreleased version of fence-agents. > > I noticed the messages you referred. See the attached image. Pretty please, file a bugzilla as I asked in the first email with text and not images and include both requested files. As I said, it?s entirely possible that our schema is buggy, but I need to be able to track the errors and assign them to the correct maintainers. Cheers, Fabio From fdinitto at redhat.com Wed Oct 28 14:27:45 2009 From: fdinitto at redhat.com (Fabio M. 
Di Nitto) Date: Wed, 28 Oct 2009 15:27:45 +0100 Subject: Updating cluster.conf (was Re: [Linux-cluster] ccs_config_validate in cluster 3.0.X) In-Reply-To: <561c252c0910280500y360c2ec5ud01f2ed02713c935@mail.gmail.com> References: <561c252c0910280500y360c2ec5ud01f2ed02713c935@mail.gmail.com> Message-ID: <4AE854E1.5050401@redhat.com> Gianluca Cecchi wrote: > On Wed, 28 Oct 2009 11:36:30 +0100 Fabio M. Di Nitto wrote: > > Two questions: > 1) now with cluster 3.x what is the correct way to update the config and > propagate? ccs_tool command + cman_tool version or what? cman_tool version -r.. will take care to propagate the config for you, using ccs_sync (that?s part of ricci). > 2) I continue to have modclusterd and ricci services that die... how can I > debug? Are they necessary at all in cluster 3.x? File a bugzilla against them, with setup and so on. Fabio From gianluca.cecchi at gmail.com Wed Oct 28 15:44:22 2009 From: gianluca.cecchi at gmail.com (Gianluca Cecchi) Date: Wed, 28 Oct 2009 16:44:22 +0100 Subject: Updating cluster.conf (was Re: [Linux-cluster] ccs_config_validate in cluster 3.0.X) Message-ID: <561c252c0910280844p51156125s7d483c66e7476561@mail.gmail.com> On Wed, 28 Oct 2009 15:27:45 +0100 Fabio M. Di Nitto wrote: > File a bugzilla against them, with setup and so on. > > Fabio Ok For cman error messages: https://bugzilla.redhat.com/show_bug.cgi?id=531489 For modclusterd (posted against cman because it seems modcluster doesn't exist as a component itself): https://bugzilla.redhat.com/show_bug.cgi?id=531491 For ricci: https://bugzilla.redhat.com/show_bug.cgi?id=531495 Bye, Gianluca -------------- next part -------------- An HTML attachment was scrubbed... URL: From pradhanparas at gmail.com Wed Oct 28 16:24:53 2009 From: pradhanparas at gmail.com (Paras pradhan) Date: Wed, 28 Oct 2009 11:24:53 -0500 Subject: [Linux-cluster] network issues Message-ID: <8b711df40910280924h6947c478l6a451c0e866c73c0@mail.gmail.com> >From a last couple of weeks we are having some network maintenance and upgrades in our data center. My xen cluster is in it. I have 3 nodes with DRAC fencing. So when ever there is a network problem , my cluster nodes try to fence each other and after the reboot of the nodes, I can see a mess ( no shared storage, no vms running etc etc). This is a test cluster so till now this is not a issue, but when I will have a production cluster, is there any way to handle the situation? I know if we do not have network problem i will not see this issue, but sometimes there are upgrade, maintenance etc etc. Any thoughts? Thanks ! Paras. From linux at alteeve.com Wed Oct 28 18:53:24 2009 From: linux at alteeve.com (Madison Kelly) Date: Wed, 28 Oct 2009 14:53:24 -0400 Subject: Solved: Re: [Linux-cluster] OT? Odd Xen network issue In-Reply-To: <4AE5EB77.9010602@alteeve.com> References: <4AE5EB77.9010602@alteeve.com> Message-ID: <4AE89324.3060404@alteeve.com> Madison Kelly wrote: > Hi all, > > I hope this isn't too off-topic being more a Xen than cluster > experience, but I am having trouble with the Xen mailing list and I > gather many here use Xen. If it is too off-topic, the please have a mod > delete it and kick my butt. :) > > I've got a simple two-node cluster, each with three NICs running CentOS > 5.3 x86_64. Each node has this configuration: > > eth0 - Internal network and VM back channel (IPMI on this NIC) > eth1 - DRBD link > eth2 - Internet-facing NIC, dom0 has no IP > > I've built a simple VM that uses 'xenbr0' and 'xenbr2' as the VM's > 'eth0' and 'eth1', respectively. 
I've set a static, public IP to domU's > eth1 (xenbr2) but I can't ping it from a workstation nor can the domU > ping out. I know that the network itself (outside the server) is setup > right as I tested the static IP from that cable earlier on a test machine. > > Could someone hit me with a clue-stick as to what I might be doing > wrong? Let me know if any logs or config files are helpful. > > Thanks! > > Madi Thanks to Rick and Edson for your help! I'll reply to each in a moment but I wanted to make this post to say that the problem were buggy D-Link NICs... Replaced them with Intel NICs and everything started working well. Thanks both! Madi From linux at alteeve.com Wed Oct 28 18:55:44 2009 From: linux at alteeve.com (Madison Kelly) Date: Wed, 28 Oct 2009 14:55:44 -0400 Subject: [Linux-cluster] OT? Odd Xen network issue In-Reply-To: <4AE60E2A.1060705@nerd.com> References: <4AE5EB77.9010602@alteeve.com> <4AE60E2A.1060705@nerd.com> Message-ID: <4AE893B0.2080502@alteeve.com> Rick Stevens wrote: > Madison Kelly wrote: >> Hi all, >> >> I hope this isn't too off-topic being more a Xen than cluster >> experience, but I am having trouble with the Xen mailing list and I >> gather many here use Xen. If it is too off-topic, the please have a >> mod delete it and kick my butt. :) >> >> I've got a simple two-node cluster, each with three NICs running >> CentOS 5.3 x86_64. Each node has this configuration: >> >> eth0 - Internal network and VM back channel (IPMI on this NIC) >> eth1 - DRBD link >> eth2 - Internet-facing NIC, dom0 has no IP >> >> I've built a simple VM that uses 'xenbr0' and 'xenbr2' as the VM's >> 'eth0' and 'eth1', respectively. I've set a static, public IP to >> domU's eth1 (xenbr2) but I can't ping it from a workstation nor can >> the domU ping out. I know that the network itself (outside the server) >> is setup right as I tested the static IP from that cable earlier on a >> test machine. >> >> Could someone hit me with a clue-stick as to what I might be doing >> wrong? Let me know if any logs or config files are helpful. > > First, see if you can ping dom0's NICs? If not, that's the first thing > to address. > > Next, are the domUs NICs up? ("ifconfig -a" will tell you). If not, > bring them up. Remember, you might have a fight if the domUs use > Gnome's NetworkManager. Make sure it's disabled and that you have the > classic configuration set up in the domUs correctly. > > Verify that first, and we'll move on from there. You might also want > to have a look at this: > > http://wiki.xensource.com/xenwiki/XenNetworking > > It might help explain things a bit or give you some other ideas. Thanks for the reply, Rick! I ran tcpdump on dom0 against the VM's vifU.1, against xenbr0 and even against vif0.2. In all cases I could see by ping requests, but as soon as I ran it against peth2 I'd get nothing. Turns out that the D-Link NIC had issues with 64bit that I somehow missed when I first built and tested the servers. New Intel NICs sorted it out right away. Madi From linux at alteeve.com Wed Oct 28 19:08:51 2009 From: linux at alteeve.com (Madison Kelly) Date: Wed, 28 Oct 2009 15:08:51 -0400 Subject: [Linux-cluster] OT? 
Odd Xen network issue In-Reply-To: <2fc5f090910261853x45304dbemfe7fc78fb689a35d@mail.gmail.com> References: <4AE5EB77.9010602@alteeve.com> <2fc5f090910261853x45304dbemfe7fc78fb689a35d@mail.gmail.com> Message-ID: <4AE896C3.9050904@alteeve.com> Edson Marquezani Filho wrote: > On Mon, Oct 26, 2009 at 16:33, Madison Kelly wrote: >> Hi all, >> >> I hope this isn't too off-topic being more a Xen than cluster experience, >> but I am having trouble with the Xen mailing list and I gather many here use >> Xen. If it is too off-topic, the please have a mod delete it and kick my >> butt. :) >> >> I've got a simple two-node cluster, each with three NICs running CentOS 5.3 >> x86_64. Each node has this configuration: >> >> eth0 - Internal network and VM back channel (IPMI on this NIC) >> eth1 - DRBD link >> eth2 - Internet-facing NIC, dom0 has no IP > > I have a very similar setup here. So far, it seems like a pretty decent low-cost, high availability setup. :) >> I've built a simple VM that uses 'xenbr0' and 'xenbr2' as the VM's 'eth0' >> and 'eth1', respectively. I've set a static, public IP to domU's eth1 >> (xenbr2) but I can't ping it from a workstation nor can the domU ping out. I >> know that the network itself (outside the server) is setup right as I tested >> the static IP from that cable earlier on a test machine. > > This sounds to me like being related only to routing stuff, rather than Xen. > I don't know if what I'm going to say is already implicit in what you > said, or very obvious to you, but to reach that public IP address in > question, the workstation from where you are trying to do this must to > have a route that leads to this IP, usually a gateway address. > Try thinking about the path the packet is following across the > network, and if there is routing rules enough to reach its destiny. > > Even from dom0, you are subjected to the same rules. For example: to > ping some public IP address assigned to a domU's eth1, you should have > either a IP of the same network assigned to your physical eth2/peth2 > (or a virtual ethX interface atached on xenbr2 bridge), or routes that > leads your packet from your dom0 to that domU's IP. Before I began work on Xen I had assigned the public IPs to eth2 and it had worked. For some reason, Xen + 64bit + D-Link DGE-560T = no luck. I took a cue from this email to check traffic on xenbr2 on dom0 and I could see traffic while debugging. As I mentioned to Rick, I could see the ICMP requests up to vif0.2, but not on peth2. >> Could someone hit me with a clue-stick as to what I might be doing wrong? >> Let me know if any logs or config files are helpful. > > I'm not sure if I understood your scenario correctly, and I don't know > if I could be clear enough with this bad english, but at least, I > tried. =) Pfft, your English is awesome. :) >> Thanks! > > I hope this can help a bit. Good luck. =) It did. Once I saw it hit vif0.2 I knew that bridging was working and that it was likely a hardware/driver bug. Thanks! Madi From jakov.sosic at srce.hr Wed Oct 28 20:06:52 2009 From: jakov.sosic at srce.hr (Jakov Sosic) Date: Wed, 28 Oct 2009 21:06:52 +0100 Subject: [Linux-cluster] network issues In-Reply-To: <8b711df40910280924h6947c478l6a451c0e866c73c0@mail.gmail.com> References: <8b711df40910280924h6947c478l6a451c0e866c73c0@mail.gmail.com> Message-ID: <20091028210652.4ec83952@nb-jsosic> On Wed, 28 Oct 2009 11:24:53 -0500 Paras pradhan wrote: > From a last couple of weeks we are having some network maintenance > and upgrades in our data center. 
My xen cluster is in it. I have 3 > nodes with DRAC fencing. So when ever there is a network problem , my > cluster nodes try to fence each other and after the reboot of the > nodes, I can see a mess ( no shared storage, no vms running etc etc). > This is a test cluster so till now this is not a issue, but when I > will have a production cluster, is there any way to handle the > situation? I know if we do not have network problem i will not see > this issue, but sometimes there are upgrade, maintenance etc etc. > > Any thoughts? It's obvious that you have (at least) one SPOF in your setup. You must eliminate all SPOF's if you want true HA solution. Buy two independent low cost switches, trunk them, and connect all your cluster nodes to them. Set up LACP across switches, bond ethernet interfaces on your nodes, and finally connect your network to one of the switches. Then, when your network crew upgrades your equipment, your cluster will remain operating. -- | Jakov Sosic | ICQ: 28410271 | PGP: 0x965CAE2D | ================================================================= | start fighting cancer -> http://www.worldcommunitygrid.org/ | From travellig at gmail.com Wed Oct 28 20:10:41 2009 From: travellig at gmail.com (travellig travellig) Date: Wed, 28 Oct 2009 20:10:41 +0000 Subject: [Linux-cluster] Oracle EBS in HA mode anyone? Message-ID: <694487210910281310r3350b4e4v92164934ab6933f3@mail.gmail.com> Hi, Has anyone successfully configured Oracle EBS r12 in HA mode on RHE5? It is a realisticly speaking possible using RHCS? I have seen that Open HA Cluster from Open Solaris already implement an ha-ebs agent. Look forward to your success or non stories. Best regards, -- travellig. -------------- next part -------------- An HTML attachment was scrubbed... URL: From branimirp at gmail.com Wed Oct 28 21:08:41 2009 From: branimirp at gmail.com (Branimir) Date: Wed, 28 Oct 2009 22:08:41 +0100 Subject: [Linux-cluster] A home-grown cluster Message-ID: <4AE8B2D9.5010407@gmail.com> Hi list ;) Well, here is my problem. I configured a few productions clusters myself - mostly HP Proliant machines with ILO/ILO2. Now, I would like to do the same thing but with ordinary PC hardware (the fact is my wife wants me to reduce the number of my physical machines ;)). I have three more/less same PCs, all running CentOS 5.4 with Xen hypervisor. One of them would be a storage with tgt to handle iSCSI targets. To be honest, and what bothers me the most is that I don't have a clue what to use as a fence device? I took ILO for granted, but for me this is completly new. Can you offer some advice? Thanks in advance! Cheers! Branimir From gordan at bobich.net Wed Oct 28 23:04:39 2009 From: gordan at bobich.net (Gordan Bobic) Date: Wed, 28 Oct 2009 23:04:39 +0000 Subject: [Linux-cluster] A home-grown cluster Message-ID: <20091028230435.CFEB3F87C4@sentinel1.shatteredsilicon.net> If you are running the cluster for purely test purposes, put all the nodes as VMs on one physical box and use the fencing agent for that vm solution. If you really need physical fencing, you can go a number if ways: 1) Raritan eRIC remote console cards. I wrote a fencing agent for them, which you can find if you google for fence_eric, or look in RH bugzilla or in cluster-devel archives. It may have made it into FC upstream head, not sure. Pros: OOB remote console Cons: expensive 2) APC UPS with network control. 
Pros: you get a UPS Cons: still expensive 3) Network power bar Pros: as cheap as it's likely to get without DIY-ing the hardware Cons: you may have to write your own fencing agent 4) If you are _really_ on a budget, you can build your own for pennies by wiring a relay into the power switch connector on the motherboard and driving it off a something like a dtr line on a rs232 port (if your machines have such things). Pros: Cost measured in pennies Cons: Electronics DIY-ing required, plus writing a fencing agent. It comes down to your budget vs. getting your hands dirty. HTH Gordan -----Original Message----- From: Branimir Sent: 28 October 2009 21:08 To: linux-cluster at redhat.com Subject: [Linux-cluster] A home-grown cluster Hi list ;) Well, here is my problem. I configured a few productions clusters myself - mostly HP Proliant machines with ILO/ILO2. Now, I would like to do the same thing but with ordinary PC hardware (the fact is my wife wants me to reduce the number of my physical machines ;)). I have three more/less same PCs, all running CentOS 5.4 with Xen hypervisor. One of them would be a storage with tgt to handle iSCSI targets. To be honest, and what bothers me the most is that I don't have a clue what to use as a fence device? I took ILO for granted, but for me this is completly new. Can you offer some advice? Thanks in advance! Cheers! Branimir -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From linux at alteeve.com Wed Oct 28 23:44:53 2009 From: linux at alteeve.com (Madison Kelly) Date: Wed, 28 Oct 2009 19:44:53 -0400 Subject: [Linux-cluster] A home-grown cluster In-Reply-To: <4AE8B2D9.5010407@gmail.com> References: <4AE8B2D9.5010407@gmail.com> Message-ID: <4AE8D775.5020605@alteeve.com> Branimir wrote: > Hi list ;) > > Well, here is my problem. I configured a few productions clusters myself > - mostly HP Proliant machines with ILO/ILO2. Now, I would like to do the > same thing but with ordinary PC hardware (the fact is my wife wants me > to reduce the number of my physical machines ;)). I have three more/less > same PCs, all running CentOS 5.4 with Xen hypervisor. One of them would > be a storage with tgt to handle iSCSI targets. > > To be honest, and what bothers me the most is that I don't have a clue > what to use as a fence device? I took ILO for granted, but for me this > is completly new. Can you offer some advice? > > Thanks in advance! > > Cheers! > > Branimir You know, I've been wondering the same thing now for a little while and had figured it just wasn't possible. I had seen a design a couple of years ago for a serial to reset switch home-brew fence device but I've not been able to find it since. The biggest thing about it was that the author mentioned that, on POST, the serial port would run high and trigger a reset so s/he designed a catch (a cap?) that would put a small enough delay to prevent this POST->fence issue. Personally, I've been thinking of a way to make an inexpensive USB -> reset switch, maybe even an addressable one so that I could have a hub of sorts for a 3+ nodes cluster. Back to your question; I don't know of a canned solution, but would be quite interested if you found one. If you find an answer off list, do please update this thread! Best of luck! 
Madi From brahadambal at gmail.com Thu Oct 29 08:24:01 2009 From: brahadambal at gmail.com (Brahadambal Srinivasan) Date: Thu, 29 Oct 2009 01:24:01 -0700 (PDT) Subject: [Linux-cluster] Let's stay in touch on LinkedIn Message-ID: <643095224.115887.1256804641029.JavaMail.app@ech3-cdn07.prod> LinkedIn ------------ I'd like to add you to my professional network on LinkedIn. - Brahadambal Confirm that you know Brahadambal Srinivasan https://www.linkedin.com/e/isd/827820032/0aLzFON6/ Every day, millions of professionals like Brahadambal Srinivasan use LinkedIn to connect with colleagues, find experts, and explore opportunities. ------ (c) 2009, LinkedIn Corporation -------------- next part -------------- An HTML attachment was scrubbed... URL: From kkovachev at varna.net Thu Oct 29 08:57:25 2009 From: kkovachev at varna.net (Kaloyan Kovachev) Date: Thu, 29 Oct 2009 10:57:25 +0200 Subject: [Linux-cluster] A home-grown cluster In-Reply-To: <4AE8D775.5020605@alteeve.com> References: <4AE8B2D9.5010407@gmail.com> <4AE8D775.5020605@alteeve.com> Message-ID: <20091029084023.M898@varna.net> On Wed, 28 Oct 2009 19:44:53 -0400, Madison Kelly wrote > Branimir wrote: > > Hi list ;) > > > > Well, here is my problem. I configured a few productions clusters myself > > - mostly HP Proliant machines with ILO/ILO2. Now, I would like to do the > > same thing but with ordinary PC hardware (the fact is my wife wants me > > to reduce the number of my physical machines ;)). I have three more/less > > same PCs, all running CentOS 5.4 with Xen hypervisor. One of them would > > be a storage with tgt to handle iSCSI targets. > > > > To be honest, and what bothers me the most is that I don't have a clue > > what to use as a fence device? I took ILO for granted, but for me this > > is completly new. Can you offer some advice? > > > > Thanks in advance! > > > > Cheers! > > > > Branimir > > You know, I've been wondering the same thing now for a little while and > had figured it just wasn't possible. I had seen a design a couple of > years ago for a serial to reset switch home-brew fence device but I've > not been able to find it since. The biggest thing about it was that the > author mentioned that, on POST, the serial port would run high and > trigger a reset so s/he designed a catch (a cap?) that would put a small > enough delay to prevent this POST->fence issue. Personally, I've been > thinking of a way to make an inexpensive USB -> reset switch, maybe even > an addressable one so that I could have a hub of sorts for a 3+ nodes > cluster. > > Back to your question; I don't know of a canned solution, but would be > quite interested if you found one. If you find an answer off list, do > please update this thread! > Well it is not great solution, but a really cheep one: http://lan.neomontana-bg.com/picoip.php - this device have 16 Out ports on which you can attach up to 4 boards with 4 relays http://lan.neomontana-bg.com/RelayBoard.php The device have also 8 In ports, web interface and may be controlled with snmp for which a fence module can be easily made translated versions from google: http://translate.google.com/translate?prev=hp&hl=en&js=y&u=http%3A%2F%2Flan.neomontana-bg.com%2Fpicoip.php&sl=bg&tl=en&history_state0= http://translate.google.com/translate?prev=hp&hl=en&js=y&u=http%3A%2F%2Flan.neomontana-bg.com%2FRelayBoard.php&sl=bg&tl=en&history_state0= > Best of luck! 
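Following up on the SNMP-controlled relay board idea above: fenced hands a fence agent its parameters as name=value lines on standard input and expects a zero exit status on success, so a home-grown agent can stay quite small. The sketch below is only an illustration - the OID, community string and attribute names (ipaddr, port, community, action) are placeholders that would have to mirror whatever is declared on the fencedevice entry in cluster.conf, and a real agent also needs metadata, a status action and proper error handling:

#!/bin/bash
# Rough sketch of a fence agent for an SNMP-controlled relay board.
# All device-specific values here are assumptions for illustration only.

action="reboot"
# fenced passes the device/node attributes as key=value pairs on stdin
while read -r line; do
    case "$line" in
        ipaddr=*)    ipaddr="${line#ipaddr=}" ;;
        port=*)      port="${line#port=}" ;;        # which relay/outlet this node is wired to
        community=*) community="${line#community=}" ;;
        action=*)    action="${line#action=}" ;;
        option=*)    action="${line#option=}" ;;    # older fenced versions use "option"
    esac
done

set_outlet() {
    # placeholder OID - take the real one from the device documentation
    snmpset -v1 -c "${community:-private}" "$ipaddr" ".1.3.6.1.4.1.99999.1.$port" i "$1"
}

case "$action" in
    off)    set_outlet 0 ;;
    on)     set_outlet 1 ;;
    reboot) set_outlet 0 && sleep 5 && set_outlet 1 ;;
    *)      echo "unsupported action: $action" >&2; exit 1 ;;
esac
exit $?

It would then be referenced from cluster.conf like any other fence agent.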
> > Madi > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From alan.zg at gmail.com Thu Oct 29 15:23:49 2009 From: alan.zg at gmail.com (Alan A) Date: Thu, 29 Oct 2009 10:23:49 -0500 Subject: [Linux-cluster] Your Network Bonding Mode On a Cluster Message-ID: What bonding mode do you have on your cluster? I tried mode 5 and 6, and I got some dropped packets, then I switched to mode1, and still got dropped packets. Here is modprobe.conf: alias bond0 bonding #options bond0 mode=5 miimon=100 use_carrier=0 options bond0 mode=1 miimon=100 use_carrier=0 alias eth0 bnx2 #alias eth1 bnx2 alias eth2 bnx2 #alias eth3 bnx2 alias scsi_hostadapter cciss alias scsi_hostadapter1 sata_svw alias scsi_hostadapter2 lpfc alias scsi_hostadapter3 usb-storage -- Alan A. -------------- next part -------------- An HTML attachment was scrubbed... URL: From lpleiman at redhat.com Thu Oct 29 15:27:05 2009 From: lpleiman at redhat.com (Leo Pleiman) Date: Thu, 29 Oct 2009 11:27:05 -0400 (EDT) Subject: [Linux-cluster] Your Network Bonding Mode On a Cluster In-Reply-To: Message-ID: <372777935.519351256830025739.JavaMail.root@zmail07.collab.prod.int.phx2.redhat.com> Try mode 4, it creates an active-active connection that is supported by most enterprise switches without any additional configuration. Leo J Pleiman Senior Consultant Red Hat Consulting Services 410.688.3873 "Red Hat Ranked as #1 Software Vendor for Fifth Time in CIO Insight Study" ----- "Alan A" wrote: > What bonding mode do you have on your cluster? I tried mode 5 and 6, and I got some dropped packets, then I switched to mode1, and still got dropped packets. Here is modprobe.conf: > > alias bond0 bonding > #options bond0 mode=5 miimon=100 use_carrier=0 > options bond0 mode=1 miimon=100 use_carrier=0 > > alias eth0 bnx2 > #alias eth1 bnx2 > alias eth2 bnx2 > #alias eth3 bnx2 > alias scsi_hostadapter cciss > alias scsi_hostadapter1 sata_svw > alias scsi_hostadapter2 lpfc > alias scsi_hostadapter3 usb-storage > > -- > Alan A. > > -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: From alan.zg at gmail.com Thu Oct 29 16:47:17 2009 From: alan.zg at gmail.com (Alan A) Date: Thu, 29 Oct 2009 11:47:17 -0500 Subject: [Linux-cluster] A home-grown cluster In-Reply-To: <20091029084023.M898@varna.net> References: <4AE8B2D9.5010407@gmail.com> <4AE8D775.5020605@alteeve.com> <20091029084023.M898@varna.net> Message-ID: What about manual fencing / w SCSI fencing. That way if you share a storage device, it will support GFS, and worse case scenario, you manually reboot the manually fenced node. Good fence mechanism is just a "quick recovery" solution. Alan A. -------------- next part -------------- An HTML attachment was scrubbed... URL: From kkovachev at varna.net Thu Oct 29 17:13:04 2009 From: kkovachev at varna.net (Kaloyan Kovachev) Date: Thu, 29 Oct 2009 19:13:04 +0200 Subject: [Linux-cluster] post fail fencing Message-ID: <20091029165439.M98275@varna.net> Hello, i would like to have one specific node to always fence any other failed node and some nodes to never try to fence. For example in 4 or 5 nodes cluster: Node1 is fencing any other failed, Node2 and Node3 will try fencing some time later (in case the failed node is Node1) and Node4/Node5 should never try to fence others What is the possible/recommended way to do this? 
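Back to the bonding-mode question earlier in the thread: a hedged example of the two usual choices, in the same modprobe.conf style as the excerpt quoted above (mode numbers per the kernel bonding documentation):

# active-backup: needs no switch cooperation, the conservative choice for a cluster interconnect
options bond0 mode=1 miimon=100
# 802.3ad aggregation: active-active, but the switch ports must be configured as an LACP group
# options bond0 mode=4 miimon=100

Two caveats: mode 4 normally does require matching LACP setup on the switch (modes 5 and 6, balance-tlb/alb, are the ones that need nothing from the switch), and use_carrier=0 falls back to the older MII ioctls, so unless the driver requires it the default use_carrier=1 is usually the better starting point.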
From teigland at redhat.com Fri Oct 30 15:03:41 2009 From: teigland at redhat.com (David Teigland) Date: Fri, 30 Oct 2009 10:03:41 -0500 Subject: [Linux-cluster] post fail fencing In-Reply-To: <20091029165439.M98275@varna.net> References: <20091029165439.M98275@varna.net> Message-ID: <20091030150341.GC6484@redhat.com> On Thu, Oct 29, 2009 at 07:13:04PM +0200, Kaloyan Kovachev wrote: > Hello, > i would like to have one specific node to always fence any other failed node > and some nodes to never try to fence. For example in 4 or 5 nodes cluster: > Node1 is fencing any other failed, Node2 and Node3 will try fencing some time > later (in case the failed node is Node1) and Node4/Node5 should never try to > fence others > > What is the possible/recommended way to do this? The node with the lowest nodeid that was present in the last completed configuration will do the fencing. You may be able to exploit that fact to get what you want, but there's no explicit control over it. Dave From kkovachev at varna.net Fri Oct 30 15:50:46 2009 From: kkovachev at varna.net (Kaloyan Kovachev) Date: Fri, 30 Oct 2009 17:50:46 +0200 Subject: [Linux-cluster] post fail fencing In-Reply-To: <20091030150341.GC6484@redhat.com> References: <20091029165439.M98275@varna.net> <20091030150341.GC6484@redhat.com> Message-ID: <20091030153650.M61885@varna.net> On Fri, 30 Oct 2009 10:03:41 -0500, David Teigland wrote > On Thu, Oct 29, 2009 at 07:13:04PM +0200, Kaloyan Kovachev wrote: > > Hello, > > i would like to have one specific node to always fence any other failed node > > and some nodes to never try to fence. For example in 4 or 5 nodes cluster: > > Node1 is fencing any other failed, Node2 and Node3 will try fencing some time > > later (in case the failed node is Node1) and Node4/Node5 should never try to > > fence others > > > > What is the possible/recommended way to do this? > > The node with the lowest nodeid that was present in the last completed > configuration will do the fencing. You may be able to exploit that fact > to get what you want, but there's no explicit control over it. > I was thinking about adding different post fail delay in each node's config. Is it possible or it will cause problems? > Dave > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From agx at sigxcpu.org Fri Oct 30 16:01:19 2009 From: agx at sigxcpu.org (Guido =?iso-8859-1?Q?G=FCnther?=) Date: Fri, 30 Oct 2009 17:01:19 +0100 Subject: [Linux-cluster] ccs_config_validate in cluster 3.0.X In-Reply-To: <4AE81EAE.3040604@redhat.com> References: <4AE81EAE.3040604@redhat.com> Message-ID: <20091030160119.GA21200@bogon.sigxcpu.org> On Wed, Oct 28, 2009 at 11:36:30AM +0100, Fabio M. Di Nitto wrote: > Hi everybody, > > as briefly mentioned in 3.0.4 release note, a new system to validate the > configuration has been enabled in the code. > > What it does > ------------ > > The general idea is to be able to perform as many sanity checks on the > configuration as possible. This check allows us to spot the most common > mistakes, such as typos or possibly invalid values, in cluster.conf. This is great. For what it's worth: I've pushed Cluster 3.0.4 into Debian experimental a couple of days ago. 
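Tying back to the fencing-order explanation above: membership and node IDs are easy to inspect, which makes it possible to predict which member will carry out fencing. A small sketch using commands that also appear elsewhere in this thread:

cman_tool nodes     # first column is the nodeid; per the reply above, the lowest live member coordinates fencing
cman_tool status    # shows the local node's id and the current membership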
Cheers, -- Guido From lhh at redhat.com Fri Oct 30 16:40:15 2009 From: lhh at redhat.com (Lon Hohberger) Date: Fri, 30 Oct 2009 12:40:15 -0400 Subject: [Linux-cluster] A home-grown cluster In-Reply-To: <4AE8D775.5020605@alteeve.com> References: <4AE8B2D9.5010407@gmail.com> <4AE8D775.5020605@alteeve.com> Message-ID: <1256920815.5025.0.camel@localhost.localdomain> On Wed, 2009-10-28 at 19:44 -0400, Madison Kelly wrote: > You know, I've been wondering the same thing now for a little while and > had figured it just wasn't possible. I had seen a design a couple of > years ago for a serial to reset switch home-brew fence device but I've > not been able to find it since. The biggest thing about it was that the > author mentioned that, on POST, the serial port would run high and > trigger a reset so s/he designed a catch (a cap?) that would put a small > enough delay to prevent this POST->fence issue. Personally, I've been > thinking of a way to make an inexpensive USB -> reset switch, maybe even > an addressable one so that I could have a hub of sorts for a 3+ nodes > cluster. > > Back to your question; I don't know of a canned solution, but would be > quite interested if you found one. If you find an answer off list, do > please update this thread! WTI NPS115 on eBay? They're not supported by WTI anymore, and can be had for a song. -- Lon From lhh at redhat.com Fri Oct 30 16:52:49 2009 From: lhh at redhat.com (Lon Hohberger) Date: Fri, 30 Oct 2009 12:52:49 -0400 Subject: [Linux-cluster] service state unchanged when host crashes In-Reply-To: <20091028145404.0d48409c@nb-jsosic> References: <20091027103823.4f23993d@nb-jsosic> <20091028110337.2a6d23b5@nb-jsosic> <1256725200.6265.14.camel@ubik> <20091028145404.0d48409c@nb-jsosic> Message-ID: <1256921569.5025.4.camel@localhost.localdomain> On Wed, 2009-10-28 at 14:54 +0100, Jakov Sosic wrote: > On Wed, 28 Oct 2009 11:10:07 -0000 > "Martin Waite" wrote: > > > Given the nature of the bug, does this mean that the unpatched > > cluster code is unable to relocate services in the event of node > > failure ? > > Yes, that means exactly that. > > > > If so, how can anyone be using this ? > > Well, they can't :) As I've mentioned earlier, I've switched to CentOS > after I found out this bug. You can rebuild your own debian package > with provided patch, or you can try to convince debian package > maintainers to include patch into distribution... FYI, this was fixed in March: http://git.fedorahosted.org/git/?p=cluster.git;a=commit;h=aee97b180e80c9f8b90b8fca63004afe3b289962 -- Lon From lhh at redhat.com Fri Oct 30 16:54:07 2009 From: lhh at redhat.com (Lon Hohberger) Date: Fri, 30 Oct 2009 12:54:07 -0400 Subject: [Linux-cluster] service state unchanged when host crashes In-Reply-To: References: <20091027103823.4f23993d@nb-jsosic> <20091028110337.2a6d23b5@nb-jsosic> <1256725200.6265.14.camel@ubik> Message-ID: <1256921647.5025.5.camel@localhost.localdomain> On Wed, 2009-10-28 at 10:24 +0000, Martin Waite wrote: > Hi, > > This does look like the same problem: > > martin at clusternode27:~$ sudo /usr/sbin/cman_tool -f nodes > Node Sts Inc Joined Name > 27 M 44 2009-10-27 14:59:33 clusternode27 > 28 M 52 2009-10-27 14:59:49 clusternode28 > 30 X 64 clusternode30 > Node has not been fenced since it went down > > Even though the message in syslog claims that the fencing succeeded. Your logs said fencing failed? 
Oct 26 18:30:02 clusternode27 fenced[12749]: fencing node "clusternode30" Oct 26 18:30:02 clusternode27 fenced[12749]: fence "clusternode30" failed You can run: fence_ack_manual -e -n clusternode30 ...to override failed fencing. -- Lon From lhh at redhat.com Fri Oct 30 16:58:20 2009 From: lhh at redhat.com (Lon Hohberger) Date: Fri, 30 Oct 2009 12:58:20 -0400 Subject: [Linux-cluster] some questions about rgmanager In-Reply-To: References: <29ae894c0910231120y21cb6ed1q2a106e359b0dc399@mail.gmail.com> Message-ID: <1256921900.5025.6.camel@localhost.localdomain> On Mon, 2009-10-26 at 11:05 +0000, Martin Waite wrote: > Hi Brem, > > Thanks for the pointers. > > The link to "OCF RA API Draft" appears to answer my questions. It will take a while to digest all that. Note that rgmanager doesn't implement 'monitor' (uses 'status' instead) as required by OCF and it stores its RAs in /usr/share/cluster (LSB-ish) rather than /etc/ocf/... (OCF-ish). -- Lon From g.bagnoli at asidev.com Fri Oct 30 18:01:31 2009 From: g.bagnoli at asidev.com (Giacomo Bagnoli) Date: Fri, 30 Oct 2009 19:01:31 +0100 Subject: [Linux-cluster] service state unchanged when host crashes In-Reply-To: <1256921569.5025.4.camel@localhost.localdomain> References: <20091027103823.4f23993d@nb-jsosic> <20091028110337.2a6d23b5@nb-jsosic> <1256725200.6265.14.camel@ubik> <20091028145404.0d48409c@nb-jsosic> <1256921569.5025.4.camel@localhost.localdomain> Message-ID: <1256925691.11020.9.camel@waste-bin.cnglab.net> On Fri, 2009-10-30 at 12:52 -0400, Lon Hohberger wrote: > FYI, this was fixed in March: > > http://git.fedorahosted.org/git/?p=cluster.git;a=commit;h=aee97b180e80c9f8b90b8fca63004afe3b289962 > > -- Lon Ups, didn't notice that, I did look at the git log before opening the bug but I must have missed that. Thanks for the pointer. :) Seem that the fix is post - 2.03.11 release, so that's why people on debian (and gentoo) need to apply the patch. -- Giacomo -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 198 bytes Desc: This is a digitally signed message part URL: From lhh at redhat.com Fri Oct 30 18:33:33 2009 From: lhh at redhat.com (Lon Hohberger) Date: Fri, 30 Oct 2009 14:33:33 -0400 Subject: [Linux-cluster] Progress OpenEdge and RHCS In-Reply-To: <4AE80C17.9090506@harazd.net> References: <4AE80C17.9090506@harazd.net> Message-ID: <1256927613.5025.175.camel@localhost.localdomain> On Wed, 2009-10-28 at 10:17 +0100, Bohdan Sydor wrote: > Hi all, > > My customer's goal is to run Progress OpenEdge 10.2A on RHEL 5.4. It > should be a HA service with two nodes and FC-connected shared storage. > I've reviewed the group archive and googled too, but didn't find > anything suitable. Did anyone here configure Progress using RHCS? Was > there anything to be careful about? Red Hat does not provide an OCF file > for Progress, so I'll also be glad for any hints how to monitor the db > status for OpenEdge. > The PostgreSQL 8 agent in 5.4 seems to work with 8.2.x of PostgreSQL, but you might want to take a look at it. I'm not intimately familiar with PostgreSQL, but I was able to get it running pretty easily using the RA recently. 
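To illustrate Lon's point above about rgmanager resource agents: they live in /usr/share/cluster, are called with the action (start, stop, status, meta-data) as their first argument, and report the result through their exit status, with 'status' filling the role OCF gives to 'monitor'. A bare-bones sketch in which the daemon name, path and pidfile are all invented for illustration; the stock agents shipped in /usr/share/cluster are the better reference:

#!/bin/bash
# Minimal rgmanager-style resource agent sketch. mydaemon and its pidfile
# are placeholders, not a real agent.

DAEMON=/usr/sbin/mydaemon
PIDFILE=/var/run/mydaemon.pid

case "$1" in
    start)
        $DAEMON --pidfile "$PIDFILE"
        exit $?
        ;;
    stop)
        [ -f "$PIDFILE" ] && kill "$(cat "$PIDFILE")"
        exit 0
        ;;
    status)
        # rgmanager polls 'status' here, where OCF proper would call 'monitor'
        [ -f "$PIDFILE" ] && kill -0 "$(cat "$PIDFILE")" 2>/dev/null
        exit $?
        ;;
    meta-data)
        # a real agent prints its OCF-style XML metadata here
        exit 0
        ;;
    *)
        exit 1
        ;;
esac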
-- Lon From lhh at redhat.com Fri Oct 30 19:22:37 2009 From: lhh at redhat.com (Lon Hohberger) Date: Fri, 30 Oct 2009 15:22:37 -0400 Subject: [Linux-cluster] Progress OpenEdge and RHCS In-Reply-To: <1256927613.5025.175.camel@localhost.localdomain> References: <4AE80C17.9090506@harazd.net> <1256927613.5025.175.camel@localhost.localdomain> Message-ID: <1256930557.2367.1.camel@localhost.localdomain> On Fri, 2009-10-30 at 14:33 -0400, Lon Hohberger wrote: > On Wed, 2009-10-28 at 10:17 +0100, Bohdan Sydor wrote: > > Hi all, > > > > My customer's goal is to run Progress OpenEdge 10.2A on RHEL 5.4. It > > should be a HA service with two nodes and FC-connected shared storage. > > I've reviewed the group archive and googled too, but didn't find > > anything suitable. Did anyone here configure Progress using RHCS? Was > > there anything to be careful about? Red Hat does not provide an OCF file > > for Progress, so I'll also be glad for any hints how to monitor the db > > status for OpenEdge. > > > > The PostgreSQL 8 agent in 5.4 seems to work with 8.2.x of PostgreSQL, > but you might want to take a look at it. > > I'm not intimately familiar with PostgreSQL, but I was able to get it > running pretty easily using the RA recently. Oops. ;) Progress != Postgres -- Lon From gordon.k.miller at boeing.com Fri Oct 30 19:28:00 2009 From: gordon.k.miller at boeing.com (Miller, Gordon K) Date: Fri, 30 Oct 2009 12:28:00 -0700 Subject: [Linux-cluster] GS2 try_rgrp_unlink consuming lots of CPU In-Reply-To: <1256637426.2669.26.camel@localhost.localdomain> References: <1256568389.6052.676.camel@localhost.localdomain> <1256637426.2669.26.camel@localhost.localdomain> Message-ID: Hi, We are still struggling with the problem of try_rgrp_unlink consuming large amounts of CPU time over durations exceeding 15 minutes. We see several threads on the same node repeatedly calling try_rgrp_unlink with the same rgrp and the same group of inodes being retuned over and over until 15 minutes later one of the calls fails to find an inode and returns zero causing GFS2_RDF_CHECK in rd_flags to be cleared and thus stopping the calls to try_rgrp_unlink. We have not determined what triggers this behavior. What causes the GFS2_RDF_CHECK flag to get set? Are there any locking issues with the rgrp? Our kernel version is 2.6.18-128.7.1 Gordon -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Steven Whitehouse Sent: Tuesday, October 27, 2009 2:57 AM To: linux clustering Subject: RE: [Linux-cluster] GS2 try_rgrp_unlink consuming lots of CPU Hi, On Mon, 2009-10-26 at 15:47 -0700, Miller, Gordon K wrote: > When making our GFS2 filesystems we are using default values with the exception of the journal size which we have set to 16MB. Our resource groups are 443 MB in size for this filesystem. > > I do not believe that we have the case of unlinking inodes from one node while it is still open on another. > > Under what conditions would try_rgrp_unlink return the same inode when called repeatedly in a short time frame as seen in the original problem description? I am unable to correlate any call to gfs2_unlink on any node in the cluster with the inodes that try_rgrp_unlink is returning. > > Gordon > It depends which kernel version you have. In earlier kernels it tried to deallocate inodes in an rgrp only once for each mount of the filesystem. That proved to cause a problem for some configurations where we were not aggressive enough in reclaiming free space. 
As a result, the algorithm was updated to scan more often. However in both cases, it was designed to always make progress and not continue to rescan the same inode, so something very odd is going on. The only reason that an inode would be repeatedly scanned is that it has been unlinked somewhere (since the scanning is looking only for unlinked inodes) and cannot be deallocated for some reason (i.e. still in use) and thus is still there when the next scan comes along. Steve. -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From allen at isye.gatech.edu Fri Oct 30 23:27:23 2009 From: allen at isye.gatech.edu (Allen Belletti) Date: Fri, 30 Oct 2009 19:27:23 -0400 Subject: [Linux-cluster] GFS2 processes getting stuck in WCHAN=dlm_posix_lock Message-ID: <4AEB765B.3010408@isye.gatech.edu> Hi All, As I've mentioned before, I'm running a two-node clustered mail server on GFS2 (with RHEL 5.4) Nearly all of the time, everything works great. However, going all the way back to GFS1 on RHEL 5.1 (I think it was), I've had occasional locking problems that force a reboot of one or both cluster nodes. Lately I've paid closer attention since it's been happening more often. I'll notice the problem when the load average starts rising. It's always tied to "stuck" processes, and I believe always tied to IMAP clients (I'm running Dovecot.) It seems like a file belonging to user "x" (in this case, "jforrest" will become locked in some way, such that every IMAP process tied that user will get stuck on the same thing. Over time, as the user keeps trying to read that file, more & more processes accumulate. They're always in state "D" (uninterruptible sleep), and always on "dlm_posix_lock" according to WCHAN. The only way I'm able to get out of this state is to reboot. If I let it persist for too long, I/O generally stops entirely. This certainly seems like it ought to have a definite solution, but I've no idea what it is. I've tried a variety of things using "find" to pinpoint a particular file, but everything belonging to the affected user seems just fine. At least, I can read and copy all of the files, and do a stat via ls -l. Is it possible that this is a bug, not within GFS at all, but within Dovecot IMAP? Any thoughts would be appreciated. It's been getting worse lately and thus no fun at all. Cheers, Allen From dhoffutt at gmail.com Sat Oct 31 04:04:45 2009 From: dhoffutt at gmail.com (Dustin Henry Offutt) Date: Fri, 30 Oct 2009 23:04:45 -0500 Subject: [Linux-cluster] GFS2 processes getting stuck in WCHAN=dlm_posix_lock In-Reply-To: <4AEB765B.3010408@isye.gatech.edu> References: <4AEB765B.3010408@isye.gatech.edu> Message-ID: This sounds like a memory problem from the mail app or OS that runs into the cluster software. Trace running memory heaps in the dump. On Fri, Oct 30, 2009 at 6:27 PM, Allen Belletti wrote: > Hi All, > > As I've mentioned before, I'm running a two-node clustered mail server on > GFS2 (with RHEL 5.4) Nearly all of the time, everything works great. > However, going all the way back to GFS1 on RHEL 5.1 (I think it was), I've > had occasional locking problems that force a reboot of one or both cluster > nodes. Lately I've paid closer attention since it's been happening more > often. > > I'll notice the problem when the load average starts rising. It's always > tied to "stuck" processes, and I believe always tied to IMAP clients (I'm > running Dovecot.) 
It seems like a file belonging to user "x" (in this case, > "jforrest" will become locked in some way, such that every IMAP process tied > that user will get stuck on the same thing. Over time, as the user keeps > trying to read that file, more & more processes accumulate. They're always > in state "D" (uninterruptible sleep), and always on "dlm_posix_lock" > according to WCHAN. The only way I'm able to get out of this state is to > reboot. If I let it persist for too long, I/O generally stops entirely. > > This certainly seems like it ought to have a definite solution, but I've no > idea what it is. I've tried a variety of things using "find" to pinpoint a > particular file, but everything belonging to the affected user seems just > fine. At least, I can read and copy all of the files, and do a stat via ls > -l. > > Is it possible that this is a bug, not within GFS at all, but within > Dovecot IMAP? > > Any thoughts would be appreciated. It's been getting worse lately and thus > no fun at all. > > Cheers, > Allen > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From linux at alteeve.com Sat Oct 31 04:39:02 2009 From: linux at alteeve.com (Madison Kelly) Date: Sat, 31 Oct 2009 00:39:02 -0400 Subject: [Linux-cluster] Xen network config -> Fence problem Message-ID: <4AEBBF66.5070002@alteeve.com> Hi all, I've got CentOS 5.3 installed on two nodes (simple two node cluster). On this, I've got a DRBD partition running cluster aware LVM. I use this to host VMs under Xen. I've got a problem where I am trying to use eth0 as a back channel for the VMs on either node via a firewall VM. The network setup on each node is: eth0: back channel, IPMI only connected to an internal network. eth1: dedicated DRBD link. eth2: Internet-facing interface. I want to get eth0 and eth2 under Xen's networking but the default config was to leave eth0 alone. Specifically, the convirt-xen-multibridge is set to: "$dir/network-bridge" "$@" vifnum=0 netdev=peth0 bridge=xenbr0 When I change this to: "$dir/network-bridge" "$@" vifnum=0 netdev=eth0 bridge=xenbr0 One of the nodes will soon fence the other, and when it comes back up it fences the first. Eventually one node stays up and constantly fences the other. The node that survives prints this to repeatedly to the log just before it is fenced: Oct 31 00:27:21 vsh02 openais[3133]: [TOTEM] FAILED TO RECEIVE Oct 31 00:27:21 vsh02 openais[3133]: [TOTEM] entering GATHER state from 6. And the node that stays up prints this: Oct 31 00:35:47 vsh03 openais[3237]: [TOTEM] The token was lost in the OPERATIONAL state. Oct 31 00:35:47 vsh03 openais[3237]: [TOTEM] Receive multicast socket recv buffer size (288000 bytes). Oct 31 00:35:47 vsh03 openais[3237]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes). Oct 31 00:35:47 vsh03 openais[3237]: [TOTEM] entering GATHER state from 2. Oct 31 00:35:51 vsh03 openais[3237]: [TOTEM] entering GATHER state from 0. Oct 31 00:35:51 vsh03 openais[3237]: [TOTEM] Creating commit token because I am the rep. Oct 31 00:35:51 vsh03 openais[3237]: [TOTEM] Saving state aru 2c high seq received 2c Oct 31 00:35:51 vsh03 openais[3237]: [TOTEM] Storing new sequence id for ring 108 Oct 31 00:35:51 vsh03 openais[3237]: [TOTEM] entering COMMIT state. Oct 31 00:35:51 vsh03 openais[3237]: [TOTEM] entering RECOVERY state. 
Oct 31 00:35:51 vsh03 openais[3237]: [TOTEM] position [0] member 10.255.135.3: Oct 31 00:35:51 vsh03 openais[3237]: [TOTEM] previous ring seq 260 rep 10.255.135.2 Oct 31 00:35:51 vsh03 openais[3237]: [TOTEM] aru 2c high delivered 2c received flag 1 Oct 31 00:35:51 vsh03 openais[3237]: [TOTEM] Did not need to originate any messages in recovery. Oct 31 00:35:51 vsh03 openais[3237]: [TOTEM] Sending initial ORF token Oct 31 00:35:51 vsh03 openais[3237]: [CLM ] CLM CONFIGURATION CHANGE Oct 31 00:35:51 vsh03 openais[3237]: [CLM ] New Configuration: Oct 31 00:35:51 vsh03 kernel: dlm: closing connection to node 1 Oct 31 00:35:51 vsh03 fenced[3256]: vsh02.domain.com not a cluster member after 0 sec post_fail_delay Oct 31 00:35:51 vsh03 openais[3237]: [CLM ] r(0) ip(10.255.135.3) Oct 31 00:35:51 vsh03 fenced[3256]: fencing node "vsh02.domain.com" If I leave it long enough, the failed node (vsh02 in this case), stops getting fenced but the Xen networking doesn't come up. Specifically, no vifX.Y, xenbrX or other devices get created. Any idea what might be going on? I really need to get eth0 virtualized so that I can get routing to work. Thanks! Madi From linux at alteeve.com Sat Oct 31 05:01:23 2009 From: linux at alteeve.com (Madison Kelly) Date: Sat, 31 Oct 2009 01:01:23 -0400 Subject: [Linux-cluster] Xen network config -> Fence problem - More info In-Reply-To: <4AEBBF66.5070002@alteeve.com> References: <4AEBBF66.5070002@alteeve.com> Message-ID: <4AEBC4A3.2030208@alteeve.com> After sending this, I went back to debugging the problem. The machines had stopped fencing and the DRBD link was down. So first I stopped and then started 'xend' and this got the Xen-type networking up. I left the machines alone for about ten minutes to see if they would fence one another, they didn't. So then I set about fixing DRBD. I got the array re-sync'ing and I thought I might have gotten things working, but about 15 or 30 seconds after getting the DRBD back online, one node fenced the other again. It may have been a coincidence, but the last command I called before one node fenced the other was 'pvdisplay' to check the LVM PVs. That command didn't return, and may have been the trigger, I am not sure. So it looks like they fence each other until DRBD breaks. Once array is fixed and/or pvdisplay is called, the fence loop starts again. Madi Madison Kelly wrote: > Hi all, > > I've got CentOS 5.3 installed on two nodes (simple two node cluster). > On this, I've got a DRBD partition running cluster aware LVM. I use this > to host VMs under Xen. > > I've got a problem where I am trying to use eth0 as a back channel for > the VMs on either node via a firewall VM. The network setup on each node > is: > > eth0: back channel, IPMI only connected to an internal network. > eth1: dedicated DRBD link. > eth2: Internet-facing interface. > > I want to get eth0 and eth2 under Xen's networking but the default > config was to leave eth0 alone. Specifically, the > convirt-xen-multibridge is set to: > > "$dir/network-bridge" "$@" vifnum=0 netdev=peth0 bridge=xenbr0 > > When I change this to: > > "$dir/network-bridge" "$@" vifnum=0 netdev=eth0 bridge=xenbr0 > > One of the nodes will soon fence the other, and when it comes back up > it fences the first. Eventually one node stays up and constantly fences > the other. 
From branimirp at gmail.com Sat Oct 31 06:38:01 2009
From: branimirp at gmail.com (Branimir)
Date: Sat, 31 Oct 2009 07:38:01 +0100
Subject: [Linux-cluster] A home-grown cluster
In-Reply-To: <4AE8B2D9.5010407@gmail.com>
References: <4AE8B2D9.5010407@gmail.com>
Message-ID: <4AEBDB49.9010603@gmail.com>

Branimir wrote:
> Hi list ;)
>
> Well, here is my problem. I have configured a few production clusters
> myself - mostly HP Proliant machines with ILO/ILO2. Now I would like to
> do the same thing, but with ordinary PC hardware (the fact is, my wife
> wants me to reduce the number of my physical machines ;)). I have three
> more or less identical PCs, all running CentOS 5.4 with the Xen
> hypervisor. One of them would be a storage node using tgt to handle the
> iSCSI targets.
>
> To be honest, what bothers me the most is that I don't have a clue what
> to use as a fence device. I took ILO for granted, but this is completely
> new to me. Can you offer some advice?
>
Guys, thank you for your answers! Now I have to decide what suits me best.

Best regards,
Branimir

From fdinitto at redhat.com Sat Oct 31 07:18:06 2009
From: fdinitto at redhat.com (Fabio Massimo Di Nitto)
Date: Sat, 31 Oct 2009 08:18:06 +0100
Subject: [Linux-cluster] ccs_config_validate in cluster 3.0.X
In-Reply-To: <20091030160119.GA21200@bogon.sigxcpu.org>
References: <4AE81EAE.3040604@redhat.com> <20091030160119.GA21200@bogon.sigxcpu.org>
Message-ID: <4AEBE4AE.1070502@redhat.com>

Guido Günther wrote:
> On Wed, Oct 28, 2009 at 11:36:30AM +0100, Fabio M. Di Nitto wrote:
>> Hi everybody,
>>
>> as briefly mentioned in the 3.0.4 release notes, a new system to
>> validate the configuration has been enabled in the code.
>>
>> What it does
>> ------------
>>
>> The general idea is to be able to perform as many sanity checks on the
>> configuration as possible. This check allows us to spot the most common
>> mistakes, such as typos or possibly invalid values, in cluster.conf.
> This is great. For what it's worth: I've pushed Cluster 3.0.4 into
> Debian experimental a couple of days ago.
> Cheers,
>  -- Guido
>

Hi Guido,

thanks for pushing the packages to Debian. Please make sure to forward
bugs related to this check so we can address them quickly. Lon updated
the FAQ on our wiki to help with debugging issues related to RelaxNG.

It would be nice if you could do a package check around
(corosync/openais/cluster) and send us any local patch you have. I have
noticed at least corosync has one that is suitable for upstream. I didn't
have time to look at cluster.

Thanks
Fabio

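For anyone who wants to run the new check by hand before restarting
anything, it can be pointed at the local configuration directly. A minimal
example follows; the schema path is the usual install location and may
differ between distributions:

  ccs_config_validate
  xmllint --noout --relaxng /usr/share/cluster/cluster.rng /etc/cluster/cluster.conf

ccs_config_validate is the supported front end; the xmllint line is only
useful for seeing the raw RelaxNG errors when the validation output is too
terse to act on.
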
From dan at quah.ro Sat Oct 31 07:54:02 2009
From: dan at quah.ro (Dan Candea)
Date: Sat, 31 Oct 2009 09:54:02 +0200
Subject: [Linux-cluster] mount.gfs2 hangs on cluster-3.0.3
Message-ID: <4AEBED1A.6040902@quah.ro>

Hi all,

I really need some help. I have set up a cluster with cluster-3.0.3 and a
2.6.31 kernel. All went well until I tried a GFS2 mount. The mount hangs
without an error, and gfs_control dump reports nothing:

gfs_control dump
1256941054 logging mode 3 syslog f 160 p 6 logfile p 6 /var/log/cluster/gfs_controld.log
1256941054 gfs_controld 3.0.3 started
1256941054 /cluster/gfs_controld/@plock_ownership is 1
1256941054 /cluster/gfs_controld/@plock_rate_limit is 0
1256941054 logging mode 3 syslog f 160 p 6 logfile p 6 /var/log/cluster/gfs_controld.log
1256941054 group_mode 3 compat 0

From an strace of the mount command, it appears that gfs_controld is not
responding:

brk(0) = 0x7f86d9054000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f86d71aa000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f86d71a9000
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=27308, ...}) = 0
mmap(NULL, 27308, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f86d71a2000
close(3) = 0
open("/lib/libc.so.6", O_RDONLY) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\340\346\1\0\0\0\0\0@"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=1338408, ...}) = 0
mmap(NULL, 3446712, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f86d6c46000
mprotect(0x7f86d6d86000, 2097152, PROT_NONE) = 0
mmap(0x7f86d6f86000, 20480, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x140000) = 0x7f86d6f86000
mmap(0x7f86d6f8b000, 18360, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f86d6f8b000
close(3) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f86d71a1000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f86d71a0000
arch_prctl(ARCH_SET_FS, 0x7f86d71a06f0) = 0
open("/dev/urandom", O_RDONLY) = 3
read(3, "k\6\244\266U\3731\237"..., 8) = 8
close(3) = 0
mprotect(0x7f86d6f86000, 16384, PROT_READ) = 0
mprotect(0x7f86d73b7000, 4096, PROT_READ) = 0
mprotect(0x7f86d71ab000, 4096, PROT_READ) = 0
munmap(0x7f86d71a2000, 27308) = 0
brk(0) = 0x7f86d9054000
brk(0x7f86d9076000) = 0x7f86d9076000
lstat("/dev", {st_mode=S_IFDIR|0755, st_size=3300, ...}) = 0
lstat("/dev/mapper", {st_mode=S_IFDIR|0755, st_size=80, ...}) = 0
lstat("/dev/mapper/san", {st_mode=S_IFBLK|0640, st_rdev=makedev(253, 0), ...}) = 0
brk(0x7f86d9075000) = 0x7f86d9075000
lstat("/var", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
lstat("/var/www", {st_mode=S_IFDIR|0755, st_size=26, ...}) = 0
lstat("/var/www/superstore.to", {st_mode=S_IFDIR|0755, st_size=17, ...}) = 0
lstat("/var/www/superstore.to/data", {st_mode=S_IFDIR|0755, st_size=6, ...}) = 0
stat("/var/www/superstore.to/data", {st_mode=S_IFDIR|0755, st_size=6, ...}) = 0
open("/dev/mapper/san", O_RDONLY) = 3
lseek(3, 65536, SEEK_SET) = 65536
read(3, "\1\26\31p\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0d\0\0\0\0\0\0\7\t\0\0\7l\0"..., 512) = 512
close(3) = 0
rt_sigprocmask(SIG_BLOCK, [INT], [], 8) = 0
socket(PF_FILE, SOCK_STREAM, 0) = 3
connect(3, {sa_family=AF_FILE, path=@"gfsc_sock"...}, 12) = 0
write(3, "\\o\\o\1\0\1\0\7\0\0\0\0\0\0\0`p\0\0\0\0\0\0\0\0\0\0\0\0\0\0s"..., 28768) = 28768
read(3,

Any idea how to proceed further?

Thank you

--
Dan Cândea
Does God Play Dice?

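A few read-only checks usually narrow down where a hang like this is stuck.
Nothing below is specific to this setup; the daemon name is taken from the
dump above:

  cman_tool status                  # is the node quorate?
  cman_tool nodes                   # are all expected nodes members?
  fence_tool ls                     # has this node joined the fence domain?
  dlm_tool ls                       # are any lockspaces stuck mid-change?
  strace -p $(pidof gfs_controld)   # is gfs_controld making any system calls at all?

If the fence domain has no members, or a fence operation is still pending,
gfs_controld is expected to hold the mount request until fencing completes,
and on the mount.gfs2 side that looks exactly like a silent hang on the
read() from gfsc_sock.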