From mgrac at redhat.com Mon Jul 1 13:15:02 2013
From: mgrac at redhat.com (Marek Grac)
Date: Mon, 01 Jul 2013 15:15:02 +0200
Subject: [Linux-cluster] fence-agents-4.0.1 stable release
Message-ID: <51D180D6.6060200@redhat.com>

Welcome to the fence-agents 4.0.1 release. This release includes a few minor bug fixes:

* the fence agent node assassin was temporarily removed
* fix a problem with actions for fence agents without plugs/ports
* fix validation for password, password_script or identity file
* fence_scsi now supports delay like the other fence agents (be aware that the option is -H, which is agent specific)
* support for a new fencing method in fence_dummy - type=fail - all operations should fail
* improve handling of invalid power states
* fix fence_apc after introducing support for firmware 5.x - the problem occurred on devices with more than 25 devices
* the command prompt can now be properly entered from the user's input
* fence_dummy can benefit from a random delay at its start
* the manual page for fence_scsi was extended to provide info about 'unfence'
* a notice was added that the command prompt is expected to be a Python regular expression

The new source tarball can be downloaded here:
https://fedorahosted.org/releases/f/e/fence-agents/fence-agents-4.0.1.tar.xz

To report bugs or issues:
https://bugzilla.redhat.com/

Would you like to meet the cluster team or members of its community? Join us on IRC (irc.freenode.net #linux-cluster) and share your experience with other system administrators or power users.

Thanks/congratulations to all the people who contributed to this great milestone.

m,

From gianluca.cecchi at gmail.com Thu Jul 4 15:03:24 2013
From: gianluca.cecchi at gmail.com (Gianluca Cecchi)
Date: Thu, 4 Jul 2013 17:03:24 +0200
Subject: [Linux-cluster] Info on clvmd with halvm on rhel 6.3 based clusters
Message-ID: 

Hello,
I have already read these technotes, and my configuration seems consistent with them:

https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Cluster_Administration/ap-ha-halvm-CA.html
https://access.redhat.com/site/solutions/409813

Basically I would like to use clvmd with HA-LVM (as recommended) and set up the cluster service with resources like this:

The problem is that if I start both nodes, when clvmd starts it activates all the VGs, because of

action "Activating VG(s):" ${lvm_vgchange} -ayl $LVM_VGS || return $?

in the clvmd init script, with $LVM_VGS empty.

So when the service starts, it fails in LV activation (because the LV is already active) and the service then goes into a failed state.

My system is registered with rhsm and bound to the 6.3 release.
Current packages:
lvm2-cluster-2.02.95-10.el6_3.3.x86_64
cman-3.0.12.1-32.el6_3.2.x86_64
lvm2-2.02.95-10.el6_3.3.x86_64

I can solve my problem if I change the clvmd init script to match RHEL 5.9, where there is a conditional statement.
The diff between the original 6.3 clvmd init script and mine is now:

$ diff clvmd clvmd.orig
32,34d31
< # Activate & deactivate clustered LVs
< CLVMD_ACTIVATE_VOLUMES=1
<
91,92c88
< if [ -n "$CLVMD_ACTIVATE_VOLUMES" ] ; then
< ${lvm_vgscan} > /dev/null 2>&1
---
> ${lvm_vgscan} > /dev/null 2>&1
94,95c90
< action "Activating VG(s):" ${lvm_vgchange} -ayl $LVM_VGS || return $?
< fi
---
> action "Activating VG(s):" ${lvm_vgchange} -ayl $LVM_VGS || return $?

Then I set this in /etc/sysconfig/clvmd:
CLVMD_ACTIVATE_VOLUMES=""

Now all seems OK on start, stop and relocate.
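In other words, with my change the relevant fragment of the clvmd init script ends up looking roughly like this (reconstructed from the diff above, so treat it as a sketch rather than the exact Red Hat script):

# Activate & deactivate clustered LVs
# (defaults to on; the script sources /etc/sysconfig/clvmd, which is
# why the empty value set there below takes effect)
CLVMD_ACTIVATE_VOLUMES=1

...

# in the start path, only scan and activate VGs when the knob is set
if [ -n "$CLVMD_ACTIVATE_VOLUMES" ] ; then
        ${lvm_vgscan} > /dev/null 2>&1
        action "Activating VG(s):" ${lvm_vgchange} -ayl $LVM_VGS || return $?
fi

and on my nodes /etc/sysconfig/clvmd simply contains:

CLVMD_ACTIVATE_VOLUMES=""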
Between technotes of 6.4 I only see this BZ #729812 Prior to this update, occasional service failures occurred when starting the clvmd variant of the HA-LVM service on multiple nodes in a cluster at the same time. The start of an HA-LVM resource coincided with another node initializing that same HA-LVM resource. With this update, a patch has been introduced to synchronize the initialization of both resources. As a result, services no longer fail due to the simultaneous initialization. but I'm not sure if it is related with my problem as it is private. Can anyone give his/her opinion? I'm going to open a case with redhat, but I would like to understand if it's me missing something trivial.... as I think I would not be the only one with this kind of configuration.... Thanks in advance, Gianluca From rmitchel at redhat.com Fri Jul 5 00:42:47 2013 From: rmitchel at redhat.com (Ryan Mitchell) Date: Fri, 05 Jul 2013 10:42:47 +1000 Subject: [Linux-cluster] Info on clvmd with halvm on rhel 6.3 based clusters In-Reply-To: References: Message-ID: <51D61687.2080204@redhat.com> Hi, On 07/05/2013 01:03 AM, Gianluca Cecchi wrote: > Hello, > I already read these technotes so that it seems my configuration is > coherent with them: > > https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Cluster_Administration/ap-ha-halvm-CA.html > https://access.redhat.com/site/solutions/409813 > > basically I would like to use clvmd with ha-lvm (as recommended) and > set up the cluster service with resources like this: > > > vg_name="VG_PROVA"/> > force_fsck="0" force_unmount="1" fsid="50013" fstype="ext3 > " mountpoint="/PROVA" name="PROVA" options="" self_fence="1"/> > > > > > > > > The problem is that if I starts both nodes, when clvmd starts it > activates all the VGs, because of > > action "Activating VG(s):" ${lvm_vgchange} -ayl $LVM_VGS || return $? > > in init script for clvmd and $LVM_VGS empty > > So when the service starts, it fails in lv activation (because already > active) and then the service goes in failed state. You aren't starting rgmanager with the -N option are you? It is not the default. # man clurgmgrd -N Do not perform stop-before-start. Combined with the -Z flag to clusvcadm, this can be used to allow rgmanager to be upgraded without stopping a given user service or set of services. What is supposed to happen is: - clvmd is started at boot time, and all clustered logical volumes are activated (including CLVM HA-LVM volumes) - rgmanager starts after clvmd, and it initializes all resources to ensure they are in a known state. For example: Jul 4 20:06:26 r6ha1 rgmanager[2478]: I am node #1 Jul 4 20:06:27 r6ha1 rgmanager[2478]: Resource Group Manager Starting Jul 4 20:06:27 r6ha1 rgmanager[2478]: Loading Service Data Jul 4 20:06:33 r6ha1 rgmanager[2478]: Initializing Services <---- Jul 4 20:06:33 r6ha1 rgmanager[3316]: [fs] stop: Could not match /dev/vgdata/lvmirror with a real device Jul 4 20:06:33 r6ha1 rgmanager[2478]: stop on fs "fsdata" returned 2 (invalid argument(s)) Jul 4 20:06:35 r6ha1 rgmanager[2478]: Services Initialized Jul 4 20:06:35 r6ha1 rgmanager[2478]: State change: Local UP Jul 4 20:06:35 r6ha1 rgmanager[2478]: State change: r6ha2.cluster.net UP - So when rgmanager starts, it stops the CLVM HA-LVM logical volumes again prior to starting the service, unless you disabled the "stop-before-start" option. I did a quick test and I got the same results as you. Can you show your resource/service definitions and the logs of when rgmanager starts up? 
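By the way, a quick way to check whether rgmanager was started with -N on your nodes (just a sketch; any equivalent ps invocation works):

# ps -o pid,args -C rgmanager

If '-N' shows up in the arguments, stop-before-start has been disabled.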
> My system is registered with rhsm and bound to 6.3 release. > Current packages > lvm2-cluster-2.02.95-10.el6_3.3.x86_64 > cman-3.0.12.1-32.el6_3.2.x86_64 > lvm2-2.02.95-10.el6_3.3.x86_64 > > I can solve my problem if I set the clvmd init scripts as in rhel 5.9 > where there is a conditional statement. > Diff between original 6.3 clvmd init script and mine is now: > > $ diff clvmd clvmd.orig > 32,34d31 > < # Activate & deactivate clustered LVs > < CLVMD_ACTIVATE_VOLUMES=1 > < > 91,92c88 > < if [ -n "$CLVMD_ACTIVATE_VOLUMES" ] ; then > < ${lvm_vgscan} > /dev/null 2>&1 > --- >> ${lvm_vgscan} > /dev/null 2>&1 > 94,95c90 > < action "Activating VG(s):" ${lvm_vgchange} -ayl $LVM_VGS || return $? > < fi > --- >> action "Activating VG(s):" ${lvm_vgchange} -ayl $LVM_VGS || return $? > > Then I set this in /etc/sysconfig/clvmd > CLVMD_ACTIVATE_VOLUMES="" > > Now all seems ok in start, stop and relocate. This is another option, but it shouldn't be required if rgmanager is allowed to stop the resources prior to starting the service. We could raise an RFE to add this functionality to RHEL6 if a case is opened. > > Between technotes of 6.4 I only see this > > BZ #729812 > Prior to this update, occasional service failures occurred when > starting the clvmd variant of the > HA-LVM service on multiple nodes in a cluster at the same time. The > start of an HA-LVM > resource coincided with another node initializing that same HA-LVM > resource. With this update, > a patch has been introduced to synchronize the initialization of both > resources. As a result, > services no longer fail due to the simultaneous initialization. > > but I'm not sure if it is related with my problem as it is private. This is only related to starting the HA-LVM resources simultaneously on multiple nodes, and it synchronizes them correctly so it can only start on node node. > Can anyone give his/her opinion? > I'm going to open a case with redhat, but I would like to understand > if it's me missing something trivial.... as I think I would not be the > only one with this kind of configuration.... If you open a case with Red Hat, it may find its way to me and we can troubleshoot further. > > Thanks in advance, > > Gianluca > Regards, Ryan Mitchell Red Hat Global Support Services From gianluca.cecchi at gmail.com Fri Jul 5 15:35:18 2013 From: gianluca.cecchi at gmail.com (Gianluca Cecchi) Date: Fri, 5 Jul 2013 17:35:18 +0200 Subject: [Linux-cluster] Info on clvmd with halvm on rhel 6.3 based clusters In-Reply-To: <51D61687.2080204@redhat.com> References: <51D61687.2080204@redhat.com> Message-ID: On Fri, Jul 5, 2013 at 2:42 AM, Ryan Mitchell wrote: > You aren't starting rgmanager with the -N option are you? It is not the > default. > # man clurgmgrd > -N Do not perform stop-before-start. Combined with the -Z > flag to clusvcadm, this can be used to allow rgmanager to be upgraded > without stopping a given user service or set of services. > > What is supposed to happen is: > - clvmd is started at boot time, and all clustered logical volumes are > activated (including CLVM HA-LVM volumes) > - rgmanager starts after clvmd, and it initializes all resources to ensure > they are in a known state. 
For example: > Jul 4 20:06:26 r6ha1 rgmanager[2478]: I am node #1 > Jul 4 20:06:27 r6ha1 rgmanager[2478]: Resource Group Manager Starting > Jul 4 20:06:27 r6ha1 rgmanager[2478]: Loading Service Data > Jul 4 20:06:33 r6ha1 rgmanager[2478]: Initializing Services > <---- > Jul 4 20:06:33 r6ha1 rgmanager[3316]: [fs] stop: Could not match > /dev/vgdata/lvmirror with a real device > Jul 4 20:06:33 r6ha1 rgmanager[2478]: stop on fs "fsdata" returned 2 > (invalid argument(s)) > Jul 4 20:06:35 r6ha1 rgmanager[2478]: Services Initialized > Jul 4 20:06:35 r6ha1 rgmanager[2478]: State change: Local UP > Jul 4 20:06:35 r6ha1 rgmanager[2478]: State change: r6ha2.cluster.net UP > - So when rgmanager starts, it stops the CLVM HA-LVM logical volumes again > prior to starting the service, unless you disabled the "stop-before-start" > option. > > I did a quick test and I got the same results as you. Can you show your > resource/service definitions and the logs of when rgmanager starts up? > > > If you open a case with Red Hat, it may find its way to me and we can > troubleshoot further. Thanks for the answer Ryan. I opened the case 00900301 as suggested. I think the problem is with the clvmd already activating lvs. My service is composed by ip resource and some and resources When the nodes start up, on the node chosen by priority definition of failover domain I get this: Jul 4 14:27:46 oraugov4 rgmanager[6469]: Services Initialized Jul 4 14:27:46 oraugov4 rgmanager[6469]: State change: Local UP Jul 4 14:27:46 oraugov4 rgmanager[6469]: Starting stopped service service:MYSERVICE Jul 4 14:27:48 oraugov4 rgmanager[9436]: [lvm] Failed to activate logical volume, VG_UGDMPRO_TEMP/LV_UGDMPRO_TEMP Jul 4 14:27:48 oraugov4 rgmanager[9458]: [lvm] Attempting cleanup of VG_UGDMPRO_TEMP/LV_UGDMPRO_TEMP Jul 4 14:27:49 oraugov4 rgmanager[9484]: [lvm] Failed second attempt to activate VG_UGDMPRO_TEMP/LV_UGDMPRO_TEMP Jul 4 14:27:49 oraugov4 rgmanager[6469]: start on lvm "LV_UGDMPRO_TEMP" returned 1 (generic error) Jul 4 14:27:49 oraugov4 rgmanager[6469]: #68: Failed to start service:MYSERVICE; return value: 1 Jul 4 14:27:49 oraugov4 rgmanager[6469]: Stopping service service:MYSERVICE Jul 4 14:27:49 oraugov4 rgmanager[9557]: [fs] stop: Could not match /dev/VG_PROVA/lv_prova with a real device Jul 4 14:27:49 oraugov4 rgmanager[6469]: stop on fs "PROVA" returned 2 (invalid argument(s)) Jul 4 14:27:49 oraugov4 rgmanager[9594]: [fs] stop: Could not match /dev/VG_UGDMPRE_RDOF/LV_UGDMPRE_RDOF with a real device Jul 4 14:27:49 oraugov4 rgmanager[6469]: stop on fs "UGDMPRE_RDOF" returned 2 (invalid argument(s)) Jul 4 14:27:49 oraugov4 rgmanager[9631]: [fs] stop: Could not match /dev/VG_UGDMPRE_REDO/LV_UGDMPRE_REDO with a real device Jul 4 14:27:49 oraugov4 rgmanager[6469]: stop on fs "UGDMPRE_REDO" returned 2 (invalid argument(s)) Jul 4 14:27:49 oraugov4 rgmanager[9669]: [fs] stop: Could not match /dev/VG_UGDMPRE_DATA/LV_UGDMPRE_DATA with a real device Jul 4 14:27:49 oraugov4 rgmanager[6469]: stop on fs "UGDMPRE_DATA" returned 2 (invalid argument(s)) Jul 4 14:27:50 oraugov4 rgmanager[9706]: [fs] stop: Could not match /dev/VG_UGDMPRE_SAVE/LV_UGDMPRE_SAVE with a real device Jul 4 14:27:50 oraugov4 rgmanager[6469]: stop on fs "UGDMPRE_SAVE" returned 2 (invalid argument(s)) Jul 4 14:27:50 oraugov4 rgmanager[9743]: [fs] stop: Could not match /dev/VG_UGDMPRE_CTRL/LV_UGDMPRE_CTRL with a real device Jul 4 14:27:50 oraugov4 rgmanager[6469]: stop on fs "UGDMPRE_CTRL" returned 2 (invalid argument(s)) Jul 4 14:27:50 oraugov4 rgmanager[9780]: [fs] 
stop: Could not match /dev/VG_UGDMPRE_TEMP/LV_UGDMPRE_TEMP with a real device Jul 4 14:27:50 oraugov4 rgmanager[6469]: stop on fs "UGDMPRE_TEMP" returned 2 (invalid argument(s)) Jul 4 14:27:50 oraugov4 rgmanager[9817]: [fs] stop: Could not match /dev/VG_UGDMPRO_RDOF/LV_UGDMPRO_RDOF with a real device Jul 4 14:27:50 oraugov4 rgmanager[6469]: stop on fs "UGDMPRO_RDOF" returned 2 (invalid argument(s)) Jul 4 14:27:50 oraugov4 rgmanager[9854]: [fs] stop: Could not match /dev/VG_UGDMPRO_REDO/LV_UGDMPRO_REDO with a real device Jul 4 14:27:50 oraugov4 rgmanager[6469]: stop on fs "UGDMPRO_REDO" returned 2 (invalid argument(s)) Jul 4 14:27:50 oraugov4 rgmanager[9891]: [fs] stop: Could not match /dev/VG_UGDMPRO_DATA/LV_UGDMPRO_DATA with a real device Jul 4 14:27:50 oraugov4 rgmanager[6469]: stop on fs "UGDMPRO_DATA" returned 2 (invalid argument(s)) Jul 4 14:27:50 oraugov4 rgmanager[9928]: [fs] stop: Could not match /dev/VG_UGDMPRO_SAVE/LV_UGDMPRO_SAVE with a real device Jul 4 14:27:50 oraugov4 rgmanager[6469]: stop on fs "UGDMPRO_SAVE" returned 2 (invalid argument(s)) Jul 4 14:27:50 oraugov4 rgmanager[9965]: [fs] stop: Could not match /dev/VG_UGDMPRO_CTRL/LV_UGDMPRO_CTRL with a real device Jul 4 14:27:50 oraugov4 rgmanager[6469]: stop on fs "UGDMPRO_CTRL" returned 2 (invalid argument(s)) Jul 4 14:27:50 oraugov4 rgmanager[10002]: [fs] stop: Could not match /dev/VG_UGDMPRO_TEMP/LV_UGDMPRO_TEMP with a real device Jul 4 14:27:50 oraugov4 rgmanager[6469]: stop on fs "UGDMPRO_TEMP" returned 2 (invalid argument(s)) Jul 4 14:27:53 oraugov4 rgmanager[6469]: State change: icloraugov3 UP Jul 4 14:28:11 oraugov4 rgmanager[6469]: #12: RG service:MYSERVICE failed to stop; intervention required So I think I have double problem: 1) lv fails to activate because already active 2) then to solve the problem it tries to stop resources but fs.sh fails because it seems there is no related lv under it I think during the stop it should reverse order, so it should stop fs first (and it should get a result of already stopped) and only after it should deactivate the related lv... or not? Gianluca From mgrac at redhat.com Tue Jul 9 21:48:33 2013 From: mgrac at redhat.com (Marek Grac) Date: Tue, 09 Jul 2013 23:48:33 +0200 Subject: [Linux-cluster] fence_ovh - Fence agent for OVH (Proxmox 3) In-Reply-To: <498927.2707.1372269310026.JavaMail.adrian@adrianworktop> References: <498927.2707.1372269310026.JavaMail.adrian@adrianworktop> Message-ID: <51DC8531.6080404@redhat.com> Hi Adrian, On 06/26/2013 07:55 PM, Adrian Gibanel wrote: > I've improved my former fence_ovh script so that it works in Proxmox 3 and so that it uses suds library as I was suggested in the linux-cluster mailing list. > > 1) What is fence_ovh > > fence_ovh is a fence agent based on python for the big French datacentre provider OVH. You can get information about OVH on: http://www.ovh.co.uk/ . I also wanted to make clear that I'm not part of official OVH staff. > Thanks, you have done a great job in that rewrite. I have modified it a little to better fit into our existing infrastructure (--verbose, --plug). The only real change that I have added is that SOAP is not disconnected after every operation. Please take a look at it and (very likely) fix minor errors which I have introduced as I was not able to test it. m, -------------- next part -------------- A non-text attachment was scrubbed... 
Name: 0001-fence_ovh-New-fence-agent-for-OVH.patch
Type: text/x-patch
Size: 6371 bytes
Desc: not available
URL: 

From adrian.gibanel at btactic.com Thu Jul 11 17:48:55 2013
From: adrian.gibanel at btactic.com (Adrian Gibanel)
Date: Thu, 11 Jul 2013 19:48:55 +0200 (CEST)
Subject: [Linux-cluster] fence_ovh - Fence agent for OVH (Proxmox 3)
In-Reply-To: 
References: 
Message-ID: <16433523.1103.1373564934098.JavaMail.adrian@adrianworktop>

----- Original message -----
> Hi Adrian,
> On 06/26/2013 07:55 PM, Adrian Gibanel wrote:
> > I've improved my former fence_ovh script so that it works in
> > Proxmox 3 and so that it uses suds library as I was suggested in
> > the linux-cluster mailing list.
> >
> > 1) What is fence_ovh
> >
> > fence_ovh is a fence agent based on python for the big French
> > datacentre provider OVH. You can get information about OVH on:
> > http://www.ovh.co.uk/ . I also wanted to make clear that I'm not
> > part of official OVH staff.
>
> Thanks, you have done a great job in that rewrite. I have modified it a
> little to better fit into our existing infrastructure (--verbose,
> --plug). The only real change that I have added is that SOAP is not
> disconnected after every operation. Please take a look at it and (very
> likely) fix minor errors which I have introduced as I was not able to
> test it.

Thank you Marek!

About the SOAP disconnection: it's OK not to log out each time, but I think the SOAP login should be attempted just before calling the reboot_time function. The reason is that I'm afraid that 150 or 240 seconds are long enough for the session to time out. Maybe they are not, but I prefer to be on the safe side.

I have not tested it yet either, but I've seen some changes that can be made to it:

elif options["--action"] in ['on', 'off' ]:

should be:

elif options["--action"] in ['on', 'reboot' ]:

And at:

session = soap.service.login(options["--username"], options["--password"], 'es', 0)

you should use 'en' instead of 'es' so that default errors are printed in English by default.

Again, thank you!

--
--
Adrián Gibanel
I.T. Manager
+34 675 683 301
www.btactic.com

You can follow us at: / Before printing this message, think about the environment. The environment is everyone's responsibility.

NOTICE: The content of this message and its attachments is confidential. If you are not the intended recipient, be advised that using, disclosing and/or copying it without the corresponding authorisation is prohibited. If you have received this message in error, we would be grateful if you would notify the sender immediately and destroy the message.

From linuxtovishesh at gmail.com Sun Jul 14 12:28:37 2013
From: linuxtovishesh at gmail.com (Vishesh kumar)
Date: Sun, 14 Jul 2013 17:58:37 +0530
Subject: [Linux-cluster] Getting error in luci
Message-ID: 

Hi Members,

I am getting the below error in the luci web interface.
Can you please let me know the reasons for same ++++++++++++++++++++++ An error occurred during the creation of cluster "vk" while updating the luci database: An operation previously failed, with traceback: File "/usr/lib/python2.6/threading.py", line 504, in __bootstrap self.__bootstrap_inner() File "/usr/lib/python2.6/threading.py", line 532, in __bootstrap_inner self.run() File "/usr/lib/python2.6/threading.py", line 484, in run self.__target(*self.__args, **self.__kwargs) File "/usr/lib/python2.6/site-packages/paste/httpserver.py", line 878, in worker_thread_callback runnable() File "/usr/lib/python2.6/site-packages/paste/httpserver.py", line 1052, in +++++++++++++++++++++++++++++++++++++ -- Thanks Vishesh Kumar -------------- next part -------------- An HTML attachment was scrubbed... URL: From hsiddiqi at gmail.com Mon Jul 15 08:54:40 2013 From: hsiddiqi at gmail.com (Hammad Siddiqi) Date: Mon, 15 Jul 2013 13:54:40 +0500 Subject: [Linux-cluster] :BUG: soft lockup - CPU#0 stuck for 67s! [vm.sh:29764] Message-ID: Geniuses, I have a Redhat cluster setup for VMs running on KVM. during the live migration I have come across a kernel bug related to soft lockup of CPU # 0. Please see the back trace from abrt tool below. The host specs are: Supermicro Server with AMD Opteron processor (48 cores) RAM ECC 512 GB 6.4 x86_64 Disk images stored on Netapp volumes shared via NFS on 10Gbps network The issue may not be related to Clustering Suite (looks like kernel related) but any help in pointing to the right direction will highly be appreciated. Please let me know if you require additional information/logs/output Thank you Hammad Siddiqi abrt_version: 2.0.8 cmdline: ro root=/dev/mapper/VolGroup-lv_root rd_NO_LUKS LANG=en_US.UTF-8 rd_NO_MD rd_LVM_LV=VolGroup/lv_swap SYSFONT=latarcyrheb-sun16 crashkernel=161M at 0M rd_LVM_LV=VolGroup/lv_root KEYBOARDTYPE=pc KEYTABLE=us rd_NO_DM rhgb quiet comment: During live migration of KVM VMs (13 VMs at a time) kernel: 2.6.32-358.6.2.el6.x86_64 logfile: time: Mon 15 Jul 2013 12:55:20 AM PDT sosreport.tar.xz: Binary file, 3153956 bytes backtrace: :BUG: soft lockup - CPU#0 stuck for 67s! 
[vm.sh:29764] :Modules linked in: act_police cls_u32 sch_ingress cls_fw sch_htb ip6table_filter ip6_tables ebtable_nat ebtables bridge nfs lockd fscache auth_rpcgss nfs_acl dlm configfs sunrpc iptable_filter ip_tables openvswitch xsvhba(U) scsi_transport_fc scsi_tgt xve(U) xsvnic(U) bonding ipv6 8021q garp stp llc xscore(U) ib_cm mlx4_ib ib_sa ib_mad ib_core vhost_net macvtap macvlan tun kvm_amd kvm igb dca ptp pps_core mlx4_core sg serio_raw k10temp amd64_edac_mod edac_core edac_mce_amd i2c_piix4 i2c_core shpchp ext4 mbcache jbd2 sd_mod crc_t10dif sr_mod cdrom mpt2sas scsi_transport_sas raid_class ata_generic pata_acpi pata_atiixp ahci usb_storage dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan] :CPU 0 :Modules linked in: act_police cls_u32 sch_ingress cls_fw sch_htb ip6table_filter ip6_tables ebtable_nat ebtables bridge nfs lockd fscache auth_rpcgss nfs_acl dlm configfs sunrpc iptable_filter ip_tables openvswitch xsvhba(U) scsi_transport_fc scsi_tgt xve(U) xsvnic(U) bonding ipv6 8021q garp stp llc xscore(U) ib_cm mlx4_ib ib_sa ib_mad ib_core vhost_net macvtap macvlan tun kvm_amd kvm igb dca ptp pps_core mlx4_core sg serio_raw k10temp amd64_edac_mod edac_core edac_mce_amd i2c_piix4 i2c_core shpchp ext4 mbcache jbd2 sd_mod crc_t10dif sr_mod cdrom mpt2sas scsi_transport_sas raid_class ata_generic pata_acpi pata_atiixp ahci usb_storage dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan] :Pid: 29764, comm: vm.sh Not tainted 2.6.32-358.6.2.el6.x86_64 #1 Supermicro H8QG6/H8QG6 :RIP: 0010:[] [] wait_for_rqlock+0x2c/0x40 :RSP: 0018:ffff887a9febbeb8 EFLAGS: 00000202 :RAX: 0000000003d503b2 RBX: ffff887a9febbeb8 RCX: ffff880028216700 :RDX: 00000000000003d5 RSI: 0000000000000056 RDI: 0000000000000000 :RBP: ffffffff8100bb8e R08: ffff887bd174b500 R09: 0000000000000000 :R10: 0000000000000001 R11: 00000000000004fd R12: ffffffff00000000 :R13: 0000000000007444 R14: ffff887b00040001 R15: 0000000000000011 :FS: 00007f3bf82ec700(0000) GS:ffff880028200000(0000) knlGS:0000000000000000 :CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 :CR2: 00007f3bf79250a0 CR3: 0000000001a85000 CR4: 00000000000007f0 :DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 :DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 :Process vm.sh (pid: 29764, threadinfo ffff887a9feba000, task ffff887bd174b500) :Stack: :ffff887a9febbf38 ffffffff8107382b ffff888007203668 ffff887a9febbef8 : 00007fff8bf63cdc ffff887bd174b9c8 ffff887bd174b9c8 0000000000000000 : ffff887a9febbef8 ffff887a9febbef8 0000000001395020 0000000000000000 :Call Trace: :[] ? do_exit+0x5ab/0x870 :[] ? do_group_exit+0x58/0xd0 :[] ? sys_exit_group+0x17/0x20 :[] ? system_call_fastpath+0x16/0x1b :Code: 48 89 e5 0f 1f 44 00 00 48 c7 c0 00 67 01 00 65 48 8b 0c 25 b0 e0 00 00 0f ae f0 48 01 c1 eb 09 0f 1f 80 00 00 00 00 f3 90 8b 01 <89> c2 c1 fa 10 66 39 c2 75 f2 c9 c3 0f 1f 84 00 00 00 00 00 55 END: -------------- next part -------------- An HTML attachment was scrubbed... URL: From jpokorny at redhat.com Mon Jul 15 17:08:17 2013 From: jpokorny at redhat.com (Jan =?utf-8?Q?Pokorn=C3=BD?=) Date: Mon, 15 Jul 2013 19:08:17 +0200 Subject: [Linux-cluster] Getting error in luci In-Reply-To: References: Message-ID: <20130715170817.GA14067@redhat.com> Hello Vishesh, On 14/07/13 17:58 +0530, Vishesh kumar wrote: > I am getting below error in luci web interface. 
Can you please let > me know the reasons for same [following slightly modified in-place] > +++++++++++++++++++++++++++++++++++++ > An error occurred during the creation of cluster "vk" while updating the > luci database: An operation previously failed, with traceback: > > File "/usr/lib/python2.6/threading.py", line 504, > in __bootstrap > self.__bootstrap_inner() > File "/usr/lib/python2.6/threading.py", line 532, > in __bootstrap_inner > self.run() > File "/usr/lib/python2.6/threading.py", line 484, > in run > self.__target(*self.__args, **self.__kwargs) > File "/usr/lib/python2.6/site-packages/paste/httpserver.py", line 878, > in worker_thread_callback > runnable() > File "/usr/lib/python2.6/site-packages/paste/httpserver.py", line 1052, > in [reconstructed: process_request] > [reconstructed: > self.process_request_in_thread(request, client_address)] > [from now on, cannot be reconstructed reliably but expecting up to > tens of subsequent frames] > +++++++++++++++++++++++++++++++++++++ If I understand it correctly, this is what you get directly in the luci interface. As you can see, the traceback is not complete, but it may get trimmed somewhere on the way from source to the web browser (side question: is the snippet you provided really complete, i.e., not followed by the rest of expected traceback?). Anyway, authoritative source to check for details about the problems in luci is its log file located at /var/log/luci/luci.log by default. Could you please watch this file while reproducing the issue (best by issuing "tail -f /var/log/luci/luci.log" in a separete terminal) and provide the respective part of the log? This might help a lot. If you are only managing a single cluster or so in luci, perhaps I would recommend you to drop existing luci-internal database and start all over: service luci stop rm -i /var/lib/luci/data/luci.db service luci start Hope this helps. -- Jan From mgrac at redhat.com Wed Jul 17 15:47:12 2013 From: mgrac at redhat.com (Marek Grac) Date: Wed, 17 Jul 2013 17:47:12 +0200 Subject: [Linux-cluster] fence_ovh - Fence agent for OVH (Proxmox 3) In-Reply-To: <16433523.1103.1373564934098.JavaMail.adrian@adrianworktop> References: <16433523.1103.1373564934098.JavaMail.adrian@adrianworktop> Message-ID: <51E6BC80.9030805@redhat.com> Hi, On 07/11/2013 07:48 PM, Adrian Gibanel wrote: > About the SOAP disconnection it's ok you not logging out at each time but I think the soap login should be tried just before calling the reboot_time function. The reason is that I'm afraid that 150 or 240 seconds are long enough for session to timeout. Maybe they are not but I prefer to be in the safe side. Ok, we will start with original version when login/logout is done several times it should not impact fencing a lot. Later we can test if it works correctly with single login or not. > I have not tested it yet too but I've seen some changes that can made to it: > > elif options["--action"] in ['on', 'off' ]: > should be: > elif options["--action"] in ['on', 'reboot' ]: > > And at: > > session = soap.service.login(options["--username"], options["--password"], 'es', 0) > > You should use 'en' instead of 'es' so that default errors are printed in English by default. > fixed Fence agent is now upstream git - it will be part of next release 4.0.2 at the end of the month. 
m,

From linuxtovishesh at gmail.com Thu Jul 18 11:48:38 2013
From: linuxtovishesh at gmail.com (Vishesh kumar)
Date: Thu, 18 Jul 2013 17:18:38 +0530
Subject: [Linux-cluster] Getting error in luci
In-Reply-To: <20130715170817.GA14067@redhat.com>
References: <20130715170817.GA14067@redhat.com>
Message-ID: 

On Mon, Jul 15, 2013 at 10:38 PM, Jan Pokorn? wrote:
> rm -i /var/lib/luci/data/luci.db

Thanks Jan. It worked after removing /var/lib/luci/data/luci.db

Thanks
--
http://linuxmantra.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From linuxtovishesh at gmail.com Thu Jul 18 11:54:34 2013
From: linuxtovishesh at gmail.com (Vishesh kumar)
Date: Thu, 18 Jul 2013 17:24:34 +0530
Subject: [Linux-cluster] fence_xvm nopt working
Message-ID: 

Hi All,

I am trying to implement fence_xvm using the libvirt backend. Everything is set up fine and fence_virt.conf has the following configuration:
++++++++++++++++++++++++++++++++++
backends {
	libvirt {
		uri = "qemu:///system";
	}
}

listeners {
	multicast {
		port = "1229";
		family = "ipv4";
		address = "225.0.0.12";
		key_file = "/etc/cluster/fence_xvm.key";
	}
}

fence_virtd {
	module_path = "/usr/lib64/fence-virt";
	backend = "libvirt";
	listener = "multicast";
}
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

But this does not work, as the fence_virtd daemon stops immediately after starting. I am unable to find any logs either.

Changing the backend to checkpoint resolves the issue of fence_virtd stopping, but I have no idea how to implement the checkpoint backend.

--
http://linuxmantra.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From adel.benzarrouk at gmail.com Thu Jul 18 12:05:48 2013
From: adel.benzarrouk at gmail.com (Adel Ben Zarrouk)
Date: Thu, 18 Jul 2013 13:05:48 +0100
Subject: [Linux-cluster] Unable to connect to hp blade system with fence_hpblade (RHEL 6.4)
Message-ID: 

Hello,

I am trying to connect to Onboard administration of HP blade system using fence_hpblade tool , but I am getting the message:

unable/connect to fence device.

I was able to connect using ssh.

Please any advice or recommendation.

Regards

--Adel

On Thu, Jul 18, 2013 at 12:48 PM, Vishesh kumar wrote:
>
> On Mon, Jul 15, 2013 at 10:38 PM, Jan Pokorn? wrote:
>> rm -i /var/lib/luci/data/luci.db
>
> Thanks Jan . It worked after removing /var/lib/luci/data/luci.db
>
> Thanks
> --
> http://linuxmantra.com
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From lists at alteeve.ca Thu Jul 18 13:56:12 2013
From: lists at alteeve.ca (Digimer)
Date: Thu, 18 Jul 2013 09:56:12 -0400
Subject: [Linux-cluster] Unable to connect to hp blade system with fence_hpblade (RHEL 6.4)
In-Reply-To: 
References: 
Message-ID: <51E7F3FC.7060606@alteeve.ca>

Can you share the exact 'fence_hpblade ' line you're sending as well as any output you get back? I am not familiar with that agent, but most have a verbose or debug mode that will return more output. Some agents need to have the command prompt string defined or extra args like 'lanplus' for iLO defined. The man page should provide some insight.

digimer

On 18/07/13 08:05, Adel Ben Zarrouk wrote:
> Hello,
>
> I am trying to connect to Onboard administration of HP blade system
> using fence_hpblade tool , but I am getting the message:
>
> unable/connect to fence device.
>
> I was able to connect using ssh.
>
> Please any advice or recommendation.
> > Regards > > --Adel > > > On Thu, Jul 18, 2013 at 12:48 PM, Vishesh kumar > > wrote: > > > On Mon, Jul 15, 2013 at 10:38 PM, Jan Pokorn? > wrote: > > rm -i /var/lib/luci/data/luci.db > > > > Thanks Jan . It worked after removing /var/lib/luci/data/luci.db > > Thanks > -- > http://linuxmantra.com > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? From lists at alteeve.ca Thu Jul 18 13:57:47 2013 From: lists at alteeve.ca (Digimer) Date: Thu, 18 Jul 2013 09:57:47 -0400 Subject: [Linux-cluster] fence_xvm nopt working In-Reply-To: References: Message-ID: <51E7F45B.7040602@alteeve.ca> On 18/07/13 07:54, Vishesh kumar wrote: > Hi All, > > I am trying to implement fence_xvm using backend libvirt. Everything is > setup fine and fence_virt.conf have following configuration > ++++++++++++++++++++++++++++++++++ > > backends { > libvirt { > uri = "qemu:///system"; > } > > } > > listeners { > multicast { > port = "1229"; > family = "ipv4"; > address = "225.0.0.12"; > key_file = "/etc/cluster/fence_xvm.key"; > } > > } > > fence_virtd { > module_path = "/usr/lib64/fence-virt"; > backend = "libvirt"; > listener = "multicast"; > } > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ > > But this does not work as daeom fence_virtd immediately after starting. > I am unable to find any log as well. > > Changing backend to checkpoint resolve the issue of fence_virtd > stoppage, but i have no idea to implement checkpoint backend. This is a not-quite-finished tutorial I have been working on to cover fencing with fence_xvm / fence_virtd. Perhaps it would help? https://alteeve.ca/w/Fencing_KVM_Virtual_Servers -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? From linuxtovishesh at gmail.com Thu Jul 18 14:17:55 2013 From: linuxtovishesh at gmail.com (Vishesh kumar) Date: Thu, 18 Jul 2013 19:47:55 +0530 Subject: [Linux-cluster] fence_xvm nopt working In-Reply-To: <51E7F45B.7040602@alteeve.ca> References: <51E7F45B.7040602@alteeve.ca> Message-ID: Thanks for reply, I have to check value of /sys/class/net/virbr0/bridge/multicast_querier for centos6.4. Do this value only belong to bridged interface? Thanks On Thu, Jul 18, 2013 at 7:27 PM, Digimer wrote: > On 18/07/13 07:54, Vishesh kumar wrote: > >> Hi All, >> >> I am trying to implement fence_xvm using backend libvirt. Everything is >> setup fine and fence_virt.conf have following configuration >> ++++++++++++++++++++++++++++++**++++ >> >> backends { >> libvirt { >> uri = "qemu:///system"; >> } >> >> } >> >> listeners { >> multicast { >> port = "1229"; >> family = "ipv4"; >> address = "225.0.0.12"; >> key_file = "/etc/cluster/fence_xvm.key"; >> } >> >> } >> >> fence_virtd { >> module_path = "/usr/lib64/fence-virt"; >> backend = "libvirt"; >> listener = "multicast"; >> } >> ++++++++++++++++++++++++++++++**++++++++++++++++++++++++++++++**++++ >> >> But this does not work as daeom fence_virtd immediately after starting. >> I am unable to find any log as well. >> >> Changing backend to checkpoint resolve the issue of fence_virtd >> stoppage, but i have no idea to implement checkpoint backend. >> > > This is a not-quite-finished tutorial I have been working on to cover > fencing with fence_xvm / fence_virtd. Perhaps it would help? 
> > https://alteeve.ca/w/Fencing_**KVM_Virtual_Servers > > -- > Digimer > Papers and Projects: https://alteeve.ca/w/ > What if the cure for cancer is trapped in the mind of a person without > access to education? > -- http://linuxmantra.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From lists at alteeve.ca Thu Jul 18 15:21:10 2013 From: lists at alteeve.ca (Digimer) Date: Thu, 18 Jul 2013 11:21:10 -0400 Subject: [Linux-cluster] fence_xvm nopt working In-Reply-To: References: <51E7F45B.7040602@alteeve.ca> Message-ID: <51E807E6.5020706@alteeve.ca> If your bridge is 'virbr0', then yes. If you use traditional bridging, probably not. Do you see the VMs from the how when you run 'fence_xvm -o list'? On 18/07/13 10:17, Vishesh kumar wrote: > Thanks for reply, > > I have to check value of /sys/class/net/virbr0/bridge/multicast_querier > for centos6.4. Do this value only belong to bridged interface? > > Thanks > > > On Thu, Jul 18, 2013 at 7:27 PM, Digimer > wrote: > > On 18/07/13 07:54, Vishesh kumar wrote: > > Hi All, > > I am trying to implement fence_xvm using backend libvirt. > Everything is > setup fine and fence_virt.conf have following configuration > ++++++++++++++++++++++++++++++__++++ > > backends { > libvirt { > uri = "qemu:///system"; > } > > } > > listeners { > multicast { > port = "1229"; > family = "ipv4"; > address = "225.0.0.12"; > key_file = "/etc/cluster/fence_xvm.key"; > } > > } > > fence_virtd { > module_path = "/usr/lib64/fence-virt"; > backend = "libvirt"; > listener = "multicast"; > } > ++++++++++++++++++++++++++++++__++++++++++++++++++++++++++++++__++++ > > But this does not work as daeom fence_virtd immediately after > starting. > I am unable to find any log as well. > > Changing backend to checkpoint resolve the issue of fence_virtd > stoppage, but i have no idea to implement checkpoint backend. > > > This is a not-quite-finished tutorial I have been working on to > cover fencing with fence_xvm / fence_virtd. Perhaps it would help? > > https://alteeve.ca/w/Fencing___KVM_Virtual_Servers > > > -- > Digimer > Papers and Projects: https://alteeve.ca/w/ > What if the cure for cancer is trapped in the mind of a person > without access to education? > > > > > -- > http://linuxmantra.com -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? From linuxtovishesh at gmail.com Thu Jul 18 16:40:43 2013 From: linuxtovishesh at gmail.com (Vishesh kumar) Date: Thu, 18 Jul 2013 22:10:43 +0530 Subject: [Linux-cluster] fence_xvm nopt working In-Reply-To: <51E807E6.5020706@alteeve.ca> References: <51E7F45B.7040602@alteeve.ca> <51E807E6.5020706@alteeve.ca> Message-ID: On Thu, Jul 18, 2013 at 8:51 PM, Digimer wrote: > . Do you see the VMs from the how when you run 'fence_xvm -o list'? Thanks for reply. 'fence_xvm -o list' command resulting in timeout. Thanks -- http://linuxmantra.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From lists at alteeve.ca Thu Jul 18 16:49:39 2013 From: lists at alteeve.ca (Digimer) Date: Thu, 18 Jul 2013 12:49:39 -0400 Subject: [Linux-cluster] fence_xvm nopt working In-Reply-To: References: <51E7F45B.7040602@alteeve.ca> <51E807E6.5020706@alteeve.ca> Message-ID: <51E81CA3.6040806@alteeve.ca> On 18/07/13 12:40, Vishesh kumar wrote: > > On Thu, Jul 18, 2013 at 8:51 PM, Digimer > wrote: > > . Do you see the VMs from the how when you run 'fence_xvm -o list'? > > > Thanks for reply. 
> > 'fence_xvm -o list' command resulting in timeout. > > Thanks So the deamon is not running, it would seem. Try running 'fence_virtd -d99 -F' (show debug and do not fork into the background). -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? From linuxtovishesh at gmail.com Fri Jul 19 13:50:31 2013 From: linuxtovishesh at gmail.com (Vishesh kumar) Date: Fri, 19 Jul 2013 06:50:31 -0700 Subject: [Linux-cluster] fence_xvm nopt working In-Reply-To: <51E81CA3.6040806@alteeve.ca> References: <51E7F45B.7040602@alteeve.ca> <51E807E6.5020706@alteeve.ca> <51E81CA3.6040806@alteeve.ca> Message-ID: Thanks. It worked now. I debugged by option -d99 -F and found issue with multicast. Thanks On Thu, Jul 18, 2013 at 9:49 AM, Digimer wrote: > On 18/07/13 12:40, Vishesh kumar wrote: > >> >> On Thu, Jul 18, 2013 at 8:51 PM, Digimer > > wrote: >> >> . Do you see the VMs from the how when you run 'fence_xvm -o list'? >> >> >> Thanks for reply. >> >> 'fence_xvm -o list' command resulting in timeout. >> >> Thanks >> > > So the deamon is not running, it would seem. Try running 'fence_virtd -d99 > -F' (show debug and do not fork into the background). > > > -- > Digimer > Papers and Projects: https://alteeve.ca/w/ > What if the cure for cancer is trapped in the mind of a person without > access to education? > -- http://linuxmantra.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From emi2fast at gmail.com Fri Jul 19 14:23:35 2013 From: emi2fast at gmail.com (emmanuel segura) Date: Fri, 19 Jul 2013 16:23:35 +0200 Subject: [Linux-cluster] fence_xvm nopt working In-Reply-To: References: <51E7F45B.7040602@alteeve.ca> <51E807E6.5020706@alteeve.ca> <51E81CA3.6040806@alteeve.ca> Message-ID: Hello can you tell us how you resolved the problem, maybe it can be util for others people Thanks 2013/7/19 Vishesh kumar > Thanks. > > It worked now. I debugged by option -d99 -F and found issue with multicast. > > > Thanks > > On Thu, Jul 18, 2013 at 9:49 AM, Digimer wrote: > >> On 18/07/13 12:40, Vishesh kumar wrote: >> >>> >>> On Thu, Jul 18, 2013 at 8:51 PM, Digimer >> > wrote: >>> >>> . Do you see the VMs from the how when you run 'fence_xvm -o list'? >>> >>> >>> Thanks for reply. >>> >>> 'fence_xvm -o list' command resulting in timeout. >>> >>> Thanks >>> >> >> So the deamon is not running, it would seem. Try running 'fence_virtd >> -d99 -F' (show debug and do not fork into the background). >> >> >> -- >> Digimer >> Papers and Projects: https://alteeve.ca/w/ >> What if the cure for cancer is trapped in the mind of a person without >> access to education? >> > > > > -- > http://linuxmantra.com > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- esta es mi vida e me la vivo hasta que dios quiera -------------- next part -------------- An HTML attachment was scrubbed... URL: From linuxtovishesh at gmail.com Fri Jul 19 14:58:31 2013 From: linuxtovishesh at gmail.com (Vishesh kumar) Date: Fri, 19 Jul 2013 07:58:31 -0700 Subject: [Linux-cluster] fence_xvm nopt working In-Reply-To: References: <51E7F45B.7040602@alteeve.ca> <51E807E6.5020706@alteeve.ca> <51E81CA3.6040806@alteeve.ca> Message-ID: Sure. I edited fence_virt.conf file and set interface=virbr0. 
Conf file that worked for me is as below backends { libvirt { uri = "qemu:///system"; } } listeners { multicast { interface=virbr0; port = "1229"; family = "ipv4"; address = "225.0.0.12"; key_file = "/etc/cluster/fence_xvm.key"; } } fence_virtd { module_path = "/usr/lib64/fence-virt"; backend = "libvirt"; listener = "multicast";} Thanks On Fri, Jul 19, 2013 at 7:23 AM, emmanuel segura wrote: > Hello > > can you tell us how you resolved the problem, maybe it can be util for > others people > > Thanks > > > 2013/7/19 Vishesh kumar > >> Thanks. >> >> It worked now. I debugged by option -d99 -F and found issue with >> multicast. >> >> >> Thanks >> >> On Thu, Jul 18, 2013 at 9:49 AM, Digimer wrote: >> >>> On 18/07/13 12:40, Vishesh kumar wrote: >>> >>>> >>>> On Thu, Jul 18, 2013 at 8:51 PM, Digimer >>> > wrote: >>>> >>>> . Do you see the VMs from the how when you run 'fence_xvm -o list'? >>>> >>>> >>>> Thanks for reply. >>>> >>>> 'fence_xvm -o list' command resulting in timeout. >>>> >>>> Thanks >>>> >>> >>> So the deamon is not running, it would seem. Try running 'fence_virtd >>> -d99 -F' (show debug and do not fork into the background). >>> >>> >>> -- >>> Digimer >>> Papers and Projects: https://alteeve.ca/w/ >>> What if the cure for cancer is trapped in the mind of a person without >>> access to education? >>> >> >> >> >> -- >> http://linuxmantra.com >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> > > > > -- > esta es mi vida e me la vivo hasta que dios quiera > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- http://linuxmantra.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From magawake at gmail.com Sun Jul 21 06:28:42 2013 From: magawake at gmail.com (Mag Gam) Date: Sun, 21 Jul 2013 02:28:42 -0400 Subject: [Linux-cluster] Mag Gam Message-ID: http://pureau.be/ixulknth/geg.lwald Mag Gam 7/21/2013 7:28:37 AM From anprice at redhat.com Tue Jul 23 16:47:29 2013 From: anprice at redhat.com (Andrew Price) Date: Tue, 23 Jul 2013 17:47:29 +0100 Subject: [Linux-cluster] gfs2-utils 3.1.6 Released Message-ID: <51EEB3A1.9060507@redhat.com> Hi, gfs2-utils 3.1.6 has been released. Notable changes include: - A large number of improvements and bug fixes in fsck.gfs2, bringing the ability to fix a wider range of issues. - mkfs.gfs2 now aligns resource groups to RAID stripes, automatically if it can, or by using new options (see the man page). It also now uses far fewer resources to create larger file systems. - There is a new test suite, which can be run with 'make check'. The suite is quite small at the moment but we will be adding more tests in due course. - gfs_controld has been retired, as it hasn't been required since Linux 3.3. - Documentation has been improved and a doc/README.contributing file has been added to aid anybody interested in contributing to gfs2-utils. See below for a full list of changes. The source tarball is available from: https://fedorahosted.org/released/gfs2-utils/gfs2-utils-3.1.6.tar.gz Please test, and do make sure to report bugs, whether they're crashers or typos. 
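If you would like to try the new test suite against the release tarball, the usual autotools steps should be all that is needed (a sketch, assuming the tarball ships a generated configure script and you have the build dependencies installed):

  $ tar xzf gfs2-utils-3.1.6.tar.gz
  $ cd gfs2-utils-3.1.6
  $ ./configure
  $ make
  $ make check      # runs the new test suite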
Please file them against the gfs2-utils component of Fedora (rawhide): https://bugzilla.redhat.com/enter_bug.cgi?product=Fedora&component=gfs2-utils&version=rawhide Regards, Andy Price Red Hat File Systems Changes since 3.1.5: Andrew Price (55): libgfs2: Fix build with bison 2.6 gfs2-utils: Update translation files mkfs.gfs2: Improve strings for translation gfs2-utils: Add the beginnings of a test suite gfs2-utils tests: Add a script to exercise the utils gfs2-utils: Rename lockcapture directory to scripts gfs2-utils: Add a doc on contributing mkfs.gfs2: Add translator doc comments tunegfs2: Update man page tunegfs2: i18n improvements mkfs.gfs2: i18n improvements gfs2-utils: Update translations and .gitignore libgfs2: Rework blk_alloc_i libgfs2: Make gfs2_rgrp_out accept char buffers mkfs.gfs2: Reduce memory usage gfs2-utils: Make the tool tests script more useful mkfs.gfs2: Separate user options from file system params libgfs2: Move lgfs2_field_print into gfs2l and make it static fsck.gfs2: Trivial typo fix gfs2-utils build: Enable silent rules by default libgfs2: Remove gfs2_next_rg_meta gfs2-utils: Build system fixes libgfs2: Don't release rgrp buffers which are still in use gfs2_edit: Fix divide by zero bug mkfs.gfs2: Add options for stripe size and width libgfs2: Remove 'writes' field from gfs2_sbd mkfs.gfs2: Link to libblkid mkfs.gfs2: Use libblkid for checking contents mkfs.gfs2: Add a struct to store device info libgfs2: Clarify gfs2_compute_bitstructs's parameters gfs2-utils build: Fix reporting lack of check gfs2l: Improve usage message and opt handling gfs2l: Enable setting the type of a block gfs2l: Add hash comments gfs2l: Add options to print block types and fields gfs2l: Read from stdin by default gfs2l: Improve grammar layout and path parsing gfs2-utils: Remove some unused build files gfs2-utils: Retire gfs_controld build: Put back AC_CONFIG_SRCDIR gfs2-utils: Fix some uninitialized variable warnings libgfs2: Remove dinode_alloc mkfs.gfs2: Set sunit and swidth from probed io limits mkfs.gfs2: Align resource groups to RAID stripes mkfs.gfs2: Create new resource groups on-demand mkfs.gfs2: Add align option and update docs mkfs.gfs2: Move the new rgrp creation code into libgfs2 gfs2-utils: Update translations init.d/gfs2: Work around nested mount points umount bug fsck.gfs2: Don't call gettext a second time in fsck_query() fsck.gfs2: Don't rely on cluster.conf when rebuilding sb gfs2-utils: Add some missing gettext calls gfs2-utils: Update translation template gfs2-utils: Update docs gfs2-utils: Update .gitignore and doc/Makefile.am Bob Peterson (66): gfs2_convert: mark rgrp bitmaps dirty when converting gfs2_convert: mark buffer dirty when switching dirs from meta to data gfs2_convert: remember number of blocks when converting quotas gfs2_convert: Use proper header size when reordering meta pointers gfs2_convert: calculate height 1 for small files that were once big gfs2_convert: clear out old di_mode before setting it gfs2_convert: mask out proper bits when identifying symlinks fsck.gfs2: Detect and fix mismatch in GFS1 formal inode number gfs2_grow: report bad return codes on error libgfs2: externalize dir_split_leaf libgfs2: allow dir_split_leaf to receive a leaf buffer libgfs2: let dir_split_leaf receive a "broken" lindex fsck.gfs2: Move function find_free_blk to util.c fsck.gfs2: Split out function to make sure lost+found exists fsck.gfs2: Check for formal inode mismatch when adding to lost+found fsck.gfs2: shorten some debug messages in lost+found fsck.gfs2: Move basic 
directory entry checks to separate function fsck.gfs2: Add formal inode check to basic dirent checks fsck.gfs2: Add new function to check dir hash tables fsck.gfs2: Special case '..' when processing bad formal inode number fsck.gfs2: Move function to read directory hash table to util.c fsck.gfs2: Misc cleanups fsck.gfs2: Verify dirent hash values correspond to proper leaf block fsck.gfs2: re-read hash table if directory height or depth changes fsck.gfs2: fix leaf blocks, don't try to patch the hash table fsck.gfs2: check leaf depth when validating leaf blocks fsck.gfs2: small cleanups fsck.gfs2: reprocess inodes when blocks are added fsck.gfs2: Remove redundant leaf depth check fsck.gfs2: link dinodes that only have extended attribute problems fsck.gfs2: Add clarifying message to duplicate processing fsck.gfs2: separate function to calculate metadata block header size fsck.gfs2: Rework the "undo" functions fsck.gfs2: Check for interrupt when resolving duplicates fsck.gfs2: Consistent naming of struct duptree variables fsck.gfs2: Keep proper counts when duplicates are found fsck.gfs2: print metadata block reference on data errors fsck.gfs2: print block count values when fixing them fsck.gfs2: Do not invalidate metablocks of dinodes with invalid mode fsck.gfs2: Log when unrecoverable data block errors are encountered fsck.gfs2: don't remove buffers from the list when errors are found fsck.gfs2: Don't flag GFS1 non-dinode blocks as duplicates fsck.gfs2: externalize check_leaf fsck.gfs2: pass2: check leaf blocks when fixing hash table fsck.gfs2: standardize check_metatree return codes fsck.gfs2: don't invalidate files with duplicate data block refs fsck.gfs2: check for duplicate first references fsck.gfs2: When flagging a duplicate reference, show valid or invalid fsck.gfs2: major duplicate reference reform fsck.gfs2: Remove all bad eattr blocks fsck.gfs2: Remove unused variable fsck.gfs2: double-check transitions from dinode to data fsck.gfs2: Stop "undo" process when error data block is reached fsck.gfs2: Don't allocate leaf blocks in pass1 fsck.gfs2: take hash table start boundaries into account fsck.gfs2: delete all duplicates from unrecoverable damaged dinodes gfs2_edit: print formal inode numbers and hash value on dir display fsck.gfs2: fix some log messages fsck.gfs2: Fix directory link on relocated directory dirents fsck.gfs2: Fix infinite loop in pass1b caused by duplicates in hash table fsck.gfs2: don't check newly created lost+found in pass2 fsck.gfs2: avoid negative number in leaf depth fsck.gfs2: Detect and fix duplicate references in hash tables gfs2_edit: Add new option to print all bitmaps for an rgrp gfs2_edit: display pointer offsets for directory dinodes gfs2_edit: fix a segfault with file names > 255 bytes Callum Massey (1): gfs2-utils: Fix build warnings in Fedora 18 David Teigland (1): gfs2: add native setup to man page Paul Evans (1): libgfs2: Fix resource leak, variable "result" going out of scope Shane Bradley (5): gfs2-lockcapture: Modified some of the data gathered gfs2_trace: Added a script called gfs2_trace for kernel tracing debugging. gfs2_lockcapture: The script now returns a correct exit code when the script exits. gfs2_lockcapture: Capture the status of the cluster nodes and find the clusternode name and id. gfs2_lockcapture: Various script and man page updates Sitsofe Wheeler (1): Fix clang --analyze warning. 
Steven Whitehouse (3): libgfs2: Add readahead for rgrp headers fsck: Speed up reading of dir leaf blocks fsck: Clean up pass1 inode iteration code From dc12078 at gmail.com Wed Jul 24 13:51:13 2013 From: dc12078 at gmail.com (D C) Date: Wed, 24 Jul 2013 09:51:13 -0400 Subject: [Linux-cluster] rgmanager hangs when shutting down service. Message-ID: I setup a basic cluster for testing, with a virtual ip (on a bonded interface), and apache. I've verified that services work on both nodes, but I have an issue one of them during shutdown. CentOS 6.3 rpm -q rgmanager ricci modcluster resource-agents rgmanager-3.0.12.1-12.el6.x86_64 ricci-0.16.2-55.el6.x86_64 modcluster-0.16.2-18.el6.x86_64 resource-agents-3.9.2-12.el6_3.2.x86_64 [root at lust-02 cluster]# clusvcadm -d apache-service Local machine disabling service:apache-service... Nothing shows up in the logs, and I was able to verify that apache is still running, and the ip address is still active. I ran the command again with strace, but it seems to also just hang. Below is the entire output of the strace. [root at clust-02 cluster]# strace clusvcadm -d apache-service execve("/usr/sbin/clusvcadm", ["clusvcadm", "-d", "apache-service"], [/* 22 vars */]) = 0 brk(0) = 0x1f12000 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fa81ce8c000 access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory) open("/etc/ld.so.cache", O_RDONLY) = 3 fstat(3, {st_mode=S_IFREG|0644, st_size=32069, ...}) = 0 mmap(NULL, 32069, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fa81ce84000 close(3) = 0 open("/usr/lib64/libcman.so.3", O_RDONLY) = 3 read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0@\23`N4\0\0\0"..., 832) = 832 fstat(3, {st_mode=S_IFREG|0755, st_size=21272, ...}) = 0 mmap(0x344e600000, 2114200, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x344e600000 mprotect(0x344e604000, 2097152, PROT_NONE) = 0 mmap(0x344e804000, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x4000) = 0x344e804000 close(3) = 0 open("/lib64/libpthread.so.0", O_RDONLY) = 3 read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0`\\\240\3668\0\0\0"..., 832) = 832 fstat(3, {st_mode=S_IFREG|0755, st_size=145720, ...}) = 0 mmap(0x38f6a00000, 2212768, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x38f6a00000 mprotect(0x38f6a17000, 2097152, PROT_NONE) = 0 mmap(0x38f6c17000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x17000) = 0x38f6c17000 mmap(0x38f6c19000, 13216, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x38f6c19000 close(3) = 0 open("/usr/lib64/liblogthread.so.3", O_RDONLY) = 3 read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\20\16\340N4\0\0\0"..., 832) = 832 fstat(3, {st_mode=S_IFREG|0755, st_size=11592, ...}) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fa81ce83000 mmap(0x344ee00000, 2112968, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x344ee00000 mprotect(0x344ee02000, 2093056, PROT_NONE) = 0 mmap(0x344f001000, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1000) = 0x344f001000 mmap(0x344f002000, 7624, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x344f002000 close(3) = 0 open("/lib64/libc.so.6", O_RDONLY) = 3 read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\360\355a\3668\0\0\0"..., 832) = 832 fstat(3, {st_mode=S_IFREG|0755, st_size=1918016, ...}) = 0 mmap(0x38f6600000, 3741864, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x38f6600000 
mprotect(0x38f6789000, 2093056, PROT_NONE) = 0
mmap(0x38f6988000, 20480, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x188000) = 0x38f6988000
mmap(0x38f698d000, 18600, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x38f698d000
close(3) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fa81ce82000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fa81ce81000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fa81ce80000
arch_prctl(ARCH_SET_FS, 0x7fa81ce81700) = 0
mprotect(0x38f6c17000, 4096, PROT_READ) = 0
mprotect(0x38f6988000, 16384, PROT_READ) = 0
mprotect(0x38f601f000, 4096, PROT_READ) = 0
munmap(0x7fa81ce84000, 32069) = 0
set_tid_address(0x7fa81ce819d0) = 1095
set_robust_list(0x7fa81ce819e0, 0x18) = 0
futex(0x7fff3157e0ac, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7fff3157e0ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 1, NULL, 7fa81ce81700) = -1 EAGAIN (Resource temporarily unavailable)
rt_sigaction(SIGRTMIN, {0x38f6a05ae0, [], SA_RESTORER|SA_SIGINFO, 0x38f6a0f500}, NULL, 8) = 0
rt_sigaction(SIGRT_1, {0x38f6a05b70, [], SA_RESTORER|SA_RESTART|SA_SIGINFO, 0x38f6a0f500}, NULL, 8) = 0
rt_sigprocmask(SIG_UNBLOCK, [RTMIN RT_1], NULL, 8) = 0
getrlimit(RLIMIT_STACK, {rlim_cur=10240*1024, rlim_max=RLIM_INFINITY}) = 0
rt_sigaction(SIGPIPE, {SIG_IGN, [PIPE], SA_RESTORER|SA_RESTART, 0x38f6632920}, {SIG_DFL, [], 0}, 8) = 0
brk(0) = 0x1f12000
brk(0x1f33000) = 0x1f33000
socket(PF_FILE, SOCK_STREAM, 0) = 3
fcntl(3, F_SETFD, FD_CLOEXEC) = 0
connect(3, {sa_family=AF_FILE, path="/var/run/cman_client"}, 110) = 0
open("/dev/zero", O_RDONLY) = 4
fcntl(4, F_SETFD, FD_CLOEXEC) = 0
writev(3, [{"NAMC\3\0\0\20\24\0\0\0\7\0\0\0\0\0\0\0", 20}], 1) = 20
recvfrom(3, "NAMCk&\233?\210\3\0\0\7\0\0@\0\0\0\0", 20, 0, NULL, NULL) = 20
read(3, "\2\0\0\0\270\1\0\0\1\0\0\0\0\0\0\0\0\0\0\0\234\0\0\0\2\0\0\0e-cl"..., 884) = 884
writev(3, [{"NAMC\3\0\0\20\24\0\0\0\7\0\0\0\0\0\0\0", 20}], 1) = 20
recvfrom(3, "NAMCk&\233?\210\3\0\0\7\0\0@\0\0\0\0", 20, 0, NULL, NULL) = 20
read(3, "\2\0\0\0\270\1\0\0\1\0\0\0\0\0\0\0\0\0\0\0\234\0\0\0\2\0\0\0e-cl"..., 884) = 884
writev(3, [{"NAMC\3\0\0\20\314\1\0\0\220\0\0\0\0\0\0\0", 20}, {" \313\350\34\0\0\0\0\0\0\340\263\257b\376\377\0\0\366\302\301\353q\0\0\0\0\0\0\0\0\0"..., 440}], 2) = 460
recvfrom(3, "NAMCk&\233?\320\1\0\0\220\0\0@\0\0\0\0", 20, 0, NULL, NULL) = 20
read(3, "\0\0\0\0\270\1\0\0\2\0\0\0\1\0\0\0\0\0\0\0\4\0\0\0\2\0\0\0e-cl"..., 444) = 444
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fa81ce8b000
write(1, "Local machine disabling service:"..., 49Local machine disabling service:apache-service...) = 49
socket(PF_FILE, SOCK_STREAM, 0) = 5
connect(5, {sa_family=AF_FILE, path="/var/run/cluster/rgmanager.sk"}, 110) = 0
select(6, NULL, [5], [5], NULL) = 1 (out [5])
write(5, "h\0\0\0\4\261\227\36\22:\274\0\0\0\0h\0\23\205\202\0\0\0\0\0\0\0\0\0\0\0\0"..., 112) = 112
select(6, [5], NULL, [5], NULL

Thanks,
Dan
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From churnd at gmail.com Fri Jul 26 02:29:58 2013
From: churnd at gmail.com (ch urnd)
Date: Thu, 25 Jul 2013 22:29:58 -0400
Subject: [Linux-cluster] fence_drac5 timeouts
Message-ID: 

I'm trying to get fence_drac5 working on a cluster of two Dell R410s that I'm setting up. The primary issue I'm seeing is timeouts.
The fence does seem to work, as the other node does get shut down, but the script always exits 1. Here's the output:

# fence_drac5 -a 192.168.1.100 --power-timeout 30 -x -l root -p calvin -c 'admin1->' -o reboot
Connection timed out

# fence_drac5 -a 192.168.1.100 --power-timeout 30 -v -x -l root -p calvin -c 'admin1->' -o reboot
root at 192.168.1.100's password:
/admin1-> racadm serveraction powerstatus
Server power status: ON
/admin1->
/admin1-> racadm serveraction powerdown
Server power operation successful
/admin1->Traceback (most recent call last):
  File "/usr/sbin/fence_drac5", line 154, in 
    main()
  File "/usr/sbin/fence_drac5", line 137, in main
    result = fence_action(conn, options, set_power_status, get_power_status, get_list_devices)
  File "/usr/share/fence/fencing.py", line 838, in fence_action
    if wait_power_status(tn, options, get_power_fn) == 0:
  File "/usr/share/fence/fencing.py", line 744, in wait_power_status
    if get_power_fn(tn, options) != options["-o"]:
  File "/usr/sbin/fence_drac5", line 38, in get_power_status
    status = re.compile("(^|: )(ON|OFF|Powering ON|Powering OFF)\s*$", re.IGNORECASE | re.MULTILINE).search(conn.before).group(2)
AttributeError: 'NoneType' object has no attribute 'group'

Even though I pass "-o reboot", it still powers off. It does the same even if I don't pass that option. I added --power-timeout 30 in the latest test to see if that would help, but no dice; it doesn't work without it either.

I have tried fence_ipmilan and it works great, but the iDRAC interfaces are somewhat exposed and need to use SSH for security reasons, which limits me to fence_drac5.

Thanks.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From lists at alteeve.ca Sat Jul 27 00:40:26 2013
From: lists at alteeve.ca (Digimer)
Date: Fri, 26 Jul 2013 20:40:26 -0400
Subject: [Linux-cluster] Problem deleting running VM from rgmanager
Message-ID: <51F316FA.8010108@alteeve.ca>

Hi all,

I've got a problem where I deleted a running VM from the cluster using;

ccs -h localhost --activate --sync --password "secret" --rmvm vm01-win7

This kind of worked, in that the VM was removed from cluster.conf, but 'clustat' still shows it.
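For reference, what the --rmvm call strips out of cluster.conf is the <vm> element for that guest inside the <rm> section. A minimal sketch of such an entry (the attribute values here are illustrative only; the original definition is not shown in this thread) looks like:

    <rm>
        <vm name="vm01-win7" path="/shared/definitions/" autostart="1" recovery="restart"/>
    </rm>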
The logs from the call are:

=====
Jul 26 20:19:01 an-c05n01 ricci[18020]: Executing '/usr/bin/virsh nodeinfo'
Jul 26 20:19:01 an-c05n01 ricci[18066]: Executing '/usr/bin/virsh nodeinfo'
Jul 26 20:19:01 an-c05n01 ricci[18069]: Executing '/usr/bin/virsh nodeinfo'
Jul 26 20:19:01 an-c05n01 ricci[18071]: Executing '/usr/libexec/ricci/ricci-worker -f /var/lib/ricci/queue/1428781577'
Jul 26 20:19:01 an-c05n01 ricci[18075]: Executing '/usr/bin/virsh nodeinfo'
Jul 26 20:19:01 an-c05n01 ricci[18077]: Executing '/usr/bin/virsh nodeinfo'
Jul 26 20:19:01 an-c05n01 ricci[18080]: Executing '/usr/bin/virsh nodeinfo'
Jul 26 20:19:01 an-c05n01 ricci[18082]: Executing '/usr/libexec/ricci/ricci-worker -f /var/lib/ricci/queue/881799278'
Jul 26 20:19:03 an-c05n01 ricci[18088]: Executing '/usr/bin/virsh nodeinfo'
Jul 26 20:19:03 an-c05n01 ricci[18090]: Executing '/usr/bin/virsh nodeinfo'
Jul 26 20:19:03 an-c05n01 ricci[18093]: Executing '/usr/bin/virsh nodeinfo'
Jul 26 20:19:03 an-c05n01 ricci[18095]: Executing '/usr/libexec/ricci/ricci-worker -f /var/lib/ricci/queue/439919971'
Jul 26 20:19:03 an-c05n01 modcluster: Updating cluster.conf
Jul 26 20:19:03 an-c05n01 corosync[3479]: [QUORUM] Members[2]: 1 2
Jul 26 20:19:03 an-c05n01 ricci[18140]: Executing '/usr/bin/virsh nodeinfo'
Jul 26 20:19:03 an-c05n01 ricci[18170]: Executing '/usr/bin/virsh nodeinfo'
Jul 26 20:19:03 an-c05n01 rgmanager[3710]: Reconfiguring
Jul 26 20:19:03 an-c05n01 ricci[18194]: Executing '/usr/bin/virsh nodeinfo'
Jul 26 20:19:03 an-c05n01 ricci[18234]: Executing '/usr/libexec/ricci/ricci-worker -f /var/lib/ricci/queue/446527166'
Jul 26 20:19:04 an-c05n01 ricci[18457]: Executing '/usr/bin/virsh nodeinfo'
Jul 26 20:19:04 an-c05n01 ricci[18496]: Executing '/usr/bin/virsh nodeinfo'
Jul 26 20:19:04 an-c05n01 ricci[18528]: Executing '/usr/bin/virsh nodeinfo'
Jul 26 20:19:04 an-c05n01 ricci[18560]: Executing '/usr/libexec/ricci/ricci-worker -f /var/lib/ricci/queue/1207456461'
Jul 26 20:19:04 an-c05n01 modcluster: Updating cluster.conf
Jul 26 20:19:04 an-c05n01 corosync[3479]: [QUORUM] Members[2]: 1 2
Jul 26 20:19:05 an-c05n01 kernel: vbr2: port 4(vnet2) entering disabled state
Jul 26 20:19:05 an-c05n01 kernel: device vnet2 left promiscuous mode
Jul 26 20:19:05 an-c05n01 kernel: vbr2: port 4(vnet2) entering disabled state
Jul 26 20:19:06 an-c05n01 rgmanager[3710]: vm:vm01-win7 removed from the config, but I am not stopping it.
Jul 26 20:19:06 an-c05n01 rgmanager[3710]: Reconfiguring
Jul 26 20:19:07 an-c05n01 ntpd[2794]: Deleting interface #16 vnet2, fe80::fc54:ff:fea5:37ea#123, interface stats: received=0, sent=0, dropped=0, active_time=135 secs
=====

However, clustat still shows;

=====
Cluster Status for an-cluster-05 @ Fri Jul 26 20:37:17 2013
Member Status: Quorate

 Member Name                        ID   Status
 ------ ----                        ---- ------
 an-c05n01.alteeve.ca                  1 Online, rgmanager
 an-c05n02.alteeve.ca                  2 Online, Local, rgmanager

 Service Name                Owner (Last)                State
 ------- ----                ----- ------                -----
 service:storage_n01         an-c05n01.alteeve.ca        started
 service:storage_n02         an-c05n02.alteeve.ca        started
 vm:vm01-win7                an-c05n02.alteeve.ca        started
 vm:vm02-rhel6               an-c05n02.alteeve.ca        started
 vm:vm03-debian7             an-c05n01.alteeve.ca        started
 vm:vm04-solaris11           an-c05n02.alteeve.ca        started
 vm:vm05-win2008r2           an-c05n02.alteeve.ca        started
 vm:vm06-win8                an-c05n01.alteeve.ca        started
 vm:vm07-win2012             an-c05n02.alteeve.ca        started
 vm:vm08-freebsd9            an-c05n01.alteeve.ca        started
 vm:vm09-suse11              an-c05n01.alteeve.ca        started
=====

Trying to stop it produces;

=====
an-c05n02:~# clusvcadm -d vm:vm01-win7
Local machine disabling vm:vm01-win7...Failure
=====

CentOS 6.4, fully up to date; rgmanager-3.0.12.1-17.el6.x86_64

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access to education?

From lists at alteeve.ca Sat Jul 27 00:58:52 2013
From: lists at alteeve.ca (Digimer)
Date: Fri, 26 Jul 2013 20:58:52 -0400
Subject: [Linux-cluster] Problem deleting running VM from rgmanager
In-Reply-To: <51F316FA.8010108@alteeve.ca>
References: <51F316FA.8010108@alteeve.ca>
Message-ID: <51F31B4C.8030903@alteeve.ca>

I rebuilt the VM and deleted it a second time and it worked properly... I hate bugs like that.

digimer

On 26/07/13 20:40, Digimer wrote:
> Hi all,
>
> I've got a problem where I deleted a running VM from the cluster using;
>
> ccs -h localhost --activate --sync --password "secret" --rmvm vm01-win7
>
> This kind of worked, in that the VM was removed from cluster.conf,
> but 'clustat' still shows it. The logs from the call are:
>
> =====
> Jul 26 20:19:01 an-c05n01 ricci[18020]: Executing '/usr/bin/virsh nodeinfo'
> Jul 26 20:19:01 an-c05n01 ricci[18066]: Executing '/usr/bin/virsh nodeinfo'
> Jul 26 20:19:01 an-c05n01 ricci[18069]: Executing '/usr/bin/virsh nodeinfo'
> Jul 26 20:19:01 an-c05n01 ricci[18071]: Executing
> '/usr/libexec/ricci/ricci-worker -f /var/lib/ricci/queue/1428781577'
> Jul 26 20:19:01 an-c05n01 ricci[18075]: Executing '/usr/bin/virsh nodeinfo'
> Jul 26 20:19:01 an-c05n01 ricci[18077]: Executing '/usr/bin/virsh nodeinfo'
> Jul 26 20:19:01 an-c05n01 ricci[18080]: Executing '/usr/bin/virsh nodeinfo'
> Jul 26 20:19:01 an-c05n01 ricci[18082]: Executing
> '/usr/libexec/ricci/ricci-worker -f /var/lib/ricci/queue/881799278'
> Jul 26 20:19:03 an-c05n01 ricci[18088]: Executing '/usr/bin/virsh nodeinfo'
> Jul 26 20:19:03 an-c05n01 ricci[18090]: Executing '/usr/bin/virsh nodeinfo'
> Jul 26 20:19:03 an-c05n01 ricci[18093]: Executing '/usr/bin/virsh nodeinfo'
> Jul 26 20:19:03 an-c05n01 ricci[18095]: Executing
> '/usr/libexec/ricci/ricci-worker -f /var/lib/ricci/queue/439919971'
> Jul 26 20:19:03 an-c05n01 modcluster: Updating cluster.conf
> Jul 26 20:19:03 an-c05n01 corosync[3479]: [QUORUM] Members[2]: 1 2
> Jul 26 20:19:03 an-c05n01 ricci[18140]: Executing '/usr/bin/virsh nodeinfo'
> Jul 26 20:19:03 an-c05n01 ricci[18170]: Executing '/usr/bin/virsh nodeinfo'
> Jul 26 20:19:03 an-c05n01 rgmanager[3710]: Reconfiguring
> Jul 26 20:19:03 an-c05n01 ricci[18194]: Executing '/usr/bin/virsh nodeinfo'
> Jul 26 20:19:03 an-c05n01 ricci[18234]: Executing
> '/usr/libexec/ricci/ricci-worker -f /var/lib/ricci/queue/446527166'
> Jul 26 20:19:04 an-c05n01 ricci[18457]: Executing '/usr/bin/virsh nodeinfo'
> Jul 26 20:19:04 an-c05n01 ricci[18496]: Executing '/usr/bin/virsh nodeinfo'
> Jul 26 20:19:04 an-c05n01 ricci[18528]: Executing '/usr/bin/virsh nodeinfo'
> Jul 26 20:19:04 an-c05n01 ricci[18560]: Executing
> '/usr/libexec/ricci/ricci-worker -f /var/lib/ricci/queue/1207456461'
> Jul 26 20:19:04 an-c05n01 modcluster: Updating cluster.conf
> Jul 26 20:19:04 an-c05n01 corosync[3479]: [QUORUM] Members[2]: 1 2
> Jul 26 20:19:05 an-c05n01 kernel: vbr2: port 4(vnet2) entering disabled
> state
> Jul 26 20:19:05 an-c05n01 kernel: device vnet2 left promiscuous mode
> Jul 26 20:19:05 an-c05n01 kernel: vbr2: port 4(vnet2) entering disabled
> state
> Jul 26 20:19:06 an-c05n01 rgmanager[3710]: vm:vm01-win7 removed from the
> config, but I am not stopping it.
> Jul 26 20:19:06 an-c05n01 rgmanager[3710]: Reconfiguring
> Jul 26 20:19:07 an-c05n01 ntpd[2794]: Deleting interface #16 vnet2,
> fe80::fc54:ff:fea5:37ea#123, interface stats: received=0, sent=0,
> dropped=0, active_time=135 secs
> =====
>
> However, clustat still shows;
>
> =====
> Cluster Status for an-cluster-05 @ Fri Jul 26 20:37:17 2013
> Member Status: Quorate
>
>  Member Name                        ID   Status
>  ------ ----                        ---- ------
>  an-c05n01.alteeve.ca                  1 Online, rgmanager
>  an-c05n02.alteeve.ca                  2 Online, Local, rgmanager
>
>  Service Name                Owner (Last)                State
>  ------- ----                ----- ------                -----
>  service:storage_n01         an-c05n01.alteeve.ca        started
>  service:storage_n02         an-c05n02.alteeve.ca        started
>  vm:vm01-win7                an-c05n02.alteeve.ca        started
>  vm:vm02-rhel6               an-c05n02.alteeve.ca        started
>  vm:vm03-debian7             an-c05n01.alteeve.ca        started
>  vm:vm04-solaris11           an-c05n02.alteeve.ca        started
>  vm:vm05-win2008r2           an-c05n02.alteeve.ca        started
>  vm:vm06-win8                an-c05n01.alteeve.ca        started
>  vm:vm07-win2012             an-c05n02.alteeve.ca        started
>  vm:vm08-freebsd9            an-c05n01.alteeve.ca        started
>  vm:vm09-suse11              an-c05n01.alteeve.ca        started
> =====
>
> Trying to stop it produces;
>
> =====
> an-c05n02:~# clusvcadm -d vm:vm01-win7
> Local machine disabling vm:vm01-win7...Failure
> =====
>
> CentOS 6.4, fully up to date; rgmanager-3.0.12.1-17.el6.x86_64
>

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access to education?

From mgrac at redhat.com Tue Jul 30 11:00:49 2013
From: mgrac at redhat.com (Marek Grac)
Date: Tue, 30 Jul 2013 13:00:49 +0200
Subject: [Linux-cluster] fence-agents-4.0.2 stable release
Message-ID: <51F79CE1.3040707@redhat.com>

Welcome to the fence-agents 4.0.2 release.

This release includes a minor bug fix: invalid names in fence_eps, fence_rhevm and fence_xenapi.

In this release you can also find a new fence agent for OVH (http://www.ovh.com) and a symbolic link, fence_ilo4, which runs fence_ipmilan with the required arguments.

For the 4.0.x series, I plan to release a new version on at least a monthly basis.

The new source tarball can be downloaded here:

https://fedorahosted.org/releases/f/e/fence-agents/fence-agents-4.0.1.tar.xz

To report bugs or issues:

https://bugzilla.redhat.com/

Would you like to meet the cluster team or members of its community?

Join us on IRC (irc.freenode.net #linux-cluster) and share your experience with other system administrators or power users.

Thanks/congratulations to all people that contributed to achieve this great milestone.

m,

From lists at alteeve.ca Tue Jul 30 13:55:42 2013
From: lists at alteeve.ca (Digimer)
Date: Tue, 30 Jul 2013 09:55:42 -0400
Subject: [Linux-cluster] [Cluster-devel] fence-agents-4.0.2 stable release
In-Reply-To: <51F79CE1.3040707@redhat.com>
References: <51F79CE1.3040707@redhat.com>
Message-ID: <51F7C5DE.7040509@alteeve.ca>

On 30/07/13 07:00, Marek Grac wrote:
> Welcome to the fence-agents 4.0.2 release.
>
> This release includes a minor bug fix: invalid names in fence_eps,
> fence_rhevm and fence_xenapi.
>
> In this release you can also find a new fence agent for OVH
> (http://www.ovh.com) and a symbolic link, fence_ilo4, which runs
> fence_ipmilan with the required arguments.
>
> For the 4.0.x series, I plan to release a new version on at least a
> monthly basis.
>
> The new source tarball can be downloaded here:
>
> https://fedorahosted.org/releases/f/e/fence-agents/fence-agents-4.0.1.tar.xz
>
>
> To report bugs or issues:
>
> https://bugzilla.redhat.com/
>
> Would you like to meet the cluster team or members of its community?
>
> Join us on IRC (irc.freenode.net #linux-cluster) and share your
> experience with other system administrators or power users.
>
> Thanks/congratulations to all people that contributed to achieve this
> great milestone.
>
> m,

Yet another release goes by and I didn't get around to asking for a new agent to be added. :)

I've got a fence agent for TrippLite switched PDUs;

https://github.com/digimer/fence_tripplite_snmp

They're hardly ideal as fence devices because they are slow, but they do work reliably, and their cost makes them very common in DCs.

Also, \o/ new release!

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access to education?

From russell at jonesmail.me Wed Jul 31 02:14:54 2013
From: russell at jonesmail.me (Russell Jones)
Date: Tue, 30 Jul 2013 21:14:54 -0500
Subject: [Linux-cluster] corosync and token, token_retransmit, token_retransmit_before_loss_const confusion
Message-ID: <51F8731E.10008@jonesmail.me>

Hi all,

I am trying to understand how the corosync token, token_retransmit, and token_retransmits_before_loss_const variables all tie in together.

I have a standard RHCS v3 cluster set up and running. The token timeout is set to 10000. When testing, it seems to detect failed members pretty consistently within 10 seconds.

What I am not understanding is *when* a node is declared dead and a fence call is actually made. The man pages show that the cluster is reconfigured when the "token" time is reached, and also when token_retransmits_before_loss_const is reached. This is confusing :-) Which one is it that will reform the cluster? Both? When does one take precedence over the other?

Thanks!

From Maeulen at awp-shop.de Wed Jul 31 13:57:33 2013
From: Maeulen at awp-shop.de (Johannes Mäulen)
Date: Wed, 31 Jul 2013 13:57:33 +0000
Subject: [Linux-cluster] fence_ipmilan
Message-ID: <9A757AF2CA7F204A8F2444FFC5C27C30485F536C@Exchange2010.Skynet.local>

Hi there,

I'm trying to set up a cluster and had issues with 'fence_ipmilan' from the fence-agents package. I'm running Debian 7.1 with a 3.2.0-4-amd64 kernel. 'fence_ipmilan -V' gives 'fence_ipmilan 3.1.5'.

My cluster nodes run on Supermicro motherboards with on-board IPMI. (To be exact: http://www.supermicro.nl/products/motherboard/Xeon/C202_C204/X9SCA-F.cfm )

I've experienced the following behavior:

fence_ipmilan -a xxx.xxx.xxx.xxx -l USER -p PASS -v -o off; echo $?
Powering off machine @ IPMI:xxx.xxx.xxx.xxx...Spawning: '/usr/bin/ipmitool -I lan -H 'xxx.xxx.xxx.xxx' -U 'USER' -P 'PASS' -v chassis power status'...
Spawning: '/usr/bin/ipmitool -I lan -H 'xxx.xxx.xxx.xxx' -U 'USER' -P 'PASS' -v chassis power off'...
Spawning: '/usr/bin/ipmitool -I lan -H 'xxx.xxx.xxx.xxx' -U 'USER' -P 'PASS' -v chassis power status'...
Spawning: '/usr/bin/ipmitool -I lan -H 'xxx.xxx.xxx.xxx' -U 'USER' -P 'PASS' -v chassis power off'...
Spawning: '/usr/bin/ipmitool -I lan -H 'xxx.xxx.xxx.xxx' -U 'USER' -P 'PASS' -v chassis power status'...
Spawning: '/usr/bin/ipmitool -I lan -H 'xxx.xxx.xxx.xxx' -U 'USER' -P 'PASS' -v chassis power off'...
Spawning: '/usr/bin/ipmitool -I lan -H 'xxx.xxx.xxx.xxx' -U 'USER' -P 'PASS' -v chassis power status'...
Spawning: '/usr/bin/ipmitool -I lan -H 'xxx.xxx.xxx.xxx' -U 'USER' -P 'PASS' -v chassis power off'...
Spawning: '/usr/bin/ipmitool -I lan -H 'xxx.xxx.xxx.xxx' -U 'USER' -P 'PASS' -v chassis power status'...
Spawning: '/usr/bin/ipmitool -I lan -H 'xxx.xxx.xxx.xxx' -U 'USER' -P 'PASS' -v chassis power off'...
Spawning: '/usr/bin/ipmitool -I lan -H 'xxx.xxx.xxx.xxx' -U 'USER' -P 'PASS' -v chassis power status'...
Spawning: '/usr/bin/ipmitool -I lan -H 'xxx.xxx.xxx.xxx' -U 'USER' -P 'PASS' -v chassis power off'...
Spawning: '/usr/bin/ipmitool -I lan -H 'xxx.xxx.xxx.xxx' -U 'USER' -P 'PASS' -v chassis power status'...
Spawning: '/usr/bin/ipmitool -I lan -H 'xxx.xxx.xxx.xxx' -U 'USER' -P 'PASS' -v chassis power off'...
Spawning: '/usr/bin/ipmitool -I lan -H 'xxx.xxx.xxx.xxx' -U 'USER' -P 'PASS' -v chassis power status'...
ipmilan: Power still on
Failed
1

More or less at the same moment I got this message, the machine went down. So all the commands were working, just not within the expected time. (Using Supermicro mainboards with on-board IPMI: http://www.supermicro.nl/products/motherboard/Xeon/C202_C204/X9SCA-F.cfm )

I've played around with the available parameters and wasn't able to fix this behavior. So I went into the source code (fence-agents-3.1.5/fence/agents/ipmilan/ipmilan.c) and had a look at the ipmi_off function. There was a fixed value of 2 seconds to sleep. I modified this to use the same parameter as ipmi_on, ipmi->i_power_wait, instead of 2, so that I can change this value and test whether it has an effect on my problem. Now when I use the modified version of fence_ipmilan, the output looks like:

./fence_ipmilan -a xxx.xxx.xxx.xxx -l USER -p PASS -T 10 -v -o off ; echo $?
Spawning: '/usr/bin/ipmitool -I lan -H 'xxx.xxx.xxx.xxx' -U 'USER' -P 'PASS' -v chassis power off'...
Spawning: '/usr/bin/ipmitool -I lan -H 'xxx.xxx.xxx.xxx' -U 'USER' -P 'PASS' -v chassis power status'...
Done
0

So I think this fixed my problem, and I think it might help other users experiencing the same issues.

Kind regards,
Johannes
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/pkcs7-signature
Size: 6310 bytes
Desc: not available
URL: 
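For anyone wanting to reproduce the change Johannes describes, a rough sketch of the edit against fence-agents-3.1.5's fence/agents/ipmilan/ipmilan.c is below. The exact line numbers and surrounding retry loop are not shown in the report above, so treat this as an illustration of the one-line change rather than an upstream patch:

--- a/fence/agents/ipmilan/ipmilan.c
+++ b/fence/agents/ipmilan/ipmilan.c
(inside ipmi_off(), in the power-off/status retry loop; hunk position approximate)
-	sleep(2);
+	sleep(ipmi->i_power_wait);	/* honour -T / power_wait, as ipmi_on already does */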