From icyrace at gmail.com Thu Apr 1 04:22:14 2010
From: icyrace at gmail.com (=?GB2312?B?wO7OxOXQKEdhdmluIExlZSk=?=)
Date: Thu, 1 Apr 2010 12:22:14 +0800
Subject: [Linux-cluster] How to restart one resource in cluster manually but not detected by cluster
Message-ID:

Hi all, I met a problem when using the cluster: I need to restart one of the resources manually. The scenario is like this: one resource, httpd, is managed by Red Hat Cluster Suite. Sometimes I need to restart it manually with: service httpd restart (the script is located in /etc/init.d/httpd). When I restart it, the cluster sometimes detects that httpd is stopped and reports an error. How can I restart only one of the resources without the cluster treating it as a failure? We need to restart only one of the resources, not all of them, so restarting the whole resource group doesn't work for us. Does anyone have similar experience? Thanks & Best Regards!! --------------------------- Gavin Lee

From jcasale at activenetwerx.com Thu Apr 1 05:00:22 2010
From: jcasale at activenetwerx.com (Joseph L. Casale)
Date: Thu, 1 Apr 2010 05:00:22 +0000
Subject: [Linux-cluster] How to restart one resource in cluster manually but not detected by cluster
In-Reply-To: References: Message-ID:

>How can I restart only one of the resources without the cluster treating it as a failure? We need to restart only one of the resources, not all of them.
>So restarting the whole resource group doesn't work for us.

$ info clusvcadm
Look for the -Z option; it'll freeze the service on the member and prevent status checks. Don't forget to unfreeze :)

From markus.wolfgart at dlr.de Fri Apr 2 08:00:22 2010
From: markus.wolfgart at dlr.de (Markus Wolfgart)
Date: Fri, 02 Apr 2010 10:00:22 +0200
Subject: [Linux-cluster] gfs2-utils source for recovery purpose of a corrupt gfs2 partition
Message-ID: <4BB5A416.8020604@dlr.de>

Hi Cluster/GFS Experts, hi Bob, as I got no response concerning my recovery issue, I would like to summarize my activities, which may help someone else who runs into such a problem with gfs2. The corrupted gfs2 (12 TB before the grow, 25 TB after) is hosted on an SE6540 disk array, and the master is a Sun X4150 machine with 4 GB of RAM running CentOS 5.3 (i686/PAE), so I ran into the out-of-memory problem during the run of fsck.gfs2. No matter what I did, I was not even able to use the temporary swap file suggested in some postings. As the OS installation was done by other people who insist on this configuration, I booted a rescue x86_64 DVD in order to overcome the memory restriction. In addition, I was lucky to have some spare memory available to increase the RAM to 16 GB. As I prefer not to run the lvm/cman software (and, honestly speaking, do not have much experience with it), I created and mounted a large xfs partition on the disk array to hold a temporary swap file and to store the files I hope to recover from the corrupted gfs2 partition. An investigation via dd | od -c on the first MB of the gfs2 partition revealed that the gfs2 superblock (sb) starts after the 192k lvm2 block. Creating a loopback device with an offset of 196608 bytes let me access the file system via fsck without dlm/clvm etc.:
losetup /dev/loop4 /dev/sdb -o 196608
/sbin/fsck.gfs2 -f -p -y -v /dev/loop4
The index of the loop device depends on the usage of the rescue system. Check it with losetup -a and take a number which is not currently used.
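Before letting fsck loose on the loop device, it is worth confirming that the superblock really is where the 196608-byte offset puts it. A minimal sanity-check sketch, assuming the layout described above (the device name and offsets come from this post and may differ on other systems):

DEV=/dev/sdb
LVM_OFF=196608                  # size of the LVM2 metadata area seen with dd | od -c
SB_OFF=$((LVM_OFF + 64*1024))   # the GFS2 superblock sits 64 KiB into the filesystem
dd if=$DEV bs=1 skip=$SB_OFF count=4 2>/dev/null | od -An -tx1
# expected output: 01 16 19 70 (the GFS2 magic number); if it matches, the
# losetup/fsck.gfs2 commands above are aimed at the right offset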
After some attempts on checking the gfs2 running again in the oom my temp swap space is now about 0.7TG (no joke). I start with 20GB of swap space and double the size every oom abort of fsck. Now I was lucky to pass the first and run into the second check Initializing fsck Initializing lists... jid=0: Looking at journal... jid=0: Journal is clean. jid=1: Looking at journal... jid=1: Journal is clean. jid=2: Looking at journal... jid=2: Journal is clean. jid=3: Looking at journal... jid=3: Journal is clean. jid=4: Looking at journal... jid=4: Journal is clean. jid=5: Looking at journal... jid=5: Journal is clean. jid=6: Looking at journal... jid=6: Journal is clean. jid=7: Looking at journal... jid=7: Journal is clean. Initializing special inodes... Validating Resource Group index. Level 1 RG check. Level 2 RG check. Existing resource groups: 1: start: 17 (0x11), length = 529563 (0x8149b) 2: start: 529580 (0x814ac), length = 524241 (0x7ffd1) 3: start: 1053821 (0x10147d), length = 524241 (0x7ffd1) 4: start: 1578062 (0x18144e), length = 524241 (0x7ffd1) ... 9083643: start: 4762017571061 (0x454be5da0f5), length = 524241 (0x7ffd1) 9083644: start: 4762018095302 (0x454be65a0c6), length = 524241 (0x7ffd1) 9083645: start: 4762018619543 (0x454be6da097), length = 524241 (0x7ffd1) 9083646: start: 4762019143784 (0x454be75a068), length = 524241 (0x7ffd1) ... In addition to this I start to explore the code of gfs2-utils (folder libgfs2 and folder fsck) and was able to list the super block infos. As mentioned im my previous posting I was able to list all my file names of interest located in a 7TB big image created from the dd output. all files I'm looking for found in the directory structure (about 16 tousend) could be seen by a simple od -s (string mode) or by the xxd command. xxd -a -u -c 64 -s 671088640 dev_oa_vg_storage1_oa_lv_storage1.bin | less The first snippet of code I'm used to play around looks like listed below and is just plain a cut and paste of the utils code: The code just show some information of the super block. 
#include #include #include #include #include #include #include #include #include #define _(String) gettext(String) #include "gfs2structure.h" int main(int argc, char *argv[]) { int fd; char *device, *field; unsigned char buf[GFS2_BASIC_BLOCK]; unsigned char input[256]; unsigned char output[256]; struct gfs2_sb sb; struct gfs2_buffer_head dummy_bh; struct gfs2_dirent dirent,*dentp;; //struct gfs2_inum sbmd; //struct gfs2_inum sbrd; dummy_bh.b_data = (char *)buf; //memset(&dirent, 0, sizeof(struct gfs2_dirent)); device = argv[1]; fd = open(device, O_RDONLY); if (fd < 0) die("can't open %s: %s\n", device, strerror(errno)); if (lseek(fd, GFS2_SB_ADDR * GFS2_BASIC_BLOCK, SEEK_SET) != GFS2_SB_ADDR * GFS2_BASIC_BLOCK) { fprintf(stderr, _("bad seek: %s from %s:%d: superblock\n"), strerror(errno), __FUNCTION__, __LINE__); exit(-1); } if (read(fd, buf, GFS2_BASIC_BLOCK) != GFS2_BASIC_BLOCK) { fprintf(stderr, _("bad read: %s from %s:%d: superblock\n"), strerror(errno), __FUNCTION__, __LINE__); exit(-1); } gfs2_sb_in(&sb, &dummy_bh); if (sb.sb_header.mh_magic != GFS2_MAGIC || sb.sb_header.mh_type != GFS2_METATYPE_SB) die( _("there isn't a GFS2 filesystem on %s\n"), device); printf( _("current lock protocol name = \"%s\"\n"),sb.sb_lockproto); printf( _("current lock table name = \"%s\"\n"),sb.sb_locktable); printf( _("current ondisk format = %u\n"),sb.sb_fs_format); printf( _("current multihost format = %u\n"),sb.sb_multihost_format); //printf( _("current uuid = %s\n"), str_uuid(sb.sb_uuid)); printf( _("current block size = %u\n"), sb.sb_bsize); printf( _("current block size shift = %u\n"), sb.sb_bsize_shift); printf( _("masterdir-addr = %u\n"), sb.sb_master_dir.no_addr); printf( _("masterdir-fino = %u\n"), sb.sb_master_dir.no_formal_ino); printf( _("rootdir-fino = %u\n"), sb.sb_root_dir.no_addr); printf( _("rootdir-fino = %u\n"), sb.sb_root_dir.no_formal_ino); printf( _("dummy_bh.sdp = %p\n"), dummy_bh.sdp); printf( _("sdp->blks_alloced = %u\n"), dummy_bh.sdp->blks_alloced); printf( _("sdp->blks_total = %u\n"), dummy_bh.sdp->blks_total); printf( _("sdp->device_name = %s\n"), dummy_bh.sdp->device_name); //gfs2_dirent_in(&dirent, (char *)dentp); //gfs2_dirent_print(&dirent, output); //gfs2_dinode_print(struct gfs2_dinode *di); close(fd); } I will keep you all informed on the progress of this story. My next step will be - depending on the further progress of the fsck - (if it fails or not) to overwrite the "lock_" and/or "fsck_" flags in the image and to mount the gfs2 image to see what happens. Meanwhile during the run of fsck which could take a while (used swap space now is more the 510GB) as I was told, I hope someone could show me how to run through the inodes using libgfs2 to collect data from them or to point me to the right direction. Many Thanks in Advance and a nice Easter weekend. Bye Markus ******************************************************* Markus Wolfgart DLR Oberpfaffenhofen German Remote Sensing Data Center . . . e-mail: markus.wolfgart at dlr.de ********************************************************** Hi Bob, thanks for prompt reply! the fs originally was 12.4TB (6TB used) big. After a resize attempt to 25TB by gfs2_grow (very very old version gfs2-utils 1.62) The fs was expand and the first impression looks good as df reported the size of 25TB. But looking from the second node to the fs (two nod system) ls -r and ls -R throws IO errors and gfs2 mount get frozen (reboot of machine was performed). 
As no shrinking of gfs2 was possible to rollback, the additional physical volume was removed from the logical volume (lvresize to org. size & pvremove). This hard cut of the gsf2 unfenced partition should be hopefully repaired by the fsck.gfs2 (newest version), this was my thought. Even if this will not be the case, I could not run the fsck.gfs2 due to a "of memory in compute_rgrp_layout" message. see strace output: write(1, "9098813: start: 4769970307031 (0"..., 739098813: start: 4769970307031 (0x4569862bfd7), length = 524241 (0x7ffd1) ) = 73 write(1, "9098814: start: 4769970831272 (0"..., 739098814: start: 4769970831272 (0x456986abfa8), length = 524241 (0x7ffd1) ) = 73 write(1, "9098815: start: 4769971355513 (0"..., 739098815: start: 4769971355513 (0x4569872bf79), length = 524241 (0x7ffd1) ) = 73 write(1, "9098816: start: 4769971879754 (0"..., 739098816: start: 4769971879754 (0x456987abf4a), length = 524241 (0x7ffd1) ) = 73 write(1, "9098817: start: 4769972403995 (0"..., 739098817: start: 4769972403995 (0x4569882bf1b), length = 524241 (0x7ffd1) ) = 73 brk(0xb7dea000) = 0xb7dc9000 mmap2(NULL, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory) mmap2(NULL, 2097152, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = -1 ENOMEM (Cannot allocate memory) mmap2(NULL, 1048576, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = -1 ENOMEM (Cannot allocate memory) mmap2(NULL, 2097152, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = -1 ENOMEM (Cannot allocate memory) mmap2(NULL, 1048576, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = -1 ENOMEM (Cannot allocate memory) write(2, "Out of memory in compute_rgrp_la"..., 37Out of memory in compute_rgrp_layout ) = 37 exit_group(-1) = ? As I had already increased my swapspace swapon -s Filename Type Size Used Priority /dev/sda3 partition 8385920 0 -3 /var/swapfile.bin file 33554424 144 1 and run again the same situation as before I decide to start to extract the lost files by a c prog. Now I have create a big Image (7TB) on a xfs partition and would like to recover my files of interest by a program using libgfs2 or part of the source from gfs2-utils, as mentioned in my previous posting. As I see nearly all of the files located in the dir structure and get the position in the image by a simple string command, I hope to extract them in a simpler way. The RG size was set to the Max value of 2GB end each file I'm looking for is about 250BM big. The amount of files to be recovered is more then 16k. Every file have a header with his file name ant the total size, so it should be easy to check if the recovery of it is successful. So thats my theory, but this could be a easter vacation project without the right knowledge of gfs2. As I'm lucky to have the gfs2-utils source I hope it could be done. But if there is a simpler way to do a recovery by the installed gfs2 progs like gfs2_edit or gfs2_tool or other tools it would be nice if someone could show my the proper way. Many Thanks in advance Markus -- ******************************************************* Markus Wolfgart DLR Oberpfaffenhofen German Remote Sensing Data Center . . . e-mail: markus.wolfgart at dlr.de ********************************************************** ----- "Markus Wolfgart" wrote: | Hallo Cluster and GFS Experts, | | I'm a new subscriber of this mailing list and appologise | in the case my posting is offtopic. 
| | I'm looking for help concerning a corrupt gfs2 file system | which could not be recovered by me by fsck.gfs2 (Ver. 3.0.9) | due to to less less physical memory (4GB) eaven if increasing it | by a additional swap space (now about 35GB). | | I would like to parse a image created of the lost fs (the first 6TB) | with the code provided in the new gfs2-utils release. | | Due to this circumstance I hope to find in this mailing list some | hints | concerning an automated step by step recovery of lost data. | | Many Thanks in advance for your help | | Markus Hi Markus, You said that fsck.gfs2 is not working but you did not say what messages it gives you when you try. This must be a very big file system. How big is it? Was it converted from gfs1? Regards, Bob Peterson Red Hat File Systems From Chris.Jankowski at hp.com Fri Apr 2 11:20:26 2010 From: Chris.Jankowski at hp.com (Jankowski, Chris) Date: Fri, 2 Apr 2010 11:20:26 +0000 Subject: [Linux-cluster] Listing openAIS parameters on RHEL Cluster Suite 5 Message-ID: <036B68E61A28CA49AC2767596576CD596906CE679D@GVW1113EXC.americas.hpqcorp.net> Hi, As per Red Hat Knowledgebase note 18886 on RHEL 5.4 I should be able to get the current in-memory values of the openAIS paramemters by running the following commands: # openais-confdb-display totem.version = '2' totem.secauth = '1' # openais-confdb-display totem token totem.token = '10000' # openais-confdb-display totem consensustotem.consensus = '4800' # openais-confdb-display totem token_retransmits_before_loss_const totem.token_retransmits_before_loss_const = '20' # openais-confdb-display cman quorum_dev_poll cman.quorum_dev_poll = '40000' # openais-confdb-display cman expected_votes cman.expected_votes = '3' # openais-confdb-display cman two_node cman.two_node = '1' On my 5.4 cluster it works for the first 4 commands, but for the last 3 commands I get, respectively: Could not get "quorum _dev_poll" :1 Could not get "expected_votes" :1 Could not get "two_node" :1 Anybody knows what is going on here? Thanks and regards, Chris From Chris.Jankowski at hp.com Fri Apr 2 11:26:58 2010 From: Chris.Jankowski at hp.com (Jankowski, Chris) Date: Fri, 2 Apr 2010 11:26:58 +0000 Subject: [Linux-cluster] Is nit worth setting up jumbo Ethernet frames on the cluster interconnect link? Message-ID: <036B68E61A28CA49AC2767596576CD596906CE67A0@GVW1113EXC.americas.hpqcorp.net> Hi, On a heavily used cluster with GFS2 is it worth setting up jumbo Ethernet frames on the cluster interconnect link? Obviously, if only miniscule portion of the packets travelling through this link are larger than standard 1500 MTU then why to bother. I am seeing significant traffic on the link up to 50,000 packets per second. Thanks and regards, Chris -------------- next part -------------- An HTML attachment was scrubbed... URL: From jeff.sturm at eprize.com Fri Apr 2 14:00:37 2010 From: jeff.sturm at eprize.com (Jeff Sturm) Date: Fri, 2 Apr 2010 10:00:37 -0400 Subject: [Linux-cluster] Is nit worth setting up jumbo Ethernet frames on the cluster interconnect link? In-Reply-To: <036B68E61A28CA49AC2767596576CD596906CE67A0@GVW1113EXC.americas.hpqcorp.net> References: <036B68E61A28CA49AC2767596576CD596906CE67A0@GVW1113EXC.americas.hpqcorp.net> Message-ID: <64D0546C5EBBD147B75DE133D798665F055D8DEC@hugo.eprize.local> You probably wouldn't gain much, if anything. I see packets averaging around 130-160 bytes in size on one of our clusters. Fewer than 1% of packets are greater than 1400 bytes in size. 
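For anyone who wants the same numbers for their own interconnect before deciding, a rough way to get the frame-size distribution is to sample the cluster interface for a while and bucket the sizes. This is only a sketch; eth1, the sample count and the 1400-byte threshold are assumptions to adjust for your own setup:

tcpdump -i eth1 -e -nn -c 100000 2>/dev/null | awk '
  { for (i = 1; i <= NF; i++)
      if ($i == "length") { n = $(i + 1); sub(/:$/, "", n); tot++; if (n + 0 > 1400) big++; break } }
  END { if (tot) printf "%d frames sampled, %d (%.1f%%) larger than 1400 bytes\n", tot, big, 100 * big / tot }'

With -e, the first "length" field on each line is the captured frame length, which is the figure that matters for the MTU question.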
On the other hand, if you are using Ethernet for storage, you almost certainly want jumbo frames on any such interfaces. Jeff From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jankowski, Chris Sent: Friday, April 02, 2010 7:27 AM To: linux clustering Subject: [Linux-cluster] Is nit worth setting up jumbo Ethernet frames on the cluster interconnect link? Hi, On a heavily used cluster with GFS2 is it worth setting up jumbo Ethernet frames on the cluster interconnect link? Obviously, if only miniscule portion of the packets travelling through this link are larger than standard 1500 MTU then why to bother. I am seeing significant traffic on the link up to 50,000 packets per second. Thanks and regards, Chris -------------- next part -------------- An HTML attachment was scrubbed... URL: From markus.wolfgart at dlr.de Fri Apr 2 15:52:42 2010 From: markus.wolfgart at dlr.de (Markus Wolfgart) Date: Fri, 02 Apr 2010 17:52:42 +0200 Subject: [Linux-cluster] gfs2-utils source for recovery purpose of a corrupt gfs2 partition Message-ID: <4BB612CA.1080808@dlr.de> Hi Cluster/GFS Experts, I was playing arround with the libgfs2 and the fsck source and start the function "initialize" on my 30GB small fragment of the binary image of the corrupted partition hosted on my notebook. ... if ((sdp = (struct gfs2_sbd*)calloc(1,sizeof(struct gfs2_sbd)))==NULL) { printf( _("sdp-adr!! = %p\n"), sdp); //fill_super_block(sdp); } else { printf( _("sdp-adr = %p\n"), sdp); //retval=fill_super_block(sdp); retval=initialize(sdp, 0, 0, &all_clean); } ... The output I get from it, besides my own is listed below. >bin/Release> ./recover_file_from_gfs2_image /mnt/mybook/GFS2-Problem/dev_oa_vg_storage1_oa_lv_storage1.bin current lock protocol name = "fsck_dlm" current lock table name = "oa-dp:oa_gfs1" current ondisk format = 1801 current multihost format = 1900 current block size = 4096 current block size shift = 12 masterdir-addr = 51 masterdir-fino = 2 rootdir-fino = 50 rootdir-fino = 1 dummy_bh.sdp = 0x4016c0 sdp-adr = 0x619010 Validating Resource Group index. Level 1 RG check. (level 1 failed) Level 2 RG check. L2: number of rgs in the index = 11541. L2: number of rgs expected = 150. L2: They don't match; either (1) the fs was extended, (2) an odd L2: rg size was used, or (3) we have a corrupt rg index. (level 2 failed) Level 3 RG check. RG 2 is damaged: getting dist from index: 0x8149b RG 1 at block 0x11 intact [length 0x8149b] RG 2 at block 0x814AC intact [length 0x21fd1] RG 3 at block 0xA347D intact [length 0x21fd1] * RG 4 at block 0xC544E *** DAMAGED *** [length 0x21fd1] * RG 5 at block 0xE741F *** DAMAGED *** [length 0x21fd1] * RG 6 at block 0x1093F0 *** DAMAGED *** [length 0x21fd1] * RG 7 at block 0x12B3C1 *** DAMAGED *** [length 0x21fd1] Error: too many bad RGs. Error rebuilding rg list. (level 3 failed) RG recovery impossible; I can't fix this file system. sdp->blks_alloced = 3071213568 sdp->blks_total = 1711852225 A second run on a 50GB fragment run in an oom, despite 4GB Ram and 4GB swap. Validating Resource Group index. Level 1 RG check. (level 1 failed) Level 2 RG check. Out of memory in compute_rgrp_layout So how much virtual memory should be provideed to get a succesfull run for 128k big RGs let say for a 1TB big fs, when only 50GB data cause to alloc more then 8GB memory? 
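For what it's worth, some rough arithmetic on the figures printed above suggests the memory use is driven by the (apparently inflated) resource-group index rather than by how much data is on disk: the index claims over nine million RGs, which would correspond to a filesystem far larger than the 25 TB volume, and GFS2 keeps 2 bits of bitmap per block in each RG. These are back-of-the-envelope numbers taken from this thread, not a statement about fsck.gfs2 internals:

RGS=9083646            # RG count claimed by the (corrupt) rindex, from the fsck output
BLKS_PER_RG=524241     # "length" printed for each RG
BSIZE=4096
echo "claimed fs size: $(( RGS * BLKS_PER_RG * BSIZE / (1024*1024*1024*1024) )) TiB"   # roughly 17700 TiB
echo "bitmap data:     $(( RGS * BLKS_PER_RG / 4 / (1024*1024*1024) )) GiB"            # roughly 1100 GiB

The second figure is in the same ballpark as the hundreds of gigabytes of swap already consumed, so if fsck ends up holding every RG's bitmaps (or even just per-RG bookkeeping) in core, no realistic amount of RAM will be enough until the rindex itself is repaired.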
Bye and many thanks for information Markus Hi Cluster/GFS Experts, Hi Bob, as I get no response concerning my recovery issue, I would like to summarize my activities, which could help someone else running in such a problem with gfs2. As the corrupted gfs2 (12TB b4 grow 25TB after) was hosted on a SE6540 disk array and the master is a Sun X4150 4GB machine with a CentOS 5.3 (i686/PAE), I run in the out of memory problem during the run of fsck.gfs2. No matter what i have done, I was not able even use the temporary swap file as found in some postings suggested. As the os installation was done by other guys and they insist on this configuration, I boot a rescue x86_64 dvd in order to overcame the memory restriction. In addition to this I was lucky to have some spare memory to increase the ram to 16GB. As I don't like to run the lvm/cman software as well as honestly speeking not having much experience on this, I create and mount a large xfs partition an the disk array to create a temporary swap file and to store the files I hope to recover from the corrupted gfs2 partition. An investigation via dd | od -c on the first mb of the gfs2 partition reveal that after the lvm2 block of a size of 192k the sb (super block) of the gfs2 starts. creating an loopback device with an offset of 196608 bytes let my access the file system via fsck without dlm/clvm etc. losetup /dev/loop4 /dev/sdb -o 196608 /sbin/fsck.gfs2 -f -p -y -v /dev/loop4 The index of the loop device depends on the usage of the rescue system. Check it with losetup -a and take a number which is not currently used. After some attempts on checking the gfs2 running again in the oom my temp swap space is now about 0.7TG (no joke). I start with 20GB of swap space and double the size every oom abort of fsck. Now I was lucky to pass the first and run into the second check Initializing fsck Initializing lists... jid=0: Looking at journal... jid=0: Journal is clean. jid=1: Looking at journal... jid=1: Journal is clean. jid=2: Looking at journal... jid=2: Journal is clean. jid=3: Looking at journal... jid=3: Journal is clean. jid=4: Looking at journal... jid=4: Journal is clean. jid=5: Looking at journal... jid=5: Journal is clean. jid=6: Looking at journal... jid=6: Journal is clean. jid=7: Looking at journal... jid=7: Journal is clean. Initializing special inodes... Validating Resource Group index. Level 1 RG check. Level 2 RG check. Existing resource groups: 1: start: 17 (0x11), length = 529563 (0x8149b) 2: start: 529580 (0x814ac), length = 524241 (0x7ffd1) 3: start: 1053821 (0x10147d), length = 524241 (0x7ffd1) 4: start: 1578062 (0x18144e), length = 524241 (0x7ffd1) ... 9083643: start: 4762017571061 (0x454be5da0f5), length = 524241 (0x7ffd1) 9083644: start: 4762018095302 (0x454be65a0c6), length = 524241 (0x7ffd1) 9083645: start: 4762018619543 (0x454be6da097), length = 524241 (0x7ffd1) 9083646: start: 4762019143784 (0x454be75a068), length = 524241 (0x7ffd1) ... In addition to this I start to explore the code of gfs2-utils (folder libgfs2 and folder fsck) and was able to list the super block infos. As mentioned im my previous posting I was able to list all my file names of interest located in a 7TB big image created from the dd output. all files I'm looking for found in the directory structure (about 16 tousend) could be seen by a simple od -s (string mode) or by the xxd command. 
xxd -a -u -c 64 -s 671088640 dev_oa_vg_storage1_oa_lv_storage1.bin | less The first snippet of code I'm used to play around looks like listed below and is just plain a cut and paste of the utils code: The code just show some information of the super block. #include #include #include #include #include #include #include #include #include #define _(String) gettext(String) #include "gfs2structure.h" int main(int argc, char *argv[]) { int fd; char *device, *field; unsigned char buf[GFS2_BASIC_BLOCK]; unsigned char input[256]; unsigned char output[256]; struct gfs2_sb sb; struct gfs2_buffer_head dummy_bh; struct gfs2_dirent dirent,*dentp;; //struct gfs2_inum sbmd; //struct gfs2_inum sbrd; dummy_bh.b_data = (char *)buf; //memset(&dirent, 0, sizeof(struct gfs2_dirent)); device = argv[1]; fd = open(device, O_RDONLY); if (fd < 0) die("can't open %s: %s\n", device, strerror(errno)); if (lseek(fd, GFS2_SB_ADDR * GFS2_BASIC_BLOCK, SEEK_SET) != GFS2_SB_ADDR * GFS2_BASIC_BLOCK) { fprintf(stderr, _("bad seek: %s from %s:%d: superblock\n"), strerror(errno), __FUNCTION__, __LINE__); exit(-1); } if (read(fd, buf, GFS2_BASIC_BLOCK) != GFS2_BASIC_BLOCK) { fprintf(stderr, _("bad read: %s from %s:%d: superblock\n"), strerror(errno), __FUNCTION__, __LINE__); exit(-1); } gfs2_sb_in(&sb, &dummy_bh); if (sb.sb_header.mh_magic != GFS2_MAGIC || sb.sb_header.mh_type != GFS2_METATYPE_SB) die( _("there isn't a GFS2 filesystem on %s\n"), device); printf( _("current lock protocol name = \"%s\"\n"),sb.sb_lockproto); printf( _("current lock table name = \"%s\"\n"),sb.sb_locktable); printf( _("current ondisk format = %u\n"),sb.sb_fs_format); printf( _("current multihost format = %u\n"),sb.sb_multihost_format); //printf( _("current uuid = %s\n"), str_uuid(sb.sb_uuid)); printf( _("current block size = %u\n"), sb.sb_bsize); printf( _("current block size shift = %u\n"), sb.sb_bsize_shift); printf( _("masterdir-addr = %u\n"), sb.sb_master_dir.no_addr); printf( _("masterdir-fino = %u\n"), sb.sb_master_dir.no_formal_ino); printf( _("rootdir-fino = %u\n"), sb.sb_root_dir.no_addr); printf( _("rootdir-fino = %u\n"), sb.sb_root_dir.no_formal_ino); printf( _("dummy_bh.sdp = %p\n"), dummy_bh.sdp); printf( _("sdp->blks_alloced = %u\n"), dummy_bh.sdp->blks_alloced); printf( _("sdp->blks_total = %u\n"), dummy_bh.sdp->blks_total); printf( _("sdp->device_name = %s\n"), dummy_bh.sdp->device_name); //gfs2_dirent_in(&dirent, (char *)dentp); //gfs2_dirent_print(&dirent, output); //gfs2_dinode_print(struct gfs2_dinode *di); close(fd); } I will keep you all informed on the progress of this story. My next step will be - depending on the further progress of the fsck - (if it fails or not) to overwrite the "lock_" and/or "fsck_" flags in the image and to mount the gfs2 image to see what happens. Meanwhile during the run of fsck which could take a while (used swap space now is more the 510GB) as I was told, I hope someone could show me how to run through the inodes using libgfs2 to collect data from them or to point me to the right direction. Many Thanks in Advance and a nice Easter weekend. Bye Markus ******************************************************* Markus Wolfgart DLR Oberpfaffenhofen German Remote Sensing Data Center . . . e-mail: markus.wolfgart at dlr.de ********************************************************** Hi Bob, thanks for prompt reply! the fs originally was 12.4TB (6TB used) big. 
After a resize attempt to 25TB by gfs2_grow (very very old version gfs2-utils 1.62) The fs was expand and the first impression looks good as df reported the size of 25TB. But looking from the second node to the fs (two nod system) ls -r and ls -R throws IO errors and gfs2 mount get frozen (reboot of machine was performed). As no shrinking of gfs2 was possible to rollback, the additional physical volume was removed from the logical volume (lvresize to org. size & pvremove). This hard cut of the gsf2 unfenced partition should be hopefully repaired by the fsck.gfs2 (newest version), this was my thought. Even if this will not be the case, I could not run the fsck.gfs2 due to a "of memory in compute_rgrp_layout" message. see strace output: write(1, "9098813: start: 4769970307031 (0"..., 739098813: start: 4769970307031 (0x4569862bfd7), length = 524241 (0x7ffd1) ) = 73 write(1, "9098814: start: 4769970831272 (0"..., 739098814: start: 4769970831272 (0x456986abfa8), length = 524241 (0x7ffd1) ) = 73 write(1, "9098815: start: 4769971355513 (0"..., 739098815: start: 4769971355513 (0x4569872bf79), length = 524241 (0x7ffd1) ) = 73 write(1, "9098816: start: 4769971879754 (0"..., 739098816: start: 4769971879754 (0x456987abf4a), length = 524241 (0x7ffd1) ) = 73 write(1, "9098817: start: 4769972403995 (0"..., 739098817: start: 4769972403995 (0x4569882bf1b), length = 524241 (0x7ffd1) ) = 73 brk(0xb7dea000) = 0xb7dc9000 mmap2(NULL, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory) mmap2(NULL, 2097152, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = -1 ENOMEM (Cannot allocate memory) mmap2(NULL, 1048576, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = -1 ENOMEM (Cannot allocate memory) mmap2(NULL, 2097152, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = -1 ENOMEM (Cannot allocate memory) mmap2(NULL, 1048576, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = -1 ENOMEM (Cannot allocate memory) write(2, "Out of memory in compute_rgrp_la"..., 37Out of memory in compute_rgrp_layout ) = 37 exit_group(-1) = ? As I had already increased my swapspace swapon -s Filename Type Size Used Priority /dev/sda3 partition 8385920 0 -3 /var/swapfile.bin file 33554424 144 1 and run again the same situation as before I decide to start to extract the lost files by a c prog. Now I have create a big Image (7TB) on a xfs partition and would like to recover my files of interest by a program using libgfs2 or part of the source from gfs2-utils, as mentioned in my previous posting. As I see nearly all of the files located in the dir structure and get the position in the image by a simple string command, I hope to extract them in a simpler way. The RG size was set to the Max value of 2GB end each file I'm looking for is about 250BM big. The amount of files to be recovered is more then 16k. Every file have a header with his file name ant the total size, so it should be easy to check if the recovery of it is successful. So thats my theory, but this could be a easter vacation project without the right knowledge of gfs2. As I'm lucky to have the gfs2-utils source I hope it could be done. But if there is a simpler way to do a recovery by the installed gfs2 progs like gfs2_edit or gfs2_tool or other tools it would be nice if someone could show my the proper way. Many Thanks in advance Markus -- ******************************************************* Markus Wolfgart DLR Oberpfaffenhofen German Remote Sensing Data Center . . . 
e-mail: markus.wolfgart at dlr.de ********************************************************** ----- "Markus Wolfgart" wrote: | Hallo Cluster and GFS Experts, | | I'm a new subscriber of this mailing list and appologise | in the case my posting is offtopic. | | I'm looking for help concerning a corrupt gfs2 file system | which could not be recovered by me by fsck.gfs2 (Ver. 3.0.9) | due to to less less physical memory (4GB) eaven if increasing it | by a additional swap space (now about 35GB). | | I would like to parse a image created of the lost fs (the first 6TB) | with the code provided in the new gfs2-utils release. | | Due to this circumstance I hope to find in this mailing list some | hints | concerning an automated step by step recovery of lost data. | | Many Thanks in advance for your help | | Markus Hi Markus, You said that fsck.gfs2 is not working but you did not say what messages it gives you when you try. This must be a very big file system. How big is it? Was it converted from gfs1? Regards, Bob Peterson Red Hat File Systems From jcasale at activenetwerx.com Fri Apr 2 16:36:33 2010 From: jcasale at activenetwerx.com (Joseph L. Casale) Date: Fri, 2 Apr 2010 16:36:33 +0000 Subject: [Linux-cluster] Is nit worth setting up jumbo Ethernet frames on the cluster interconnect link? In-Reply-To: <64D0546C5EBBD147B75DE133D798665F055D8DEC@hugo.eprize.local> References: <036B68E61A28CA49AC2767596576CD596906CE67A0@GVW1113EXC.americas.hpqcorp.net> <64D0546C5EBBD147B75DE133D798665F055D8DEC@hugo.eprize.local> Message-ID: >On the other hand, if you are using Ethernet for storage, you almost certainly want jumbo frames on any such interfaces. Can't say I agree with a simple broad statement like that. Rather than restate what smarter people than me have already done, here is a quote from a dev in the IET list I have come to trust: http://old.nabble.com/Re%3A-Performance-increase-p10218304.html Test with and without, simply enabling jumbo's isn't certainly going to help, it may or may not. From arunkp1987 at gmail.com Sat Apr 3 11:32:29 2010 From: arunkp1987 at gmail.com (Arun Kp) Date: Sat, 3 Apr 2010 17:02:29 +0530 Subject: [Linux-cluster] Linux-cluster Digest, Vol 71, Issue 46 In-Reply-To: References: Message-ID: Dear All , May I Know how to do the Active-Active Clustering in RHEL 5.4 -- Thanks&Regards, Arun K P HCL Infosystems Ltd Kolkata 26 On 30 March 2010 21:30, wrote: > Send Linux-cluster mailing list submissions to > linux-cluster at redhat.com > > To subscribe or unsubscribe via the World Wide Web, visit > https://www.redhat.com/mailman/listinfo/linux-cluster > or, via email, send a message with subject or body 'help' to > linux-cluster-request at redhat.com > > You can reach the person managing the list at > linux-cluster-owner at redhat.com > > When replying, please edit your Subject line so it is more specific > than "Re: Contents of Linux-cluster digest..." > > > Today's Topics: > > 1. RHEL5.4: conga luci - Runtime Error: maximum recursion depth > exceeded (Hofmeister, James (WTEC Linux)) > 2. why does ip.sh launch rdisc ? 
(Martin Waite) > > > ---------------------------------------------------------------------- > > Message: 1 > Date: Mon, 29 Mar 2010 18:11:45 +0000 > From: "Hofmeister, James (WTEC Linux)" > To: "linux-cluster at redhat.com" > Subject: [Linux-cluster] RHEL5.4: conga luci - Runtime Error: maximum > recursion depth exceeded > Message-ID: > < > EC61DD7B6048464AB0E1B713AF7521BC1760ECB4E7 at GVW0676EXC.americas.hpqcorp.net > > > > Content-Type: text/plain; charset="us-ascii" > > Hello All, > RE: RHEL5.4: conga luci - Runtime Error: maximum recursion depth exceeded > > Has anybody seen this? RHEL5.4 with ricci-0.12.2-6.el5_4.1-x86_64 and > luci-0.12.1-7.el5.x86_64: > > Runtime Error > Sorry, a site error occurred. > > Traceback (innermost last): > > * Module ZPublisher.Publish, line 196, in publish_module_standard > > * Module Products.PlacelessTranslationService.PatchStringIO, line > 34, in new_publish > * Module ZPublisher.Publish, line 146, in publish > * Module Zope2.App.startup, line 222, in > zpublisher_exception_hook > * Module ZPublisher.Publish, line 121, in publish > * Module Zope2.App.startup, line 240, in commit > * Module transaction._manager, line 96, in commit > * Module transaction._transaction, line 380, in commit > * Module transaction._transaction, line 378, in commit > * Module transaction._transaction, line 433, in _commitResources > * Module ZODB.Connection, line 484, in commit > * Module ZODB.Connection, line 526, in _commit > * Module ZODB.Connection, line 553, in _store_objects > * Module ZODB.serialize, line 407, in serialize > * Module ZODB.serialize, line 416, in _dump > > Runtime Error: maximum recursion depth exceeded (Also, the following > error occurred while attempting to render the standard error message, > please see the event log for full details: An operation previously > failed, with traceback: File > "/usr/lib64/luci/zope/lib/python/ZServer/PubCore/ZServerPubl > isher.py", line 23, in __init__ response=response) File > "/usr/lib64/luci/zope/lib/python/ZPublisher/Publish.py&q > uot;, line 395, in publish_module environ, debug, request, response) > File > "/usr/lib64/luci/zope/lib/python/ZPublisher/Publish.py&q > uot;, line 196, in publish_module_standard response = > publish(request, module_name, after list, debug=debug) File > "/usr/lib64/luci/zope/lib/python/Products/PlacelessTranslati > onService/PatchStringIO.py", line 34, in new_publish x = > Publish.old_publish(request, module_name, after_list, debug) File > "/usr/lib64/luci/zope/lib/python/ZPublisher/Publish.py&q > uot;, line 121, in publish transactions_manager.commit() File > "/usr/lib64/luci/zope/lib/python/Zope2/App/startup.py&qu > ot;, line 240, in commit transaction.commit() File > "/usr/lib64/luci/zope/lib/python/transaction/_manager.py& > ;quot;, line 96, in commit return self.get().commit(sub, > deprecation_wng=False) File > "/usr/lib64/luci/zope/lib/python/transaction/_transaction.py > ", line 380, in commit self._saveCommitishError() # This > raises! 
File > "/usr/lib64/luci/zope/lib/python/transaction/_transaction.py > ", line 378, in commit self._commitResources() File > "/usr/lib64/luci/zope/lib/python/transaction/_transaction.py > ", line 433, in _commitResources rm.commit(self) File > "/usr/lib64/luci/zope/lib/python/ZODB/Connection.py" > ;, line 484, in commit self._commit(transaction) File > "/usr/lib64/luci/zope/lib/python/ZODB/Connection.py" > ;, line 526, in _commit self._store_objects(ObjectWriter(obj), > transaction) File > "/usr/lib64/luci/zope/lib/python/ZODB/Connection.py" > ;, line 553, in _store_objects p = writer.serialize(obj) # This calls > __getstate__ of obj File > "/usr/lib64/luci/zope/lib/python/ZODB/serialize.py" > , line 407, in serialize return self._dump(meta, obj.__getstate__()) > File > "/usr/lib64/luci/zope/lib/python/ZODB/serialize.py" > , line 416, in _dump self._p.dump(state) RuntimeError: maximum > recursion depth exceeded ) > > Regards, > James Hofmeister > > > > > ------------------------------ > > Message: 2 > Date: Tue, 30 Mar 2010 05:11:51 +0100 > From: "Martin Waite" > To: > Subject: [Linux-cluster] why does ip.sh launch rdisc ? > Message-ID: > > Content-Type: text/plain; charset="iso-8859-1" > > Hi, > > I have noticed that rdisc - apparently a router discovery protocol daemon - > has started running on nodes that take possession of a VIP using ip.sh. > > I am not familiar with rdisc. It is currently installed on all my RHEL > hosts, but is not running. > > Do I need to run rdisc ? > > Also, the man page says that rdisc uses 224.0.0.1 as a multicast address. > So does my current cman configuration. Should I configure cman to avoid > this address ? > > regards, > Martin > > > > > ------------------------------ > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > End of Linux-cluster Digest, Vol 71, Issue 46 > ********************************************* > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jakov.sosic at srce.hr Sun Apr 4 01:44:10 2010 From: jakov.sosic at srce.hr (Jakov Sosic) Date: Sun, 04 Apr 2010 03:44:10 +0200 Subject: [Linux-cluster] rgmanager and clvm don't work after reboot In-Reply-To: <4BAA2A1C.7020106@srce.hr> References: <4BA92052.5020806@srce.hr> <4BA94112.7010005@srce.hr> <4BAA2A1C.7020106@srce.hr> Message-ID: <4BB7EEEA.1020706@srce.hr> On 03/24/2010 04:05 PM, Jakov Sosic wrote: > On 03/23/2010 11:30 PM, Jakov Sosic wrote: >> On 03/23/2010 09:10 PM, Jakov Sosic wrote: >> >>> I this a similar issue? Services trying to communicate with member 0, >>> which is a qdisk and not a real member? :-/ >> >> >> If I start clvmd with "-d 2" options (debug), I get this: >> >> # /etc/init.d/clvmd-debug start >> Starting clvmd: CLVMD[eefac820]: Mar 23 23:23:32 CLVMD started >> CLVMD[eefac820]: Mar 23 23:23:32 Connected to CMAN >> CLVMD[eefac820]: Mar 23 23:23:32 CMAN initialisation complete >> >> and it hangs there... > > Also it seems that this disturbed all the instances of clvm on other > nodes too :( So now I can't 'lvs' or 'vgs' on any node... > > It seems that cluster restart is imminent :( > > This seems like a horrible bug :( > > So no opinions on this one? I still have a locked and unuseable cluster which I have to reboot because of this :( And the worst part is problem is reproducable - it's enough to leave it a couple of days on, and then reboot a single node... 
-- | Jakov Sosic | ICQ: 28410271 | PGP: 0x965CAE2D | ================================================================= | start fighting cancer -> http://www.worldcommunitygrid.org/ | From Alain.Hoang at hp.com Sun Apr 4 15:52:43 2010 From: Alain.Hoang at hp.com (Hoang, Alain) Date: Sun, 4 Apr 2010 15:52:43 +0000 Subject: [Linux-cluster] RHCS: Multi site cluster Message-ID: <58C6777539C300489D145B0F8E29C3281ACBB7D7B7@GVW0673EXC.americas.hpqcorp.net> Hello, With RHCS 5.4, is it possible to build a cluster on multiple site? Does CLVM 5.4 allows SAN replication across 2 sites? Best Regards, Ki?n L?m Alain Hoang, -------------- next part -------------- An HTML attachment was scrubbed... URL: From scooter at cgl.ucsf.edu Sun Apr 4 20:58:25 2010 From: scooter at cgl.ucsf.edu (Scooter Morris) Date: Sun, 04 Apr 2010 13:58:25 -0700 Subject: [Linux-cluster] RHEL 5.5 Crash in gfs2 Message-ID: <4BB8FD71.3080300@plato.cgl.ucsf.edu> Hi all, We recently upgraded to 5.5 (kernel 2.6.18-194.el5) to get some of the gfs2 fixes on a 3 node cluster, but crashed two days later with the following stack trace: [2010-04-04 10:28:48]Unable to handle kernel NULL pointer dereference at 0000000000000078 RIP: ^M [2010-04-04 10:28:48] [] :gfs2:revoke_lo_add+0x1a/0x32^M [2010-04-04 10:28:48]PGD 7d4297067 PUD 13a24c067 PMD 0 ^M [2010-04-04 10:28:48]Oops: 0002 [1] SMP ^M [2010-04-04 10:28:48]last sysfs file: /devices/pci0000:00/0000:00:01.0/0000:03:00.0/0000:04:01.0/0000:07:00.0/0000:08:00.0/irq^M [2010-04-04 10:28:48]CPU 8 ^M [2010-04-04 10:28:48]Modules linked in: ipt_MASQUERADE iptable_nat ip_nat bridge autofs4 hidp l2cap bluetooth lock_dlm gfs2 dlm configfs lockd sunrpc ip_conntrack_netbios_ns xt_state ip_conntrack nfnetlink xt_tcpudp ipt_REJECT iptable_filter ip_tables arpt_mangle arptable_filter arp_tables x_tables ib_iser libiscsi2 scsi_transport_iscsi2 scsi_transport_iscsi ib_srp rds ib_sdp ib_ipoib ipoib_helper ipv6 xfrm_nalgo crypto_api rdma_ucm rdma_cm ib_ucm ib_uverbs ib_umad ib_cm iw_cm ib_addr ib_sa ib_mad ib_core dm_round_robin dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec i2c_core dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport sg ide_cd bnx2 cdrom hpilo serio_raw pcspkr dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod qla2xxx scsi_transport_fc ata_piix libata shpchp cciss sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd^M [2010-04-04 10:28:49]Pid: 795, comm: kswapd0 Not tainted 2.6.18-194.el5 #1^M [2010-04-04 10:28:49]RIP: 0010:[] [] :gfs2:revoke_lo_add+0x1a/0x32^M [2010-04-04 10:28:49]RSP: 0018:ffff81082efcdae8 EFLAGS: 00010282^M [2010-04-04 10:28:49]RAX: 0000000000000000 RBX: ffff810256e037f0 RCX: ffff8100207fd180^M [2010-04-04 10:28:49]RDX: ffff81051abdf630 RSI: ffff810819032720 RDI: ffff810819032000^M [2010-04-04 10:28:49]RBP: ffff81051abdf610 R08: ffff81011cb31b06 R09: ffff81082efcdb20^M [2010-04-04 10:28:49]R10: ffff8101135d8330 R11: ffffffff887dc3b9 R12: ffff810819032000^M [2010-04-04 10:28:49]R13: 0000000000000000 R14: ffff810256e037f0 R15: ffff810819032000^M [2010-04-04 10:28:50]FS: 0000000000000000(0000) GS:ffff81011cb319c0(0000) knlGS:0000000000000000^M [2010-04-04 10:28:50]CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b^M [2010-04-04 10:28:50]CR2: 0000000000000078 CR3: 000000024ed62000 CR4: 00000000000006e0^M [2010-04-04 10:28:50]Process kswapd0 (pid: 795, threadinfo ffff81082efcc000, task ffff81082f5d17a0)^M [2010-04-04 10:28:50]Stack: ffffffff887dd88c 000000002efcde10 ffff810256e037f0 
ffff81011c7fadd8^M [2010-04-04 10:28:50] 0000000000000000 0000000000000000 ffffffff887deaf6 000000000000000e^M [2010-04-04 10:28:50] ffff81011c7fadd8 00000000000000b0 ffff81082efcdcf0 ffff810819032000^M [2010-04-04 10:28:50]Call Trace:^M [2010-04-04 10:28:50] [] :gfs2:gfs2_remove_from_journal+0x11f/0x131^M [2010-04-04 10:28:50] [] :gfs2:gfs2_invalidatepage+0xea/0x151^M [2010-04-04 10:28:50] [] :gfs2:gfs2_writepage_common+0x95/0xb1^M [2010-04-04 10:28:50] [] :gfs2:gfs2_jdata_writepage+0x56/0x98^M [2010-04-04 10:28:50] [] shrink_inactive_list+0x3fd/0x8d8^M [2010-04-04 10:28:50] [] __pagevec_release+0x19/0x22^M [2010-04-04 10:28:51] [] shrink_active_list+0x4b4/0x4c4^M [2010-04-04 10:28:51] [] shrink_zone+0x127/0x18d^M [2010-04-04 10:28:51] [] kswapd+0x323/0x46c^M [2010-04-04 10:28:51] [] autoremove_wake_function+0x0/0x2e^M [2010-04-04 10:28:51] [] keventd_create_kthread+0x0/0xc4^M [2010-04-04 10:28:51] [] kswapd+0x0/0x46c^M [2010-04-04 10:28:51] [] keventd_create_kthread+0x0/0xc4^M [2010-04-04 10:28:51] [] kthread+0xfe/0x132^M [2010-04-04 10:28:51] [] request_module+0x0/0x14d^M [2010-04-04 10:28:51] [] child_rip+0xa/0x11^M [2010-04-04 10:28:51] [] keventd_create_kthread+0x0/0xc4^M [2010-04-04 10:28:51] [] kthread+0x0/0x132^M [2010-04-04 10:28:51] [] child_rip+0x0/0x11^M This looks exactly like bug 437803, but that was closed early last year. Does anyone have any ideas what might be going on? I double-checked, and we definitely do not have the old kmod-gfs2 installed. -- scooter From scooter at cgl.ucsf.edu Sun Apr 4 21:07:23 2010 From: scooter at cgl.ucsf.edu (Scooter Morris) Date: Sun, 04 Apr 2010 14:07:23 -0700 Subject: [Linux-cluster] RHEL 5.5 Crash in gfs2 In-Reply-To: <4BB8FD71.3080300@plato.cgl.ucsf.edu> References: <4BB8FD71.3080300@plato.cgl.ucsf.edu> Message-ID: <4BB8FF8B.5040201@plato.cgl.ucsf.edu> I should point out that this is an exact duplicate of the crash we saw on 3/17 on RHEL 5.4. We're not explicitly doing any journaled files, although we did have a number of MySQL databases running on that node. 
-- scooter On 04/04/2010 01:58 PM, Scooter Morris wrote: > Hi all, > We recently upgraded to 5.5 (kernel 2.6.18-194.el5) to get some of > the gfs2 fixes on a 3 node cluster, but crashed two days later with > the following stack trace: > > [2010-04-04 10:28:48]Unable to handle kernel NULL pointer dereference > at 0000000000000078 RIP: ^M > [2010-04-04 10:28:48] [] > :gfs2:revoke_lo_add+0x1a/0x32^M > [2010-04-04 10:28:48]PGD 7d4297067 PUD 13a24c067 PMD 0 ^M > [2010-04-04 10:28:48]Oops: 0002 [1] SMP ^M > [2010-04-04 10:28:48]last sysfs file: > /devices/pci0000:00/0000:00:01.0/0000:03:00.0/0000:04:01.0/0000:07:00.0/0000:08:00.0/irq^M > > [2010-04-04 10:28:48]CPU 8 ^M > [2010-04-04 10:28:48]Modules linked in: ipt_MASQUERADE iptable_nat > ip_nat bridge autofs4 hidp l2cap bluetooth lock_dlm gfs2 dlm configfs > lockd sunrpc ip_conntrack_netbios_ns xt_state ip_conntrack nfnetlink > xt_tcpudp ipt_REJECT iptable_filter ip_tables arpt_mangle > arptable_filter arp_tables x_tables ib_iser libiscsi2 > scsi_transport_iscsi2 scsi_transport_iscsi ib_srp rds ib_sdp ib_ipoib > ipoib_helper ipv6 xfrm_nalgo crypto_api rdma_ucm rdma_cm ib_ucm > ib_uverbs ib_umad ib_cm iw_cm ib_addr ib_sa ib_mad ib_core > dm_round_robin dm_multipath scsi_dh video backlight sbs power_meter > hwmon i2c_ec i2c_core dell_wmi wmi button battery asus_acpi > acpi_memhotplug ac parport_pc lp parport sg ide_cd bnx2 cdrom hpilo > serio_raw pcspkr dm_raid45 dm_message dm_region_hash dm_mem_cache > dm_snapshot dm_zero dm_mirror dm_log dm_mod qla2xxx scsi_transport_fc > ata_piix libata shpchp cciss sd_mod scsi_mod ext3 jbd uhci_hcd > ohci_hcd ehci_hcd^M > [2010-04-04 10:28:49]Pid: 795, comm: kswapd0 Not tainted > 2.6.18-194.el5 #1^M > [2010-04-04 10:28:49]RIP: 0010:[] > [] :gfs2:revoke_lo_add+0x1a/0x32^M > [2010-04-04 10:28:49]RSP: 0018:ffff81082efcdae8 EFLAGS: 00010282^M > [2010-04-04 10:28:49]RAX: 0000000000000000 RBX: ffff810256e037f0 RCX: > ffff8100207fd180^M > [2010-04-04 10:28:49]RDX: ffff81051abdf630 RSI: ffff810819032720 RDI: > ffff810819032000^M > [2010-04-04 10:28:49]RBP: ffff81051abdf610 R08: ffff81011cb31b06 R09: > ffff81082efcdb20^M > [2010-04-04 10:28:49]R10: ffff8101135d8330 R11: ffffffff887dc3b9 R12: > ffff810819032000^M > [2010-04-04 10:28:49]R13: 0000000000000000 R14: ffff810256e037f0 R15: > ffff810819032000^M > [2010-04-04 10:28:50]FS: 0000000000000000(0000) > GS:ffff81011cb319c0(0000) knlGS:0000000000000000^M > [2010-04-04 10:28:50]CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b^M > [2010-04-04 10:28:50]CR2: 0000000000000078 CR3: 000000024ed62000 CR4: > 00000000000006e0^M > [2010-04-04 10:28:50]Process kswapd0 (pid: 795, threadinfo > ffff81082efcc000, task ffff81082f5d17a0)^M > [2010-04-04 10:28:50]Stack: ffffffff887dd88c 000000002efcde10 > ffff810256e037f0 ffff81011c7fadd8^M > [2010-04-04 10:28:50] 0000000000000000 0000000000000000 > ffffffff887deaf6 000000000000000e^M > [2010-04-04 10:28:50] ffff81011c7fadd8 00000000000000b0 > ffff81082efcdcf0 ffff810819032000^M > [2010-04-04 10:28:50]Call Trace:^M > [2010-04-04 10:28:50] [] > :gfs2:gfs2_remove_from_journal+0x11f/0x131^M > [2010-04-04 10:28:50] [] > :gfs2:gfs2_invalidatepage+0xea/0x151^M > [2010-04-04 10:28:50] [] > :gfs2:gfs2_writepage_common+0x95/0xb1^M > [2010-04-04 10:28:50] [] > :gfs2:gfs2_jdata_writepage+0x56/0x98^M > [2010-04-04 10:28:50] [] > shrink_inactive_list+0x3fd/0x8d8^M > [2010-04-04 10:28:50] [] __pagevec_release+0x19/0x22^M > [2010-04-04 10:28:51] [] > shrink_active_list+0x4b4/0x4c4^M > [2010-04-04 10:28:51] [] shrink_zone+0x127/0x18d^M > [2010-04-04 
10:28:51] [] kswapd+0x323/0x46c^M > [2010-04-04 10:28:51] [] > autoremove_wake_function+0x0/0x2e^M > [2010-04-04 10:28:51] [] > keventd_create_kthread+0x0/0xc4^M > [2010-04-04 10:28:51] [] kswapd+0x0/0x46c^M > [2010-04-04 10:28:51] [] > keventd_create_kthread+0x0/0xc4^M > [2010-04-04 10:28:51] [] kthread+0xfe/0x132^M > [2010-04-04 10:28:51] [] request_module+0x0/0x14d^M > [2010-04-04 10:28:51] [] child_rip+0xa/0x11^M > [2010-04-04 10:28:51] [] > keventd_create_kthread+0x0/0xc4^M > [2010-04-04 10:28:51] [] kthread+0x0/0x132^M > [2010-04-04 10:28:51] [] child_rip+0x0/0x11^M > > This looks exactly like bug 437803, but that was closed early last > year. Does anyone have any ideas what might be going on? I > double-checked, and we definitely do not have the old kmod-gfs2 > installed. > > -- scooter > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From jumanjiman at gmail.com Mon Apr 5 01:07:39 2010 From: jumanjiman at gmail.com (Paul Morgan) Date: Sun, 4 Apr 2010 21:07:39 -0400 Subject: [Linux-cluster] RHCS: Multi site cluster In-Reply-To: <58C6777539C300489D145B0F8E29C3281ACBB7D7B7@GVW0673EXC.americas.hpqcorp.net> References: <58C6777539C300489D145B0F8E29C3281ACBB7D7B7@GVW0673EXC.americas.hpqcorp.net> Message-ID: You can do it, but you need to ensure low latency and design the stack to minimize the wide area replication. In other words, use fiber. Put the replication traffic on a separate segment if possible, and definitely in a separate layer from the application traffic. Also: double check with your Red Hat sales team on support. On Apr 4, 2010 12:11 PM, "Hoang, Alain" wrote: Hello, With RHCS 5.4, is it possible to build a cluster on multiple site? Does CLVM 5.4 allows SAN replication across 2 sites? Best Regards, Ki?n L?m Alain Hoang, -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: From raju.rajsand at gmail.com Mon Apr 5 03:24:13 2010 From: raju.rajsand at gmail.com (Rajagopal Swaminathan) Date: Mon, 5 Apr 2010 08:54:13 +0530 Subject: [Linux-cluster] Linux-cluster Digest, Vol 71, Issue 46 In-Reply-To: References: Message-ID: Greetings, On Sat, Apr 3, 2010 at 5:02 PM, Arun Kp wrote: > Dear All , > > > May I Know how to do the Active-Active Clustering in RHEL 5.4 > First of don't top post. I know it is a common stupid practice in most company e-mail id-s. Active ac-tive cluster means the services that are controlled by cluster are available from both nodes. For examples LTSP. But this is not applicable for databases such as mysql, postgresql or Oracle. For databases you would require special software such as RAC for oracle and mysql cluster for Mysql to maintain cache coherancey. Active/Active configuration yields high availability _and_ load balancing if configured properly. Also google somewhat before asking such preliminary questions. 
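In rgmanager terms, "active-active" usually just means defining more than one service and giving each a failover domain that prefers a different node, so both nodes do useful work until one fails. A hypothetical cluster.conf fragment (node names, service names and the resource contents are made up for illustration):

<rm>
  <failoverdomains>
    <failoverdomain name="prefer-node1" ordered="1" restricted="0">
      <failoverdomainnode name="node1" priority="1"/>
      <failoverdomainnode name="node2" priority="2"/>
    </failoverdomain>
    <failoverdomain name="prefer-node2" ordered="1" restricted="0">
      <failoverdomainnode name="node2" priority="1"/>
      <failoverdomainnode name="node1" priority="2"/>
    </failoverdomain>
  </failoverdomains>
  <service name="web" domain="prefer-node1" autostart="1"> <!-- ip/fs/script resources --> </service>
  <service name="db" domain="prefer-node2" autostart="1"> <!-- ip/fs/script resources --> </service>
</rm>

Truly concurrent access to the same data additionally needs a cluster filesystem such as GFS/GFS2, or the application-level clustering (RAC, MySQL Cluster) mentioned above.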
Regards, Rajagopal

From Chris.Jankowski at hp.com Mon Apr 5 03:39:22 2010
From: Chris.Jankowski at hp.com (Jankowski, Chris)
Date: Mon, 5 Apr 2010 03:39:22 +0000
Subject: [Linux-cluster] RHCS: Multi site cluster
In-Reply-To:
References: <58C6777539C300489D145B0F8E29C3281ACBB7D7B7@GVW0673EXC.americas.hpqcorp.net>
Message-ID: <036B68E61A28CA49AC2767596576CD596906CE693C@GVW1113EXC.americas.hpqcorp.net>

Another comment: you can certainly do it, but you may be surprised that the result is neither as resilient nor as highly available as initially hoped, due to the limitations of the cluster subsystems. I'll give just two examples where you may hit unresolvable difficulties; the first of them is obvious and the second one much more subtle:

1. Fencing. Assume that the link between sites A and B is severed and site A retained quorum whereas site B, for the sake of this simple example, did not. A node on site A will try to fence the nodes on site B, but it cannot complete the operation, as the links are down. So fencing will block and you will have no access to the GFS2 filesystems on site A, as they are awaiting recovery of locks that can only commence once fencing has completed. This problem is inherent to the whole philosophy of fencing. The fencing model assumes that the fencing node has full control over the whole environment no matter what happens to that environment. IMHO, this works reasonably well for two computers under the desk with a power switch, but fails miserably over distance. Please note that clusters other than RHEL Cluster Suite do not implement the fencing stance, i.e. they do not assume that a node is omnipotent. Their stance is much more conservative with regard to what a node can achieve within the environment. The difference looks minor, but the consequences are dramatic when working with stretched clusters. There are many kludges you can apply to bend fencing to your will, but you really cannot change the basic design philosophy, I believe.

2. CLVM. This is more subtle. Suppose you have a single logical volume built out of 4 physical volumes spread between the sites, and you built it as a mirrored volume. Imagine a fire in the data centre. There are scenarios in which CLVM will still report the volume as OK, but it has lost the left part of the first half and the right part of the second half. Then the last fibre melts and you have two sets, one in each site, but neither of them is consistent. Your GFS filesystem built on the volume will be corrupted at either site. Again, these limitations are architectural:

1. LVM2 and CLVM are missing an intermediate layer between the physical volume and the logical volume. This layer is called the plex layer in some UNIX LVMs (for example Veritas VxVM) and handles the characteristics and state of an independent component of a volume.

2. CLVM has no notion of where a physical volume is located. Once you have this notion you can build a plex quorum in a location and guarantee that a plex in a site will be consistent. Note that this allows an LVM that has this feature to deal with multiple, non-simultaneous failures affecting multiple locations. Standard LVMs are not designed to do that. To the best of my knowledge only the OpenVMS cluster has this feature. Perhaps the newest VxVM has it too; I have not worked with it in years. Anybody know?

3. There are other constraints in the management layer that IMHO result in uncertain recovery scenarios. Again, they mostly stem from lacking the plex layer.
-------------- This is why I believe that trying to do a stretched RHEL cluster is a bad idea, unless you do not care about its integrity and resiliency. I always recommend (also for process reasons) splitting of the local HA (automatic replacement of a failed, single, redundant component in a single site) with DR (human in the loop, push button, failower solution). Normally you need an independent HA cluster in each site. RHEL cluster can deliver the HA part, whereas disk replication in storage with integrated push-button failover solution can deliver the other part. Note that if properly impleneted both solutions can deliver crash consistent recovery. Regards, Chris Jankowski ________________________________ From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Paul Morgan Sent: Monday, 5 April 2010 11:08 To: linux clustering Subject: Re: [Linux-cluster] RHCS: Multi site cluster You can do it, but you need to ensure low latency and design the stack to minimize the wide area replication. In other words, use fiber. Put the replication traffic on a separate segment if possible, and definitely in a separate layer from the application traffic. Also: double check with your Red Hat sales team on support. On Apr 4, 2010 12:11 PM, "Hoang, Alain" wrote: Hello, With RHCS 5.4, is it possible to build a cluster on multiple site? Does CLVM 5.4 allows SAN replication across 2 sites? Best Regards, Ki?n L?m Alain Hoang, -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From celsowebber at yahoo.com Mon Apr 5 11:01:42 2010 From: celsowebber at yahoo.com (Celso K. Webber) Date: Mon, 5 Apr 2010 04:01:42 -0700 (PDT) Subject: [Linux-cluster] why does ip.sh launch rdisc ? In-Reply-To: References: Message-ID: <819385.7710.qm@web111711.mail.gq1.yahoo.com> Hello, In one of our customers, the rdisc daemon was causing problems because they had an wireless access point connected to the network, so rdisc was setting up a default route to this access point as a gateway automatically. Although the access point could be reconfigured or disabled, the customer preferred to remove the rdisc invocation in the Cluster ip.sh script. I'd rather prefer the rdisc invocation would be an option in cluster.conf, not a default behaviour. Regards, Celso. ----- Original Message ---- From: Martin Waite To: Linux-cluster at redhat.com Sent: Tue, March 30, 2010 1:11:51 AM Subject: [Linux-cluster] why does ip.sh launch rdisc ? Hi, I have noticed that rdisc - apparently a router discovery protocol daemon - has started running on nodes that take possession of a VIP using ip.sh. I am not familiar with rdisc. It is currently installed on all my RHEL hosts, but is not running. Do I need to run rdisc ? Also, the man page says that rdisc uses 224.0.0.1 as a multicast address. So does my current cman configuration. Should I configure cman to avoid this address ? 
regards, Martin -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From jakov.sosic at srce.hr Mon Apr 5 19:22:32 2010 From: jakov.sosic at srce.hr (Jakov Sosic) Date: Mon, 05 Apr 2010 21:22:32 +0200 Subject: [Linux-cluster] rgmanager and clvm don't work after reboot In-Reply-To: <4BB7EEEA.1020706@srce.hr> References: <4BA92052.5020806@srce.hr> <4BA94112.7010005@srce.hr> <4BAA2A1C.7020106@srce.hr> <4BB7EEEA.1020706@srce.hr> Message-ID: <4BBA3878.3020805@srce.hr> On 04/04/2010 03:44 AM, Jakov Sosic wrote: > On 03/24/2010 04:05 PM, Jakov Sosic wrote: >> On 03/23/2010 11:30 PM, Jakov Sosic wrote: >>> On 03/23/2010 09:10 PM, Jakov Sosic wrote: >>> >>>> I this a similar issue? Services trying to communicate with member 0, >>>> which is a qdisk and not a real member? :-/ >>> >>> >>> If I start clvmd with "-d 2" options (debug), I get this: >>> >>> # /etc/init.d/clvmd-debug start >>> Starting clvmd: CLVMD[eefac820]: Mar 23 23:23:32 CLVMD started >>> CLVMD[eefac820]: Mar 23 23:23:32 Connected to CMAN >>> CLVMD[eefac820]: Mar 23 23:23:32 CMAN initialisation complete >>> >>> and it hangs there... >> >> Also it seems that this disturbed all the instances of clvm on other >> nodes too :( So now I can't 'lvs' or 'vgs' on any node... >> >> It seems that cluster restart is imminent :( >> >> This seems like a horrible bug :( >> >> > > > So no opinions on this one? I still have a locked and unuseable cluster > which I have to reboot because of this :( And the worst part is problem > is reproducable - it's enough to leave it a couple of days on, and then > reboot a single node... > I see a following error on TTY (alt+ctrl+f1): dlm: clvmd: group join failed -512 0 If this helps... Maybe I should post this to a linux-cluster devel list? -- | Jakov Sosic | ICQ: 28410271 | PGP: 0x965CAE2D | ================================================================= | start fighting cancer -> http://www.worldcommunitygrid.org/ | From ccaulfie at redhat.com Tue Apr 6 07:40:47 2010 From: ccaulfie at redhat.com (Christine Caulfield) Date: Tue, 06 Apr 2010 08:40:47 +0100 Subject: [Linux-cluster] Listing openAIS parameters on RHEL Cluster Suite 5 In-Reply-To: <036B68E61A28CA49AC2767596576CD596906CE679D@GVW1113EXC.americas.hpqcorp.net> References: <036B68E61A28CA49AC2767596576CD596906CE679D@GVW1113EXC.americas.hpqcorp.net> Message-ID: <4BBAE57F.2000601@redhat.com> On 02/04/10 12:20, Jankowski, Chris wrote: > Hi, > > As per Red Hat Knowledgebase note 18886 on RHEL 5.4 I should be able to get the current in-memory values of the openAIS paramemters by running the following commands: > > # openais-confdb-display > totem.version = '2' > totem.secauth = '1' > # openais-confdb-display totem token > totem.token = '10000' > # openais-confdb-display totem consensustotem.consensus = '4800' > # openais-confdb-display totem token_retransmits_before_loss_const > totem.token_retransmits_before_loss_const = '20' > # openais-confdb-display cman quorum_dev_poll > cman.quorum_dev_poll = '40000' > # openais-confdb-display cman expected_votes > cman.expected_votes = '3' > # openais-confdb-display cman two_node > cman.two_node = '1' > > On my 5.4 cluster it works for the first 4 commands, but for the last 3 commands I get, respectively: > > Could not get "quorum _dev_poll" :1 > Could not get "expected_votes" :1 > Could not get "two_node" :1 > > Anybody knows what is going on here? > It means that those values are not in the configuration object database because they aren't in cluster.conf. 
From that you can infer that the defaults are in place. ie: quorum_dev_poll = 10000 expected_votes = two_node = 0 Chrissie From swhiteho at redhat.com Tue Apr 6 09:44:34 2010 From: swhiteho at redhat.com (Steven Whitehouse) Date: Tue, 06 Apr 2010 10:44:34 +0100 Subject: [Linux-cluster] RHEL 5.5 Crash in gfs2 In-Reply-To: <4BB8FD71.3080300@plato.cgl.ucsf.edu> References: <4BB8FD71.3080300@plato.cgl.ucsf.edu> Message-ID: <1270547074.2594.55.camel@localhost> Hi, Can you open a bugzilla about this? Thanks, Steve. On Sun, 2010-04-04 at 13:58 -0700, Scooter Morris wrote: > Hi all, > We recently upgraded to 5.5 (kernel 2.6.18-194.el5) to get some of > the gfs2 fixes on a 3 node cluster, but crashed two days later with the > following stack trace: > > [2010-04-04 10:28:48]Unable to handle kernel NULL pointer dereference at > 0000000000000078 RIP: ^M > [2010-04-04 10:28:48] [] :gfs2:revoke_lo_add+0x1a/0x32^M > [2010-04-04 10:28:48]PGD 7d4297067 PUD 13a24c067 PMD 0 ^M > [2010-04-04 10:28:48]Oops: 0002 [1] SMP ^M > [2010-04-04 10:28:48]last sysfs file: > /devices/pci0000:00/0000:00:01.0/0000:03:00.0/0000:04:01.0/0000:07:00.0/0000:08:00.0/irq^M > [2010-04-04 10:28:48]CPU 8 ^M > [2010-04-04 10:28:48]Modules linked in: ipt_MASQUERADE iptable_nat > ip_nat bridge autofs4 hidp l2cap bluetooth lock_dlm gfs2 dlm configfs > lockd sunrpc ip_conntrack_netbios_ns xt_state ip_conntrack nfnetlink > xt_tcpudp ipt_REJECT iptable_filter ip_tables arpt_mangle > arptable_filter arp_tables x_tables ib_iser libiscsi2 > scsi_transport_iscsi2 scsi_transport_iscsi ib_srp rds ib_sdp ib_ipoib > ipoib_helper ipv6 xfrm_nalgo crypto_api rdma_ucm rdma_cm ib_ucm > ib_uverbs ib_umad ib_cm iw_cm ib_addr ib_sa ib_mad ib_core > dm_round_robin dm_multipath scsi_dh video backlight sbs power_meter > hwmon i2c_ec i2c_core dell_wmi wmi button battery asus_acpi > acpi_memhotplug ac parport_pc lp parport sg ide_cd bnx2 cdrom hpilo > serio_raw pcspkr dm_raid45 dm_message dm_region_hash dm_mem_cache > dm_snapshot dm_zero dm_mirror dm_log dm_mod qla2xxx scsi_transport_fc > ata_piix libata shpchp cciss sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd > ehci_hcd^M > [2010-04-04 10:28:49]Pid: 795, comm: kswapd0 Not tainted 2.6.18-194.el5 #1^M > [2010-04-04 10:28:49]RIP: 0010:[] > [] :gfs2:revoke_lo_add+0x1a/0x32^M > [2010-04-04 10:28:49]RSP: 0018:ffff81082efcdae8 EFLAGS: 00010282^M > [2010-04-04 10:28:49]RAX: 0000000000000000 RBX: ffff810256e037f0 RCX: > ffff8100207fd180^M > [2010-04-04 10:28:49]RDX: ffff81051abdf630 RSI: ffff810819032720 RDI: > ffff810819032000^M > [2010-04-04 10:28:49]RBP: ffff81051abdf610 R08: ffff81011cb31b06 R09: > ffff81082efcdb20^M > [2010-04-04 10:28:49]R10: ffff8101135d8330 R11: ffffffff887dc3b9 R12: > ffff810819032000^M > [2010-04-04 10:28:49]R13: 0000000000000000 R14: ffff810256e037f0 R15: > ffff810819032000^M > [2010-04-04 10:28:50]FS: 0000000000000000(0000) > GS:ffff81011cb319c0(0000) knlGS:0000000000000000^M > [2010-04-04 10:28:50]CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b^M > [2010-04-04 10:28:50]CR2: 0000000000000078 CR3: 000000024ed62000 CR4: > 00000000000006e0^M > [2010-04-04 10:28:50]Process kswapd0 (pid: 795, threadinfo > ffff81082efcc000, task ffff81082f5d17a0)^M > [2010-04-04 10:28:50]Stack: ffffffff887dd88c 000000002efcde10 > ffff810256e037f0 ffff81011c7fadd8^M > [2010-04-04 10:28:50] 0000000000000000 0000000000000000 ffffffff887deaf6 > 000000000000000e^M > [2010-04-04 10:28:50] ffff81011c7fadd8 00000000000000b0 ffff81082efcdcf0 > ffff810819032000^M > [2010-04-04 10:28:50]Call Trace:^M > [2010-04-04 
10:28:50] [] > :gfs2:gfs2_remove_from_journal+0x11f/0x131^M > [2010-04-04 10:28:50] [] > :gfs2:gfs2_invalidatepage+0xea/0x151^M > [2010-04-04 10:28:50] [] > :gfs2:gfs2_writepage_common+0x95/0xb1^M > [2010-04-04 10:28:50] [] > :gfs2:gfs2_jdata_writepage+0x56/0x98^M > [2010-04-04 10:28:50] [] > shrink_inactive_list+0x3fd/0x8d8^M > [2010-04-04 10:28:50] [] __pagevec_release+0x19/0x22^M > [2010-04-04 10:28:51] [] shrink_active_list+0x4b4/0x4c4^M > [2010-04-04 10:28:51] [] shrink_zone+0x127/0x18d^M > [2010-04-04 10:28:51] [] kswapd+0x323/0x46c^M > [2010-04-04 10:28:51] [] > autoremove_wake_function+0x0/0x2e^M > [2010-04-04 10:28:51] [] keventd_create_kthread+0x0/0xc4^M > [2010-04-04 10:28:51] [] kswapd+0x0/0x46c^M > [2010-04-04 10:28:51] [] keventd_create_kthread+0x0/0xc4^M > [2010-04-04 10:28:51] [] kthread+0xfe/0x132^M > [2010-04-04 10:28:51] [] request_module+0x0/0x14d^M > [2010-04-04 10:28:51] [] child_rip+0xa/0x11^M > [2010-04-04 10:28:51] [] keventd_create_kthread+0x0/0xc4^M > [2010-04-04 10:28:51] [] kthread+0x0/0x132^M > [2010-04-04 10:28:51] [] child_rip+0x0/0x11^M > > This looks exactly like bug 437803, but that was closed early last > year. Does anyone have any ideas what might be going on? I > double-checked, and we definitely do not have the old kmod-gfs2 installed. > > -- scooter > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From swhiteho at redhat.com Tue Apr 6 11:29:17 2010 From: swhiteho at redhat.com (Steven Whitehouse) Date: Tue, 06 Apr 2010 12:29:17 +0100 Subject: [Linux-cluster] Is nit worth setting up jumbo Ethernet frames on the cluster interconnect link? In-Reply-To: <036B68E61A28CA49AC2767596576CD596906CE67A0@GVW1113EXC.americas.hpqcorp.net> References: <036B68E61A28CA49AC2767596576CD596906CE67A0@GVW1113EXC.americas.hpqcorp.net> Message-ID: <1270553357.727.222.camel@localhost.localdomain> Hi, On Fri, 2010-04-02 at 11:26 +0000, Jankowski, Chris wrote: > Hi, > > On a heavily used cluster with GFS2 is it worth setting up jumbo > Ethernet frames on the cluster interconnect link? Obviously, if only > miniscule portion of the packets travelling through this link are > larger than standard 1500 MTU then why to bother. > > I am seeing significant traffic on the link up to 50,000 packets per > second. The answer is probably not. Unless there is an application using the link other than clustering, you are unlikely to improve matters a great deal. The important thing is latency so far as locking traffic goes, not the overall throughput. If you have issues with lots of network traffic, that might indicate contention problems on certain inodes, and that can usually be solved by looking carefully at how the application(s) use the filesystem rather than by altering the networking side of things. If on the other hand you are using iSCSI and/or exporting the GFS2 filesystem via NFS, then using jumbo frames might be an option worth exploring, Steve. From pradhanparas at gmail.com Tue Apr 6 15:13:58 2010 From: pradhanparas at gmail.com (Paras pradhan) Date: Tue, 6 Apr 2010 10:13:58 -0500 Subject: [Linux-cluster] RHEL 5.5 Crash in gfs2 In-Reply-To: <1270547074.2594.55.camel@localhost> References: <4BB8FD71.3080300@plato.cgl.ucsf.edu> <1270547074.2594.55.camel@localhost> Message-ID: Curious to know. Is this the issue with this particular version of kernel ie kernel 2.6.18-194.el5 or part of new gfs packages too? Reboot to the older kernel works in this case? Paras. 
On Tue, Apr 6, 2010 at 4:44 AM, Steven Whitehouse wrote: > Hi, > > Can you open a bugzilla about this? Thanks, > > Steve. > > On Sun, 2010-04-04 at 13:58 -0700, Scooter Morris wrote: > > Hi all, > > We recently upgraded to 5.5 (kernel 2.6.18-194.el5) to get some of > > the gfs2 fixes on a 3 node cluster, but crashed two days later with the > > following stack trace: > > > > [2010-04-04 10:28:48]Unable to handle kernel NULL pointer dereference at > > 0000000000000078 RIP: ^M > > [2010-04-04 10:28:48] [] > :gfs2:revoke_lo_add+0x1a/0x32^M > > [2010-04-04 10:28:48]PGD 7d4297067 PUD 13a24c067 PMD 0 ^M > > [2010-04-04 10:28:48]Oops: 0002 [1] SMP ^M > > [2010-04-04 10:28:48]last sysfs file: > > > /devices/pci0000:00/0000:00:01.0/0000:03:00.0/0000:04:01.0/0000:07:00.0/0000:08:00.0/irq^M > > [2010-04-04 10:28:48]CPU 8 ^M > > [2010-04-04 10:28:48]Modules linked in: ipt_MASQUERADE iptable_nat > > ip_nat bridge autofs4 hidp l2cap bluetooth lock_dlm gfs2 dlm configfs > > lockd sunrpc ip_conntrack_netbios_ns xt_state ip_conntrack nfnetlink > > xt_tcpudp ipt_REJECT iptable_filter ip_tables arpt_mangle > > arptable_filter arp_tables x_tables ib_iser libiscsi2 > > scsi_transport_iscsi2 scsi_transport_iscsi ib_srp rds ib_sdp ib_ipoib > > ipoib_helper ipv6 xfrm_nalgo crypto_api rdma_ucm rdma_cm ib_ucm > > ib_uverbs ib_umad ib_cm iw_cm ib_addr ib_sa ib_mad ib_core > > dm_round_robin dm_multipath scsi_dh video backlight sbs power_meter > > hwmon i2c_ec i2c_core dell_wmi wmi button battery asus_acpi > > acpi_memhotplug ac parport_pc lp parport sg ide_cd bnx2 cdrom hpilo > > serio_raw pcspkr dm_raid45 dm_message dm_region_hash dm_mem_cache > > dm_snapshot dm_zero dm_mirror dm_log dm_mod qla2xxx scsi_transport_fc > > ata_piix libata shpchp cciss sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd > > ehci_hcd^M > > [2010-04-04 10:28:49]Pid: 795, comm: kswapd0 Not tainted 2.6.18-194.el5 > #1^M > > [2010-04-04 10:28:49]RIP: 0010:[] > > [] :gfs2:revoke_lo_add+0x1a/0x32^M > > [2010-04-04 10:28:49]RSP: 0018:ffff81082efcdae8 EFLAGS: 00010282^M > > [2010-04-04 10:28:49]RAX: 0000000000000000 RBX: ffff810256e037f0 RCX: > > ffff8100207fd180^M > > [2010-04-04 10:28:49]RDX: ffff81051abdf630 RSI: ffff810819032720 RDI: > > ffff810819032000^M > > [2010-04-04 10:28:49]RBP: ffff81051abdf610 R08: ffff81011cb31b06 R09: > > ffff81082efcdb20^M > > [2010-04-04 10:28:49]R10: ffff8101135d8330 R11: ffffffff887dc3b9 R12: > > ffff810819032000^M > > [2010-04-04 10:28:49]R13: 0000000000000000 R14: ffff810256e037f0 R15: > > ffff810819032000^M > > [2010-04-04 10:28:50]FS: 0000000000000000(0000) > > GS:ffff81011cb319c0(0000) knlGS:0000000000000000^M > > [2010-04-04 10:28:50]CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b^M > > [2010-04-04 10:28:50]CR2: 0000000000000078 CR3: 000000024ed62000 CR4: > > 00000000000006e0^M > > [2010-04-04 10:28:50]Process kswapd0 (pid: 795, threadinfo > > ffff81082efcc000, task ffff81082f5d17a0)^M > > [2010-04-04 10:28:50]Stack: ffffffff887dd88c 000000002efcde10 > > ffff810256e037f0 ffff81011c7fadd8^M > > [2010-04-04 10:28:50] 0000000000000000 0000000000000000 ffffffff887deaf6 > > 000000000000000e^M > > [2010-04-04 10:28:50] ffff81011c7fadd8 00000000000000b0 ffff81082efcdcf0 > > ffff810819032000^M > > [2010-04-04 10:28:50]Call Trace:^M > > [2010-04-04 10:28:50] [] > > :gfs2:gfs2_remove_from_journal+0x11f/0x131^M > > [2010-04-04 10:28:50] [] > > :gfs2:gfs2_invalidatepage+0xea/0x151^M > > [2010-04-04 10:28:50] [] > > :gfs2:gfs2_writepage_common+0x95/0xb1^M > > [2010-04-04 10:28:50] [] > > 
:gfs2:gfs2_jdata_writepage+0x56/0x98^M > > [2010-04-04 10:28:50] [] > > shrink_inactive_list+0x3fd/0x8d8^M > > [2010-04-04 10:28:50] [] __pagevec_release+0x19/0x22^M > > [2010-04-04 10:28:51] [] > shrink_active_list+0x4b4/0x4c4^M > > [2010-04-04 10:28:51] [] shrink_zone+0x127/0x18d^M > > [2010-04-04 10:28:51] [] kswapd+0x323/0x46c^M > > [2010-04-04 10:28:51] [] > > autoremove_wake_function+0x0/0x2e^M > > [2010-04-04 10:28:51] [] > keventd_create_kthread+0x0/0xc4^M > > [2010-04-04 10:28:51] [] kswapd+0x0/0x46c^M > > [2010-04-04 10:28:51] [] > keventd_create_kthread+0x0/0xc4^M > > [2010-04-04 10:28:51] [] kthread+0xfe/0x132^M > > [2010-04-04 10:28:51] [] request_module+0x0/0x14d^M > > [2010-04-04 10:28:51] [] child_rip+0xa/0x11^M > > [2010-04-04 10:28:51] [] > keventd_create_kthread+0x0/0xc4^M > > [2010-04-04 10:28:51] [] kthread+0x0/0x132^M > > [2010-04-04 10:28:51] [] child_rip+0x0/0x11^M > > > > This looks exactly like bug 437803, but that was closed early last > > year. Does anyone have any ideas what might be going on? I > > double-checked, and we definitely do not have the old kmod-gfs2 > installed. > > > > -- scooter > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From scooter at cgl.ucsf.edu Tue Apr 6 15:36:40 2010 From: scooter at cgl.ucsf.edu (Scooter Morris) Date: Tue, 06 Apr 2010 08:36:40 -0700 Subject: [Linux-cluster] RHEL 5.5 Crash in gfs2 In-Reply-To: References: <4BB8FD71.3080300@plato.cgl.ucsf.edu> <1270547074.2594.55.camel@localhost> Message-ID: <4BBB5508.7030106@cgl.ucsf.edu> No, we had seen this crash on the older kernel and were hoping that the new kernel (which has a number of gfs2 fixes in it) would correct it. Apparently, it didn't. The bugzilla entry is #579801 -- scooter On 04/06/2010 08:13 AM, Paras pradhan wrote: > Curious to know. > > > Is this the issue with this particular version of kernel ie kernel > 2.6.18-194.el5 or part of new gfs packages too? Reboot to the older > kernel works in this case? > > Paras. > > > On Tue, Apr 6, 2010 at 4:44 AM, Steven Whitehouse > wrote: > > Hi, > > Can you open a bugzilla about this? Thanks, > > Steve. 
> > On Sun, 2010-04-04 at 13:58 -0700, Scooter Morris wrote: > > Hi all, > > We recently upgraded to 5.5 (kernel 2.6.18-194.el5) to get > some of > > the gfs2 fixes on a 3 node cluster, but crashed two days later > with the > > following stack trace: > > > > [2010-04-04 10:28:48]Unable to handle kernel NULL pointer > dereference at > > 0000000000000078 RIP: ^M > > [2010-04-04 10:28:48] [] > :gfs2:revoke_lo_add+0x1a/0x32^M > > [2010-04-04 10:28:48]PGD 7d4297067 PUD 13a24c067 PMD 0 ^M > > [2010-04-04 10:28:48]Oops: 0002 [1] SMP ^M > > [2010-04-04 10:28:48]last sysfs file: > > > /devices/pci0000:00/0000:00:01.0/0000:03:00.0/0000:04:01.0/0000:07:00.0/0000:08:00.0/irq^M > > [2010-04-04 10:28:48]CPU 8 ^M > > [2010-04-04 10:28:48]Modules linked in: ipt_MASQUERADE iptable_nat > > ip_nat bridge autofs4 hidp l2cap bluetooth lock_dlm gfs2 dlm > configfs > > lockd sunrpc ip_conntrack_netbios_ns xt_state ip_conntrack nfnetlink > > xt_tcpudp ipt_REJECT iptable_filter ip_tables arpt_mangle > > arptable_filter arp_tables x_tables ib_iser libiscsi2 > > scsi_transport_iscsi2 scsi_transport_iscsi ib_srp rds ib_sdp > ib_ipoib > > ipoib_helper ipv6 xfrm_nalgo crypto_api rdma_ucm rdma_cm ib_ucm > > ib_uverbs ib_umad ib_cm iw_cm ib_addr ib_sa ib_mad ib_core > > dm_round_robin dm_multipath scsi_dh video backlight sbs power_meter > > hwmon i2c_ec i2c_core dell_wmi wmi button battery asus_acpi > > acpi_memhotplug ac parport_pc lp parport sg ide_cd bnx2 cdrom hpilo > > serio_raw pcspkr dm_raid45 dm_message dm_region_hash dm_mem_cache > > dm_snapshot dm_zero dm_mirror dm_log dm_mod qla2xxx > scsi_transport_fc > > ata_piix libata shpchp cciss sd_mod scsi_mod ext3 jbd uhci_hcd > ohci_hcd > > ehci_hcd^M > > [2010-04-04 10:28:49]Pid: 795, comm: kswapd0 Not tainted > 2.6.18-194.el5 #1^M > > [2010-04-04 10:28:49]RIP: 0010:[] > > [] :gfs2:revoke_lo_add+0x1a/0x32^M > > [2010-04-04 10:28:49]RSP: 0018:ffff81082efcdae8 EFLAGS: 00010282^M > > [2010-04-04 10:28:49]RAX: 0000000000000000 RBX: ffff810256e037f0 > RCX: > > ffff8100207fd180^M > > [2010-04-04 10:28:49]RDX: ffff81051abdf630 RSI: ffff810819032720 > RDI: > > ffff810819032000^M > > [2010-04-04 10:28:49]RBP: ffff81051abdf610 R08: ffff81011cb31b06 > R09: > > ffff81082efcdb20^M > > [2010-04-04 10:28:49]R10: ffff8101135d8330 R11: ffffffff887dc3b9 > R12: > > ffff810819032000^M > > [2010-04-04 10:28:49]R13: 0000000000000000 R14: ffff810256e037f0 > R15: > > ffff810819032000^M > > [2010-04-04 10:28:50]FS: 0000000000000000(0000) > > GS:ffff81011cb319c0(0000) knlGS:0000000000000000^M > > [2010-04-04 10:28:50]CS: 0010 DS: 0018 ES: 0018 CR0: > 000000008005003b^M > > [2010-04-04 10:28:50]CR2: 0000000000000078 CR3: 000000024ed62000 > CR4: > > 00000000000006e0^M > > [2010-04-04 10:28:50]Process kswapd0 (pid: 795, threadinfo > > ffff81082efcc000, task ffff81082f5d17a0)^M > > [2010-04-04 10:28:50]Stack: ffffffff887dd88c 000000002efcde10 > > ffff810256e037f0 ffff81011c7fadd8^M > > [2010-04-04 10:28:50] 0000000000000000 0000000000000000 > ffffffff887deaf6 > > 000000000000000e^M > > [2010-04-04 10:28:50] ffff81011c7fadd8 00000000000000b0 > ffff81082efcdcf0 > > ffff810819032000^M > > [2010-04-04 10:28:50]Call Trace:^M > > [2010-04-04 10:28:50] [] > > :gfs2:gfs2_remove_from_journal+0x11f/0x131^M > > [2010-04-04 10:28:50] [] > > :gfs2:gfs2_invalidatepage+0xea/0x151^M > > [2010-04-04 10:28:50] [] > > :gfs2:gfs2_writepage_common+0x95/0xb1^M > > [2010-04-04 10:28:50] [] > > :gfs2:gfs2_jdata_writepage+0x56/0x98^M > > [2010-04-04 10:28:50] [] > > shrink_inactive_list+0x3fd/0x8d8^M > > [2010-04-04 
10:28:50] [] > __pagevec_release+0x19/0x22^M > > [2010-04-04 10:28:51] [] > shrink_active_list+0x4b4/0x4c4^M > > [2010-04-04 10:28:51] [] shrink_zone+0x127/0x18d^M > > [2010-04-04 10:28:51] [] kswapd+0x323/0x46c^M > > [2010-04-04 10:28:51] [] > > autoremove_wake_function+0x0/0x2e^M > > [2010-04-04 10:28:51] [] > keventd_create_kthread+0x0/0xc4^M > > [2010-04-04 10:28:51] [] kswapd+0x0/0x46c^M > > [2010-04-04 10:28:51] [] > keventd_create_kthread+0x0/0xc4^M > > [2010-04-04 10:28:51] [] kthread+0xfe/0x132^M > > [2010-04-04 10:28:51] [] > request_module+0x0/0x14d^M > > [2010-04-04 10:28:51] [] child_rip+0xa/0x11^M > > [2010-04-04 10:28:51] [] > keventd_create_kthread+0x0/0xc4^M > > [2010-04-04 10:28:51] [] kthread+0x0/0x132^M > > [2010-04-04 10:28:51] [] child_rip+0x0/0x11^M > > > > This looks exactly like bug 437803, but that was closed early last > > year. Does anyone have any ideas what might be going on? I > > double-checked, and we definitely do not have the old kmod-gfs2 > installed. > > > > -- scooter > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: From bernardchew at gmail.com Wed Apr 7 10:52:51 2010 From: bernardchew at gmail.com (Bernard Chew) Date: Wed, 7 Apr 2010 18:52:51 +0800 Subject: [Linux-cluster] "openais[XXXX]" [TOTEM] Retransmit List: XXXXX" in /var/log/messages Message-ID: Hi all, I noticed "openais[XXXX]" [TOTEM] Retransmit List: XXXXX" repeated every few hours in /var/log/messages. What does the message mean and is it normal? Will this cause fencing to take place eventually? Thank you in advance. Regards, Bernard Chew From ccaulfie at redhat.com Wed Apr 7 15:18:28 2010 From: ccaulfie at redhat.com (Christine Caulfield) Date: Wed, 07 Apr 2010 16:18:28 +0100 Subject: [Linux-cluster] Listing openAIS parameters on RHEL Cluster Suite 5 In-Reply-To: <036B68E61A28CA49AC2767596576CD596906D8E186@GVW1113EXC.americas.hpqcorp.net> References: <036B68E61A28CA49AC2767596576CD596906CE679D@GVW1113EXC.americas.hpqcorp.net> <4BBAE57F.2000601@redhat.com> <036B68E61A28CA49AC2767596576CD596906D8E186@GVW1113EXC.americas.hpqcorp.net> Message-ID: <4BBCA244.20405@redhat.com> On 07/04/10 04:33, Jankowski, Chris wrote: > Chrissie, > > Thank you for the explanation. > > With the expected_nodes, I have the parameter specified explicitly in cluster.conf. The value is 3 and the cluster has 2 nodes and a quorum disk: > > > > It is still not listed by the openais-confdb-display command. > > Is is how it should be? > Where in cluster.conf do you have that statement ? Can you check that "cman_tool status" is showing the right value too ? it might be worth attaching your cluster.conf file. 
Chrissie > > -----Original Message----- > From: Christine Caulfield [mailto:ccaulfie at redhat.com] > Sent: Tuesday, 6 April 2010 17:41 > To: linux clustering > Cc: Jankowski, Chris > Subject: Re: [Linux-cluster] Listing openAIS parameters on RHEL Cluster Suite 5 > > On 02/04/10 12:20, Jankowski, Chris wrote: >> Hi, >> >> As per Red Hat Knowledgebase note 18886 on RHEL 5.4 I should be able to get the current in-memory values of the openAIS paramemters by running the following commands: >> >> # openais-confdb-display >> totem.version = '2' >> totem.secauth = '1' >> # openais-confdb-display totem token >> totem.token = '10000' >> # openais-confdb-display totem consensustotem.consensus = '4800' >> # openais-confdb-display totem token_retransmits_before_loss_const >> totem.token_retransmits_before_loss_const = '20' >> # openais-confdb-display cman quorum_dev_poll cman.quorum_dev_poll = >> '40000' >> # openais-confdb-display cman expected_votes cman.expected_votes = '3' >> # openais-confdb-display cman two_node cman.two_node = '1' >> >> On my 5.4 cluster it works for the first 4 commands, but for the last 3 commands I get, respectively: >> >> Could not get "quorum _dev_poll" :1 >> Could not get "expected_votes" :1 >> Could not get "two_node" :1 >> >> Anybody knows what is going on here? >> > > It means that those values are not in the configuration object database because they aren't in cluster.conf. From that you can infer that the defaults are in place. ie: > > quorum_dev_poll = 10000 > expected_votes = two_node = 0 > > Chrissie From lhh at redhat.com Wed Apr 7 15:38:09 2010 From: lhh at redhat.com (Lon Hohberger) Date: Wed, 07 Apr 2010 11:38:09 -0400 Subject: [Linux-cluster] why does ip.sh launch rdisc ? In-Reply-To: <819385.7710.qm@web111711.mail.gq1.yahoo.com> References: <819385.7710.qm@web111711.mail.gq1.yahoo.com> Message-ID: <1270654689.27198.12.camel@localhost.localdomain> On Mon, 2010-04-05 at 04:01 -0700, Celso K. Webber wrote: > I'd rather prefer the rdisc invocation would be an option in cluster.conf, not a default behaviour. > Please file a bug and we can do this. -- Lon From lhh at redhat.com Wed Apr 7 15:39:18 2010 From: lhh at redhat.com (Lon Hohberger) Date: Wed, 07 Apr 2010 11:39:18 -0400 Subject: [Linux-cluster] rgmanager and clvm don't work after reboot In-Reply-To: <4BBA3878.3020805@srce.hr> References: <4BA92052.5020806@srce.hr> <4BA94112.7010005@srce.hr> <4BAA2A1C.7020106@srce.hr> <4BB7EEEA.1020706@srce.hr> <4BBA3878.3020805@srce.hr> Message-ID: <1270654758.27198.13.camel@localhost.localdomain> On Mon, 2010-04-05 at 21:22 +0200, Jakov Sosic wrote: > > So no opinions on this one? I still have a locked and unuseable cluster > > which I have to reboot because of this :( And the worst part is problem > > is reproducable - it's enough to leave it a couple of days on, and then > > reboot a single node... > > > > > I see a following error on TTY (alt+ctrl+f1): > > dlm: clvmd: group join failed -512 0 > > If this helps... > > Maybe I should post this to a linux-cluster devel list? To start, you'll want to get a 'group_tool ls' or 'cman_tool services' and see if anything's blocked. -- Lon From lhh at redhat.com Wed Apr 7 15:40:58 2010 From: lhh at redhat.com (Lon Hohberger) Date: Wed, 07 Apr 2010 11:40:58 -0400 Subject: [Linux-cluster] How to restart one resouce in cluster manually but not detected by cluster In-Reply-To: References: Message-ID: <1270654858.27198.14.camel@localhost.localdomain> On Thu, 2010-04-01 at 05:00 +0000, Joseph L. 
Casale wrote: > >How can I restart only one of the resources and do not let the cluster detect the failure? We need to restart only one of the resources but not restart all the resources. > >So restart the who resource group doesn't work for us. > > $ info clusvcadm > > Look for the -Z option, it'll freeze it on the member and prevent status checks. > > Don't forget to unfreeze:) Actually, I recently added that to the wiki here: http://sources.redhat.com/cluster/wiki/ServiceFreeze -- Lon From alfredo.moralejo at roche.com Wed Apr 7 16:16:06 2010 From: alfredo.moralejo at roche.com (Moralejo, Alfredo) Date: Wed, 7 Apr 2010 18:16:06 +0200 Subject: [Linux-cluster] How to restart one resouce in cluster manually but not detected by cluster In-Reply-To: <1270654858.27198.14.camel@localhost.localdomain> References: <1270654858.27198.14.camel@localhost.localdomain> Message-ID: I think rg_test command can be used to start/stop a specific resource without affecting the entire service. i.e.: rg_test test /etc/cluster/cluster.conf start fs export will start the resource of type fs named export. Note that although command is "test" it actually starts the resource. Additionaly the freeze option is very usefull to avoid undesirable behaviors when playing with services and resources. Regards, Alfredo -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Lon Hohberger Sent: Wednesday, April 07, 2010 5:41 PM To: linux clustering Subject: Re: [Linux-cluster] How to restart one resouce in cluster manually but not detected by cluster On Thu, 2010-04-01 at 05:00 +0000, Joseph L. Casale wrote: > >How can I restart only one of the resources and do not let the cluster detect the failure? We need to restart only one of the resources but not restart all the resources. > >So restart the who resource group doesn't work for us. > > $ info clusvcadm > > Look for the -Z option, it'll freeze it on the member and prevent status checks. > > Don't forget to unfreeze:) Actually, I recently added that to the wiki here: http://sources.redhat.com/cluster/wiki/ServiceFreeze -- Lon -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From sdake at redhat.com Wed Apr 7 16:58:03 2010 From: sdake at redhat.com (Steven Dake) Date: Wed, 07 Apr 2010 09:58:03 -0700 Subject: [Linux-cluster] "openais[XXXX]" [TOTEM] Retransmit List: XXXXX" in /var/log/messages In-Reply-To: References: Message-ID: <1270659483.2550.809.camel@localhost.localdomain> On Wed, 2010-04-07 at 18:52 +0800, Bernard Chew wrote: > Hi all, > > I noticed "openais[XXXX]" [TOTEM] Retransmit List: XXXXX" repeated > every few hours in /var/log/messages. What does the message mean and > is it normal? Will this cause fencing to take place eventually? > This means your network environment dropped packets and totem is recovering them. This is normal operation, and in future versions such as corosync no notification is printed when recovery takes place. There is a bug, however, fixed in revision 2122 where if the last packet in the order is lost, and no new packets are unlost after it, the processor will enter a failed to receive state and trigger fencing. Regards -steve > Thank you in advance. 
> > Regards, > Bernard Chew > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From lhh at redhat.com Wed Apr 7 19:57:24 2010 From: lhh at redhat.com (Lon Hohberger) Date: Wed, 07 Apr 2010 15:57:24 -0400 Subject: [Linux-cluster] How to restart one resouce in cluster manually but not detected by cluster In-Reply-To: References: <1270654858.27198.14.camel@localhost.localdomain> Message-ID: <1270670244.27198.16.camel@localhost.localdomain> On Wed, 2010-04-07 at 18:16 +0200, Moralejo, Alfredo wrote: > I think rg_test command can be used to start/stop a specific resource without affecting the entire service. i.e.: > > rg_test test /etc/cluster/cluster.conf start fs export > > will start the resource of type fs named export. Note that although command is "test" it actually starts the resource. > > Additionaly the freeze option is very usefull to avoid undesirable behaviors when playing with services and resources. That's also correct. -- Lon From bergman at merctech.com Wed Apr 7 23:27:25 2010 From: bergman at merctech.com (bergman at merctech.com) Date: Wed, 07 Apr 2010 19:27:25 -0400 Subject: [Linux-cluster] use of netgroup as NFS client resource target? Message-ID: <3314.1270682845@localhost> Is it possible to use netgroups to define NFS shares, rather than per-client? Specifically, when creating an resource of type "NFS Client", can the "target" be a netgroup instead of the name of a single client? On a conventional NFS server, /etc/exports might look something like: /homedirs @servers(rw) /homedirs @desktops(rw) /bindir @servers(ro) /logdir @servers(rw) Under RHCS, it seems as if each member of the netgroups (@servers, @desktops) must be defined as a separate NFS client and set up as a separate resource. This doesn't scale well in environments with significant numbers of NFS clients, particularly if they share subnets with machines that should not have NFS access to the shares. Environment: CentOS 5.4 (2.6.18-164.15.1.el5) RHCS (cman) 2.0.115-1.el5_4.9 Thanks, Mark From Chris.Jankowski at hp.com Thu Apr 8 00:37:57 2010 From: Chris.Jankowski at hp.com (Jankowski, Chris) Date: Thu, 8 Apr 2010 00:37:57 +0000 Subject: [Linux-cluster] Listing openAIS parameters on RHEL Cluster Suite 5 In-Reply-To: <4BBCA244.20405@redhat.com> References: <036B68E61A28CA49AC2767596576CD596906CE679D@GVW1113EXC.americas.hpqcorp.net> <4BBAE57F.2000601@redhat.com> <036B68E61A28CA49AC2767596576CD596906D8E186@GVW1113EXC.americas.hpqcorp.net> <4BBCA244.20405@redhat.com> Message-ID: <036B68E61A28CA49AC2767596576CD596906D8E642@GVW1113EXC.americas.hpqcorp.net> Chrissie, The cluster.conf is attached. I cannot test anything on the cluster anymore, as I no longer have access to it. I am out of this site and on another project. Thanks and regards, Chris -----Original Message----- From: Christine Caulfield [mailto:ccaulfie at redhat.com] Sent: Thursday, 8 April 2010 01:18 To: Jankowski, Chris Cc: linux clustering Subject: Re: [Linux-cluster] Listing openAIS parameters on RHEL Cluster Suite 5 On 07/04/10 04:33, Jankowski, Chris wrote: > Chrissie, > > Thank you for the explanation. > > With the expected_nodes, I have the parameter specified explicitly in cluster.conf. The value is 3 and the cluster has 2 nodes and a quorum disk: > > > > It is still not listed by the openais-confdb-display command. > > Is is how it should be? > Where in cluster.conf do you have that statement ? Can you check that "cman_tool status" is showing the right value too ? 
it might be worth attaching your cluster.conf file. Chrissie > > -----Original Message----- > From: Christine Caulfield [mailto:ccaulfie at redhat.com] > Sent: Tuesday, 6 April 2010 17:41 > To: linux clustering > Cc: Jankowski, Chris > Subject: Re: [Linux-cluster] Listing openAIS parameters on RHEL > Cluster Suite 5 > > On 02/04/10 12:20, Jankowski, Chris wrote: >> Hi, >> >> As per Red Hat Knowledgebase note 18886 on RHEL 5.4 I should be able to get the current in-memory values of the openAIS paramemters by running the following commands: >> >> # openais-confdb-display >> totem.version = '2' >> totem.secauth = '1' >> # openais-confdb-display totem token >> totem.token = '10000' >> # openais-confdb-display totem consensustotem.consensus = '4800' >> # openais-confdb-display totem token_retransmits_before_loss_const >> totem.token_retransmits_before_loss_const = '20' >> # openais-confdb-display cman quorum_dev_poll cman.quorum_dev_poll = >> '40000' >> # openais-confdb-display cman expected_votes cman.expected_votes = '3' >> # openais-confdb-display cman two_node cman.two_node = '1' >> >> On my 5.4 cluster it works for the first 4 commands, but for the last 3 commands I get, respectively: >> >> Could not get "quorum _dev_poll" :1 >> Could not get "expected_votes" :1 >> Could not get "two_node" :1 >> >> Anybody knows what is going on here? >> > > It means that those values are not in the configuration object database because they aren't in cluster.conf. From that you can infer that the defaults are in place. ie: > > quorum_dev_poll = 10000 > expected_votes = two_node = 0 > > Chrissie -------------- next part -------------- A non-text attachment was scrubbed... Name: cluster.conf Type: application/octet-stream Size: 1674 bytes Desc: cluster.conf URL: From jakov.sosic at srce.hr Thu Apr 8 08:28:55 2010 From: jakov.sosic at srce.hr (Jakov Sosic) Date: Thu, 08 Apr 2010 10:28:55 +0200 Subject: [Linux-cluster] rgmanager and clvm don't work after reboot In-Reply-To: <1270654758.27198.13.camel@localhost.localdomain> References: <4BA92052.5020806@srce.hr> <4BA94112.7010005@srce.hr> <4BAA2A1C.7020106@srce.hr> <4BB7EEEA.1020706@srce.hr> <4BBA3878.3020805@srce.hr> <1270654758.27198.13.camel@localhost.localdomain> Message-ID: <4BBD93C7.1050907@srce.hr> On 04/07/2010 05:39 PM, Lon Hohberger wrote: > On Mon, 2010-04-05 at 21:22 +0200, Jakov Sosic wrote: > >>> So no opinions on this one? I still have a locked and unuseable cluster >>> which I have to reboot because of this :( And the worst part is problem >>> is reproducable - it's enough to leave it a couple of days on, and then >>> reboot a single node... >>> >> >> >> I see a following error on TTY (alt+ctrl+f1): >> >> dlm: clvmd: group join failed -512 0 >> >> If this helps... >> >> Maybe I should post this to a linux-cluster devel list? > > To start, you'll want to get a 'group_tool ls' or 'cman_tool services' > and see if anything's blocked. thanx ! i'll try it next time i reboot a node and report it back here.... -- | Jakov Sosic | ICQ: 28410271 | PGP: 0x965CAE2D | ================================================================= | start fighting cancer -> http://www.worldcommunitygrid.org/ | From matthias at aic.at Thu Apr 8 11:45:33 2010 From: matthias at aic.at (Matthias Leopold) Date: Thu, 08 Apr 2010 13:45:33 +0200 Subject: [Linux-cluster] joining an "old" cluster Message-ID: <4BBDC1DD.3040005@aic.at> hi, i inherited a 3 node cluster which was built "by hand" around 4 years ago. 
unfortunately its nearly impossible to update any cluster software on the nodes. now i want to add a new node using a stock RHEL release. in redhat cluster faq it says: "You can't mix RHEL4 U1 and U2 systems in a cluster because there were changes between U1 and U2 that changed the format of internal messages that are sent around the cluster." what are these changes? how do i find out which RHEL4 version to use for a new node in my cluster? what i know about the existing nodes: cman_tool -V gives 1.0.0. cman, dlm and gfs where built on 2006-01-19 from cvs OS is Fedora Core release 4 kernel is 2.6.14 thx for advice -- Mit freundlichen Gr?ssen Matthias Leopold System & Network Administration Streams Telecommunications GmbH Universitaetsstrasse 10/7, 1090 Vienna, Austria tel: +43 1 40159113 fax: +43 1 40159300 ------------------------------------------------ From ricardo at fedoraproject.org Fri Apr 9 05:02:37 2010 From: ricardo at fedoraproject.org (=?UTF-8?Q?Ricardo_Arg=C3=BCello?=) Date: Fri, 9 Apr 2010 00:02:37 -0500 Subject: [Linux-cluster] GFS2 and D state HTTPD processes In-Reply-To: References: <1267520814.3405.2.camel@localhost> Message-ID: Looks like this bug: GFS2 - probably lost glock call back https://bugzilla.redhat.com/show_bug.cgi?id=498976 This is fixed in the kernel included in RHEL 5.5. Do a "yum update" to fix it. Ricardo Arguello On Tue, Mar 2, 2010 at 6:10 AM, Emilio Arjona wrote: > Thanks for your response, Steve. > > 2010/3/2 Steven Whitehouse : >> Hi, >> >> On Fri, 2010-02-26 at 16:52 +0100, Emilio Arjona wrote: >>> Hi, >>> >>> we are experiencing some problems commented in an old thread: >>> >>> http://www.mail-archive.com/linux-cluster at redhat.com/msg07091.html >>> >>> We have 3 clustered servers under Red Hat 5.4 accessing a GFS2 resource. >>> >>> fstab options: >>> /dev/vg_cluster/lv_cluster /opt/datacluster gfs2 >>> defaults,noatime,nodiratime,noquota 0 0 >>> >>> GFS options: >>> plock_rate_limit="0" >>> plock_ownership=1 >>> >>> httpd processes run into D status sometimes and the only solution is >>> hard reset the affected server. >>> >>> Can anyone give me some hints to diagnose the problem? >>> >>> Thanks :) >>> >> Can you give me a rough idea of what the actual workload is and how it >> is distributed amoung the director(y/ies) ? > > We had problems with php sessions in the past but we fixed it by > configuring php to store the sessions in the database instead of in > the GFS filesystem. Now, we're having problems with files and > directories in the "data" folder of Moodle LMS. > > "lsof -p" returned a i/o operation over the same folder in 2/3 nodes, > we did a hard reset of these nodes but some hours after the CPU load > grew up again, specially in the node that wasn't rebooted. We decided > to reboot (v?a ssh) this node, then the CPU load went down to normal > values in all nodes. > > I don't think the system's load is high enough to produce concurrent > access problems. It's more likely to be some misconfiguration, in > fact, we changed some GFS2 options to non default values to increase > performance (http://www.linuxdynasty.org/howto-increase-gfs2-performance-in-a-cluster.html). > >> >> This is often down to contention on glocks (one per inode) and maybe >> because there is a process of processes writing a file or directory >> which is in use (either read-only or writable) by other processes. 
>> >> If you are using php, then you might have to strace it to find out what >> it is really doing, > > Ok, we will try to strace the D processes and post the results. Hope > we find something!! > >> >> Steve. >> >>> -- >>> >>> Emilio Arjona. >>> >>> -- >>> Linux-cluster mailing list >>> Linux-cluster at redhat.com >>> https://www.redhat.com/mailman/listinfo/linux-cluster >> >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> > > > > -- > Emilio Arjona. > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From matthias at aic.at Fri Apr 9 08:28:34 2010 From: matthias at aic.at (Matthias Leopold) Date: Fri, 09 Apr 2010 10:28:34 +0200 Subject: [Linux-cluster] mounting gfs volumes outside a cluster? Message-ID: <4BBEE532.2090408@aic.at> hi, is it possible to mount a gfs volume readonly from outside the cluster (while the cluster is up and all nodes do I/O)? -- Mit freundlichen Gr?ssen Matthias Leopold System & Network Administration Streams Telecommunications GmbH Universitaetsstrasse 10/7, 1090 Vienna, Austria tel: +43 1 40159113 fax: +43 1 40159300 ------------------------------------------------ From bernardchew at gmail.com Fri Apr 9 08:51:52 2010 From: bernardchew at gmail.com (Bernard Chew) Date: Fri, 9 Apr 2010 16:51:52 +0800 Subject: [Linux-cluster] "openais[XXXX]" [TOTEM] Retransmit List: XXXXX" in /var/log/messages In-Reply-To: <1270659483.2550.809.camel@localhost.localdomain> References: <1270659483.2550.809.camel@localhost.localdomain> Message-ID: > On Thu, Apr 8, 2010 at 12:58 AM, Steven Dake wrote: > On Wed, 2010-04-07 at 18:52 +0800, Bernard Chew wrote: >> Hi all, >> >> I noticed "openais[XXXX]" [TOTEM] Retransmit List: XXXXX" repeated >> every few hours in /var/log/messages. What does the message mean and >> is it normal? Will this cause fencing to take place eventually? >> > This means your network environment dropped packets and totem is > recovering them. ?This is normal operation, and in future versions such > as corosync no notification is printed when recovery takes place. > > There is a bug, however, fixed in revision 2122 where if the last packet > in the order is lost, and no new packets are unlost after it, the > processor will enter a failed to receive state and trigger fencing. > > Regards > -steve >> Thank you in advance. >> >> Regards, >> Bernard Chew >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > Thank you for the reply Steve! The cluster was running fine until last week where 3 nodes restarted suddenly. I suspect fencing took place since all 3 servers restarted at the same time but I couldn't find any fence related entries in the log. I am guessing we hit the bug you mentioned? Will the log indicate fencing has taken place with regards to the bug you mentioned? Also I noticed the message "kernel: clustat[28328]: segfault at 0000000000000024 rip 0000003b31c75bc0 rsp 00007fff955cb098 error 4" occasionally; is this related to the TOTEM message or they indicate another problem? Regards, Bernard Chew From swhiteho at redhat.com Fri Apr 9 09:01:51 2010 From: swhiteho at redhat.com (Steven Whitehouse) Date: Fri, 09 Apr 2010 10:01:51 +0100 Subject: [Linux-cluster] mounting gfs volumes outside a cluster? 
In-Reply-To: <4BBEE532.2090408@aic.at> References: <4BBEE532.2090408@aic.at> Message-ID: <1270803711.2753.4.camel@localhost> Hi, On Fri, 2010-04-09 at 10:28 +0200, Matthias Leopold wrote: > hi, > > is it possible to mount a gfs volume readonly from outside the cluster > (while the cluster is up and all nodes do I/O)? > No. There is a "spectator" mount where a read only node can mount without having a journal assigned to it, but it must still be part of the cluster, Steve. From bernardchew at gmail.com Fri Apr 9 09:04:33 2010 From: bernardchew at gmail.com (Bernard Chew) Date: Fri, 9 Apr 2010 17:04:33 +0800 Subject: [Linux-cluster] mounting gfs volumes outside a cluster? In-Reply-To: <4BBEE532.2090408@aic.at> References: <4BBEE532.2090408@aic.at> Message-ID: > On Fri, Apr 9, 2010 at 4:28 PM, Matthias Leopold wrote: > hi, > > is it possible to mount a gfs volume readonly from outside the cluster > (while the cluster is up and all nodes do I/O)? > > -- > Mit freundlichen Gr?ssen > > Matthias Leopold > System & Network Administration > > Streams Telecommunications GmbH > Universitaetsstrasse 10/7, 1090 Vienna, Austria > > tel: +43 1 40159113 > fax: +43 1 40159300 > ------------------------------------------------ > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > Hi Matthias, I am not the expert here but how about exporting the GFS volume using NFS? Regards, Bernard From matthias at aic.at Fri Apr 9 10:31:55 2010 From: matthias at aic.at (Matthias Leopold) Date: Fri, 09 Apr 2010 12:31:55 +0200 Subject: [Linux-cluster] mounting gfs volumes outside a cluster? In-Reply-To: References: <4BBEE532.2090408@aic.at> Message-ID: <4BBF021B.3090205@aic.at> Bernard Chew schrieb: >> On Fri, Apr 9, 2010 at 4:28 PM, Matthias Leopold wrote: >> hi, >> >> is it possible to mount a gfs volume readonly from outside the cluster >> (while the cluster is up and all nodes do I/O)? >> >> -- >> Mit freundlichen Gr?ssen >> >> Matthias Leopold >> System & Network Administration >> >> Streams Telecommunications GmbH >> Universitaetsstrasse 10/7, 1090 Vienna, Austria >> >> tel: +43 1 40159113 >> fax: +43 1 40159300 >> ------------------------------------------------ >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> > > Hi Matthias, > > I am not the expert here but how about exporting the GFS volume using NFS? > > Regards, > Bernard > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster that's an nice idea,thx at second glance this indeed seems to be a viable solution regards, matthias From frank at si.ct.upc.edu Fri Apr 9 11:39:21 2010 From: frank at si.ct.upc.edu (frank) Date: Fri, 09 Apr 2010 13:39:21 +0200 Subject: [Linux-cluster] Clustering and Cluster-Storage channels Message-ID: <4BBF11E9.803@si.ct.upc.edu> Hi, we have several machines with RH 5.4 and we use Cluster and Cluster-Storage (because we use GFS). We also have troubles in updates because we don't know how to subscribe out machines to that channels. From RHN, in "Software Channel Subscriptions" part, we see: Release Channels for Red Hat Enterprise Linux 5 for x86_64 RHEL FasTrack (v. 5 for 64-bit x86_64) (Channel Details) Consumes a regular entitlement (13 available) RHEL Optional Productivity Apps (v. 5 for 64-bit x86_64) (Channel Details) Consumes a regular entitlement (13 available) RHEL Supplementary (v. 
5 for 64-bit x86_64) (Channel Details) Consumes a regular entitlement (13 available) RHEL Virtualization (v. 5 for 64-bit x86_64) (Channel Details) Consumes a regular entitlement (9 available) Red Hat Network Tools for RHEL Server (v.5 64-bit x86_64) (Channel Details) Consumes a regular entitlement (10 available) BETA Channels for Red Hat Enterprise Linux 5 for x86_64 RHEL Optional Productivity Apps (v. 5 for 64-bit x86_64) Beta (Channel Details) Consumes a regular entitlement (13 available) RHEL Supplementary (v. 5 for 64-bit x86_64) Beta (Channel Details) Consumes a regular entitlement (13 available) RHEL Virtualization (v. 5 for 64-bit x86_64) Beta (Channel Details) Consumes a regular entitlement (13 available) Red Hat Enterprise Linux (v. 5 for 64-bit x86_64) Beta (Channel Details) Consumes a regular entitlement (13 available) Additional Services Channels for Red Hat Enterprise Linux 5 for x86_64 RHEL Hardware Certification (v. 5 for 64-bit x86_64) (Channel Details) Consumes a regular entitlement (13 available) Additional Services BETA Channels for Red Hat Enterprise Linux 5 for x86_64 RHEL Cluster-Storage (v. 5 for 64-bit x86_64) Beta (Channel Details) Consumes a regular entitlement (10 available) RHEL Clustering (v. 5 for 64-bit x86_64) Beta (Channel Details) Consumes a regular entitlement (10 available) RHEL Hardware Certification (v. 5 for 64-bit x86_64) Beta (Channel Details) Consumes a regular entitlement (13 available) Thera are cluster beta channels, but not the release ones. How can we subscribe systems to them? Thanks and Regards. Frank -- Aquest missatge ha estat analitzat per MailScanner a la cerca de virus i d'altres continguts perillosos, i es considera que est? net. From fdinitto at redhat.com Fri Apr 9 12:05:27 2010 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Fri, 09 Apr 2010 14:05:27 +0200 Subject: [Linux-cluster] Cluster 3.0.10 stable release Message-ID: <4BBF1807.6030004@redhat.com> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 The cluster team and its community are proud to announce the 3.0.10 stable release from the STABLE3 branch. This release contains a few major bug fixes. We strongly recommend people to update their clusters. In order to build/run the 3.0.10 release you will need: - - corosync 1.2.1 - - openais 1.1.2 - - linux kernel 2.6.31 (only for GFS1 users) The new source tarball can be downloaded here: https://fedorahosted.org/releases/c/l/cluster/cluster-3.0.10.tar.bz2 To report bugs or issues: https://bugzilla.redhat.com/ Would you like to meet the cluster team or members of its community? Join us on IRC (irc.freenode.net #linux-cluster) and share your experience with other sysadministrators or power users. Thanks/congratulations to all people that contributed to achieve this great milestone. 
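For anyone building from source, a typical sequence is something along these lines (a rough sketch only; the exact configure options depend on your distribution and kernel, so check the build notes shipped in the tarball):

tar xjf cluster-3.0.10.tar.bz2
cd cluster-3.0.10
./configure
make
make install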
Happy clustering, Fabio Under the hood (from 3.0.9): Abhijith Das (5): gfs2_quota: Fix gfs2_quota to handle boundary conditions gfs2_convert: gfs2_convert segfaults when converting filesystems of blocksize 512 bytes gfs2_convert: gfs2_convert uses too much memory for jdata conversion gfs2_convert: Fix conversion of gfs1 CDPNs gfs2_convert: Doesn't convert indirectly-pointed extended attributes correctly Bob Peterson (3): gfs2: GFS2 utilities should make use of exported device topology cman: gfs_controld dm suspend hangs withdrawn GFS file system GFS2: fsck.gfs2 segfault - osi_tree "each_safe" patch Christine Caulfield (3): cman: Add improved cluster_id hash function cman: move fnv hash function into its own file ccs: Remove non-existant commands from ccs_tool man page. David Teigland (7): dlm_controld/libdlmcontrol/dlm_tool: separate plock debug buffer dlm_controld: add more fs_notified debugging dlm_controld/gfs_controld: avoid full plock unlock when no resource exists dlm_controld: add plock checkpoint signatures dlm_controld: set last_plock_time for ownership operations dlm_controld: don't skip unlinking checkpoint gfs_controld: set last_plock_time for ownership operations Fabio M. Di Nitto (1): dlm: bump libdlmcontrol sominor Jan Friesse (1): fencing: SNMP fence agents don't fail Lon Hohberger (4): config: Add hash_cluster_id to schema rgmanager: Fix 2+ simultaneous relocation crash rgmanager: Fix memory leaks during relocation rgmanager: Fix tiny memory leak during reconfig Marek 'marx' Grac (1): fencing: Remove 'ipport' option from WTI fence agent cman/daemon/Makefile | 3 +- cman/daemon/cman-preconfig.c | 32 +++- cman/daemon/fnvhash.c | 93 +++++++++ cman/daemon/fnvhash.h | 1 + config/plugins/ldap/99cluster.ldif | 10 +- config/plugins/ldap/ldap-base.csv | 3 +- config/tools/man/ccs_tool.8 | 15 +-- config/tools/xml/cluster.rng.in | 3 + dlm/libdlmcontrol/Makefile | 2 + dlm/libdlmcontrol/libdlmcontrol.h | 1 + dlm/libdlmcontrol/main.c | 5 + dlm/man/dlm_tool.8 | 4 + dlm/tool/main.c | 25 +++- doc/COPYRIGHT | 6 + fence/agents/lib/fencing_snmp.py.py | 13 +- fence/agents/wti/fence_wti.py | 2 +- gfs2/convert/gfs2_convert.c | 377 ++++++++++++++++++++++++++++------ gfs2/fsck/link.c | 2 - gfs2/fsck/pass1b.c | 10 +- gfs2/fsck/pass3.c | 5 +- gfs2/fsck/pass4.c | 5 +- gfs2/libgfs2/device_geometry.c | 38 ++++- gfs2/libgfs2/fs_ops.c | 1 + gfs2/libgfs2/libgfs2.h | 10 + gfs2/mkfs/main_mkfs.c | 69 ++++++- gfs2/quota/check.c | 3 +- gfs2/quota/main.c | 104 +++------- group/dlm_controld/cpg.c | 62 ++++-- group/dlm_controld/dlm_controld.h | 1 + group/dlm_controld/dlm_daemon.h | 44 +++- group/dlm_controld/main.c | 60 +++++- group/dlm_controld/plock.c | 241 ++++++++++++++-------- group/gfs_controld/plock.c | 12 +- group/gfs_controld/util.c | 3 +- rgmanager/src/daemons/event_config.c | 2 + rgmanager/src/daemons/rg_state.c | 2 + rgmanager/src/daemons/rg_thread.c | 4 +- 37 files changed, 967 insertions(+), 306 deletions(-) -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.10 (MingW32) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iQIcBAEBAgAGBQJLvxgDAAoJEFA6oBJjVJ+OOzUP/3Wl1ChlhmcjVvClhDyZhI4q aPSTnChG1b40WB7sh7UQsVcD0mwAsPPsgDaZZUlybhWl2LylxZ5xEwu7VWoL8SwJ 8Q4aYT1Svp6jFfvqdmoRFmJfjp+vc3y7Gllx3NP6kLmf62TbTROgbc3X++72IFkf 14DPEonWao2FzKx7MaoZCSttc0djuILd+UNh7EEgqC2lyR2r3tatmCa1i/eT2Pfy fwISqy4ioNie5i5SMO7fS9y4NCLnognMgeuH5iS5EJDUViougWyQSSorI8SQq36f ZRyrrUwuUivT2ylXyz3TgfuojGpRuFy2AC1oBxRsiDOVyMrVFHX4NaP5E18J4qs1 0acYMULOpZYcwgKaLMy6haiYWwfvjFvI71zs4mKijmsWvuPbGTyVx7yxDJJco8SM 
OQBF5holEHqOo4FVekFa6De0GUMjfgmpGhfPTtuw04/ww5pbNp84Y4TzEOsRA9dd H6ak9yLwN4chjyDWRQxHsDnxCf67oqYDZJL5t1QlMauxruGYdXU3xIZRC9E4oYbW +vu+DTbkMGg70xg2MbXH3E7EkGHeJ9EWgiuEh5l4pavrEo14rf80O0dtf+myn8t7 HosKmXjjdnjaVfYNimUH7/0mnISxX2YOO9uzBD6A/X9bqxrxC1Ky6TdI6tFN80dz nH3IJrLomvkmnadhFRqg =Z/pM -----END PGP SIGNATURE----- From cmaiolino at redhat.com Fri Apr 9 13:00:51 2010 From: cmaiolino at redhat.com (Carlos Maiolino) Date: Fri, 9 Apr 2010 10:00:51 -0300 Subject: [Linux-cluster] Clustering and Cluster-Storage channels In-Reply-To: <4BBF11E9.803@si.ct.upc.edu> References: <4BBF11E9.803@si.ct.upc.edu> Message-ID: <20100409130051.GA31186@andromeda.usersys.redhat.com> On Fri, Apr 09, 2010 at 01:39:21PM +0200, frank wrote: > Hi, > we have several machines with RH 5.4 and we use Cluster and > Cluster-Storage (because we use GFS). > We also have troubles in updates because we don't know how to > subscribe out machines to that channels. From RHN, in "Software > Channel Subscriptions" part, we see: > > Release Channels for Red Hat Enterprise Linux 5 for x86_64 > RHEL FasTrack (v. 5 for 64-bit x86_64) (Channel Details) > Consumes a regular entitlement (13 available) > RHEL Optional Productivity Apps (v. 5 for 64-bit x86_64) > (Channel Details) Consumes a regular entitlement (13 available) > RHEL Supplementary (v. 5 for 64-bit x86_64) (Channel Details) > Consumes a regular entitlement (13 available) > RHEL Virtualization (v. 5 for 64-bit x86_64) (Channel Details) > Consumes a regular entitlement (9 available) > Red Hat Network Tools for RHEL Server (v.5 64-bit x86_64) > (Channel Details) Consumes a regular entitlement (10 available) > BETA Channels for Red Hat Enterprise Linux 5 for x86_64 > RHEL Optional Productivity Apps (v. 5 for 64-bit x86_64) Beta > (Channel Details) Consumes a regular entitlement (13 available) > RHEL Supplementary (v. 5 for 64-bit x86_64) Beta (Channel > Details) Consumes a regular entitlement (13 available) > RHEL Virtualization (v. 5 for 64-bit x86_64) Beta (Channel > Details) Consumes a regular entitlement (13 available) > Red Hat Enterprise Linux (v. 5 for 64-bit x86_64) Beta (Channel > Details) Consumes a regular entitlement (13 available) > Additional Services Channels for Red Hat Enterprise Linux 5 for x86_64 > RHEL Hardware Certification (v. 5 for 64-bit x86_64) (Channel > Details) Consumes a regular entitlement (13 available) > Additional Services BETA Channels for Red Hat Enterprise Linux 5 for x86_64 > RHEL Cluster-Storage (v. 5 for 64-bit x86_64) Beta (Channel > Details) Consumes a regular entitlement (10 available) > RHEL Clustering (v. 5 for 64-bit x86_64) Beta (Channel Details) > Consumes a regular entitlement (10 available) > RHEL Hardware Certification (v. 5 for 64-bit x86_64) Beta > (Channel Details) Consumes a regular entitlement (13 available) > > Thera are cluster beta channels, but not the release ones. How can > we subscribe systems to them? > > Thanks and Regards. > > Frank > > -- > Aquest missatge ha estat analitzat per MailScanner > a la cerca de virus i d'altres continguts perillosos, > i es considera que est? net. > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster Hello Frank. I guess is better you to contact Red Hat support, once this looks like a subscription problem than a thecnical problem. 
see you ;) -- --- Best Regards Carlos Eduardo Maiolino From j_w_usa at yahoo.com Fri Apr 9 17:58:12 2010 From: j_w_usa at yahoo.com (John Wong) Date: Fri, 9 Apr 2010 10:58:12 -0700 (PDT) Subject: [Linux-cluster] Linux-cluster Digest, Vol 72, Issue 9 In-Reply-To: Message-ID: <480555.92510.qm@web51905.mail.re2.yahoo.com> I have tried the following: ? nfs-exported?the gfs2 file system,?and a nfs client mounted it.?It worked. ? john ? ------------------------------ Message: 4 Date: Fri, 09 Apr 2010 10:01:51 +0100 From: Steven Whitehouse To: linux clustering Subject: Re: [Linux-cluster] mounting gfs volumes outside a cluster? Message-ID: <1270803711.2753.4.camel at localhost> Content-Type: text/plain; charset="UTF-8" Hi, On Fri, 2010-04-09 at 10:28 +0200, Matthias Leopold wrote: > hi, > > is it possible to mount a gfs volume readonly from outside the cluster > (while the cluster is up and all nodes do I/O)? > No. There is a "spectator" mount where a read only node can mount without having a journal assigned to it, but it must still be part of the cluster, Steve. ------------------------------ --- On Fri, 4/9/10, linux-cluster-request at redhat.com wrote: From: linux-cluster-request at redhat.com Subject: Linux-cluster Digest, Vol 72, Issue 9 To: linux-cluster at redhat.com Date: Friday, April 9, 2010, 9:00 AM Send Linux-cluster mailing list submissions to ??? linux-cluster at redhat.com To subscribe or unsubscribe via the World Wide Web, visit ??? https://www.redhat.com/mailman/listinfo/linux-cluster or, via email, send a message with subject or body 'help' to ??? linux-cluster-request at redhat.com You can reach the person managing the list at ??? linux-cluster-owner at redhat.com When replying, please edit your Subject line so it is more specific than "Re: Contents of Linux-cluster digest..." Today's Topics: ???1. Re: GFS2 and D state HTTPD processes (Ricardo Arg?ello) ???2. mounting gfs volumes outside a cluster? (Matthias Leopold) ???3. Re: "openais[XXXX]" [TOTEM] Retransmit List: XXXXX"??? in ? ? ? /var/log/messages (Bernard Chew) ???4. Re: mounting gfs volumes outside a cluster? (Steven Whitehouse) ???5. Re: mounting gfs volumes outside a cluster? (Bernard Chew) ???6. Re: mounting gfs volumes outside a cluster? (Matthias Leopold) ???7. Clustering and Cluster-Storage channels (frank) ???8. Cluster 3.0.10 stable release (Fabio M. Di Nitto) ???9. Re: Clustering and Cluster-Storage channels (Carlos Maiolino) ---------------------------------------------------------------------- Message: 1 Date: Fri, 9 Apr 2010 00:02:37 -0500 From: Ricardo Arg?ello To: linux clustering Subject: Re: [Linux-cluster] GFS2 and D state HTTPD processes Message-ID: ??? Content-Type: text/plain; charset=UTF-8 Looks like this bug: GFS2 - probably lost glock call back https://bugzilla.redhat.com/show_bug.cgi?id=498976 This is fixed in the kernel included in RHEL 5.5. Do a "yum update" to fix it. Ricardo Arguello On Tue, Mar 2, 2010 at 6:10 AM, Emilio Arjona wrote: > Thanks for your response, Steve. > > 2010/3/2 Steven Whitehouse : >> Hi, >> >> On Fri, 2010-02-26 at 16:52 +0100, Emilio Arjona wrote: >>> Hi, >>> >>> we are experiencing some problems commented in an old thread: >>> >>> http://www.mail-archive.com/linux-cluster at redhat.com/msg07091.html >>> >>> We have 3 clustered servers under Red Hat 5.4 accessing a GFS2 resource. 
>>> >>> fstab options: >>> /dev/vg_cluster/lv_cluster /opt/datacluster gfs2 >>> defaults,noatime,nodiratime,noquota 0 0 >>> >>> GFS options: >>> plock_rate_limit="0" >>> plock_ownership=1 >>> >>> httpd processes run into D status sometimes and the only solution is >>> hard reset the affected server. >>> >>> Can anyone give me some hints to diagnose the problem? >>> >>> Thanks :) >>> >> Can you give me a rough idea of what the actual workload is and how it >> is distributed amoung the director(y/ies) ? > > We had problems with php sessions in the past but we fixed it by > configuring php to store the sessions in the database instead of in > the GFS filesystem. Now, we're having problems with files and > directories in the "data" folder of Moodle LMS. > > "lsof -p" returned a i/o operation over the same folder in 2/3 nodes, > we did a hard reset of these nodes but some hours after the CPU load > grew up again, specially in the node that wasn't rebooted. We decided > to reboot (v?a ssh) this node, then the CPU load went down to normal > values in all nodes. > > I don't think the system's load is high enough to produce concurrent > access problems. It's more likely to be some misconfiguration, in > fact, we changed some GFS2 options to non default values to increase > performance (http://www.linuxdynasty.org/howto-increase-gfs2-performance-in-a-cluster.html). > >> >> This is often down to contention on glocks (one per inode) and maybe >> because there is a process of processes writing a file or directory >> which is in use (either read-only or writable) by other processes. >> >> If you are using php, then you might have to strace it to find out what >> it is really doing, > > Ok, we will try to strace the D processes and post the results. Hope > we find something!! > >> >> Steve. >> >>> -- >>> >>> Emilio Arjona. >>> >>> -- >>> Linux-cluster mailing list >>> Linux-cluster at redhat.com >>> https://www.redhat.com/mailman/listinfo/linux-cluster >> >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> > > > > -- > Emilio Arjona. > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > ------------------------------ Message: 2 Date: Fri, 09 Apr 2010 10:28:34 +0200 From: Matthias Leopold To: linux clustering Subject: [Linux-cluster] mounting gfs volumes outside a cluster? Message-ID: <4BBEE532.2090408 at aic.at> Content-Type: text/plain; charset=ISO-8859-15 hi, is it possible to mount a gfs volume readonly from outside the cluster (while the cluster is up and all nodes do I/O)? -- Mit freundlichen Gr?ssen Matthias Leopold System & Network Administration Streams Telecommunications GmbH Universitaetsstrasse 10/7, 1090 Vienna, Austria tel: +43 1 40159113 fax: +43 1 40159300 ------------------------------------------------ ------------------------------ Message: 3 Date: Fri, 9 Apr 2010 16:51:52 +0800 From: Bernard Chew To: sdake at redhat.com, linux clustering Subject: Re: [Linux-cluster] "openais[XXXX]" [TOTEM] Retransmit List: ??? XXXXX"??? in /var/log/messages Message-ID: ??? Content-Type: text/plain; charset=ISO-8859-1 > On Thu, Apr 8, 2010 at 12:58 AM, Steven Dake wrote: > On Wed, 2010-04-07 at 18:52 +0800, Bernard Chew wrote: >> Hi all, >> >> I noticed "openais[XXXX]" [TOTEM] Retransmit List: XXXXX" repeated >> every few hours in /var/log/messages. What does the message mean and >> is it normal? Will this cause fencing to take place eventually? 
>> > This means your network environment dropped packets and totem is > recovering them. ?This is normal operation, and in future versions such > as corosync no notification is printed when recovery takes place. > > There is a bug, however, fixed in revision 2122 where if the last packet > in the order is lost, and no new packets are unlost after it, the > processor will enter a failed to receive state and trigger fencing. > > Regards > -steve >> Thank you in advance. >> >> Regards, >> Bernard Chew >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > Thank you for the reply Steve! The cluster was running fine until last week where 3 nodes restarted suddenly. I suspect fencing took place since all 3 servers restarted at the same time but I couldn't find any fence related entries in the log. I am guessing we hit the bug you mentioned? Will the log indicate fencing has taken place with regards to the bug you mentioned? Also I noticed the message "kernel: clustat[28328]: segfault at 0000000000000024 rip 0000003b31c75bc0 rsp 00007fff955cb098 error 4" occasionally; is this related to the TOTEM message or they indicate another problem? Regards, Bernard Chew ------------------------------ Message: 4 Date: Fri, 09 Apr 2010 10:01:51 +0100 From: Steven Whitehouse To: linux clustering Subject: Re: [Linux-cluster] mounting gfs volumes outside a cluster? Message-ID: <1270803711.2753.4.camel at localhost> Content-Type: text/plain; charset="UTF-8" Hi, On Fri, 2010-04-09 at 10:28 +0200, Matthias Leopold wrote: > hi, > > is it possible to mount a gfs volume readonly from outside the cluster > (while the cluster is up and all nodes do I/O)? > No. There is a "spectator" mount where a read only node can mount without having a journal assigned to it, but it must still be part of the cluster, Steve. ------------------------------ Message: 5 Date: Fri, 9 Apr 2010 17:04:33 +0800 From: Bernard Chew To: linux clustering Subject: Re: [Linux-cluster] mounting gfs volumes outside a cluster? Message-ID: ??? Content-Type: text/plain; charset=ISO-8859-1 > On Fri, Apr 9, 2010 at 4:28 PM, Matthias Leopold wrote: > hi, > > is it possible to mount a gfs volume readonly from outside the cluster > (while the cluster is up and all nodes do I/O)? > > -- > Mit freundlichen Gr?ssen > > Matthias Leopold > System & Network Administration > > Streams Telecommunications GmbH > Universitaetsstrasse 10/7, 1090 Vienna, Austria > > tel: +43 1 40159113 > fax: +43 1 40159300 > ------------------------------------------------ > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > Hi Matthias, I am not the expert here but how about exporting the GFS volume using NFS? Regards, Bernard ------------------------------ Message: 6 Date: Fri, 09 Apr 2010 12:31:55 +0200 From: Matthias Leopold To: linux clustering Subject: Re: [Linux-cluster] mounting gfs volumes outside a cluster? Message-ID: <4BBF021B.3090205 at aic.at> Content-Type: text/plain; charset=ISO-8859-1 Bernard Chew schrieb: >> On Fri, Apr 9, 2010 at 4:28 PM, Matthias Leopold wrote: >> hi, >> >> is it possible to mount a gfs volume readonly from outside the cluster >> (while the cluster is up and all nodes do I/O)? 
>> >> -- >> Mit freundlichen Gr?ssen >> >> Matthias Leopold >> System & Network Administration >> >> Streams Telecommunications GmbH >> Universitaetsstrasse 10/7, 1090 Vienna, Austria >> >> tel: +43 1 40159113 >> fax: +43 1 40159300 >> ------------------------------------------------ >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> > > Hi Matthias, > > I am not the expert here but how about exporting the GFS volume using NFS? > > Regards, > Bernard > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster that's an nice idea,thx at second glance this indeed seems to be a viable solution regards, matthias ------------------------------ Message: 7 Date: Fri, 09 Apr 2010 13:39:21 +0200 From: frank To: linux-cluster at redhat.com Subject: [Linux-cluster] Clustering and Cluster-Storage channels Message-ID: <4BBF11E9.803 at si.ct.upc.edu> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Hi, we have several machines with RH 5.4 and we use Cluster and Cluster-Storage (because we use GFS). We also have troubles in updates because we don't know how to subscribe out machines to that channels. From RHN, in "Software Channel Subscriptions" part, we see: Release Channels for Red Hat Enterprise Linux 5 for x86_64 ? ???RHEL FasTrack (v. 5 for 64-bit x86_64) (Channel Details)? ??? Consumes a regular entitlement (13 available) ? ???RHEL Optional Productivity Apps (v. 5 for 64-bit x86_64) (Channel Details)? ???Consumes a regular entitlement (13 available) ? ???RHEL Supplementary (v. 5 for 64-bit x86_64) (Channel Details)? ??? Consumes a regular entitlement (13 available) ? ???RHEL Virtualization (v. 5 for 64-bit x86_64) (Channel Details)? ??? Consumes a regular entitlement (9 available) ? ???Red Hat Network Tools for RHEL Server (v.5 64-bit x86_64) (Channel Details)? ???Consumes a regular entitlement (10 available) BETA Channels for Red Hat Enterprise Linux 5 for x86_64 ? ???RHEL Optional Productivity Apps (v. 5 for 64-bit x86_64) Beta (Channel Details)? ???Consumes a regular entitlement (13 available) ? ???RHEL Supplementary (v. 5 for 64-bit x86_64) Beta (Channel Details) ? ???Consumes a regular entitlement (13 available) ? ???RHEL Virtualization (v. 5 for 64-bit x86_64) Beta (Channel Details) ? ???Consumes a regular entitlement (13 available) ? ???Red Hat Enterprise Linux (v. 5 for 64-bit x86_64) Beta (Channel Details)? ???Consumes a regular entitlement (13 available) Additional Services Channels for Red Hat Enterprise Linux 5 for x86_64 ? ???RHEL Hardware Certification (v. 5 for 64-bit x86_64) (Channel Details)? ???Consumes a regular entitlement (13 available) Additional Services BETA Channels for Red Hat Enterprise Linux 5 for x86_64 ? ???RHEL Cluster-Storage (v. 5 for 64-bit x86_64) Beta (Channel Details)? ???Consumes a regular entitlement (10 available) ? ???RHEL Clustering (v. 5 for 64-bit x86_64) Beta (Channel Details)? ??? Consumes a regular entitlement (10 available) ? ???RHEL Hardware Certification (v. 5 for 64-bit x86_64) Beta (Channel Details)? ???Consumes a regular entitlement (13 available) Thera are cluster beta channels, but not the release ones. How can we subscribe systems to them? Thanks and Regards. Frank -- Aquest missatge ha estat analitzat per MailScanner a la cerca de virus i d'altres continguts perillosos, i es considera que est? net. ------------------------------ Message: 8 Date: Fri, 09 Apr 2010 14:05:27 +0200 From: "Fabio M. 
Di Nitto" To: linux clustering ,??? cluster-devel ??? Subject: [Linux-cluster] Cluster 3.0.10 stable release Message-ID: <4BBF1807.6030004 at redhat.com> Content-Type: text/plain; charset=ISO-8859-1 -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 The cluster team and its community are proud to announce the 3.0.10 stable release from the STABLE3 branch. This release contains a few major bug fixes. We strongly recommend people to update their clusters. In order to build/run the 3.0.10 release you will need: - - corosync 1.2.1 - - openais 1.1.2 - - linux kernel 2.6.31 (only for GFS1 users) The new source tarball can be downloaded here: https://fedorahosted.org/releases/c/l/cluster/cluster-3.0.10.tar.bz2 To report bugs or issues: ???https://bugzilla.redhat.com/ Would you like to meet the cluster team or members of its community? ???Join us on IRC (irc.freenode.net #linux-cluster) and share your ???experience? with other sysadministrators or power users. Thanks/congratulations to all people that contributed to achieve this great milestone. Happy clustering, Fabio Under the hood (from 3.0.9): Abhijith Das (5): ? ? ? gfs2_quota: Fix gfs2_quota to handle boundary conditions ? ? ? gfs2_convert: gfs2_convert segfaults when converting filesystems of blocksize 512 bytes ? ? ? gfs2_convert: gfs2_convert uses too much memory for jdata conversion ? ? ? gfs2_convert: Fix conversion of gfs1 CDPNs ? ? ? gfs2_convert: Doesn't convert indirectly-pointed extended attributes correctly Bob Peterson (3): ? ? ? gfs2: GFS2 utilities should make use of exported device topology ? ? ? cman: gfs_controld dm suspend hangs withdrawn GFS file system ? ? ? GFS2: fsck.gfs2 segfault - osi_tree "each_safe" patch Christine Caulfield (3): ? ? ? cman: Add improved cluster_id hash function ? ? ? cman: move fnv hash function into its own file ? ? ? ccs: Remove non-existant commands from ccs_tool man page. David Teigland (7): ? ? ? dlm_controld/libdlmcontrol/dlm_tool: separate plock debug buffer ? ? ? dlm_controld: add more fs_notified debugging ? ? ? dlm_controld/gfs_controld: avoid full plock unlock when no resource exists ? ? ? dlm_controld: add plock checkpoint signatures ? ? ? dlm_controld: set last_plock_time for ownership operations ? ? ? dlm_controld: don't skip unlinking checkpoint ? ? ? gfs_controld: set last_plock_time for ownership operations Fabio M. Di Nitto (1): ? ? ? dlm: bump libdlmcontrol sominor Jan Friesse (1): ? ? ? fencing: SNMP fence agents don't fail Lon Hohberger (4): ? ? ? config: Add hash_cluster_id to schema ? ? ? rgmanager: Fix 2+ simultaneous relocation crash ? ? ? rgmanager: Fix memory leaks during relocation ? ? ? rgmanager: Fix tiny memory leak during reconfig Marek 'marx' Grac (1): ? ? ? fencing: Remove 'ipport' option from WTI fence agent cman/daemon/Makefile? ? ? ? ? ? ? ???|? ? 3 +- cman/daemon/cman-preconfig.c? ? ? ???|???32 +++- cman/daemon/fnvhash.c? ? ? ? ? ? ? ? |???93 +++++++++ cman/daemon/fnvhash.h? ? ? ? ? ? ? ? |? ? 1 + config/plugins/ldap/99cluster.ldif???|???10 +- config/plugins/ldap/ldap-base.csv? ? |? ? 3 +- config/tools/man/ccs_tool.8? ? ? ? ? |???15 +-- config/tools/xml/cluster.rng.in? ? ? |? ? 3 + dlm/libdlmcontrol/Makefile? ? ? ? ???|? ? 2 + dlm/libdlmcontrol/libdlmcontrol.h? ? |? ? 1 + dlm/libdlmcontrol/main.c? ? ? ? ? ???|? ? 5 + dlm/man/dlm_tool.8? ? ? ? ? ? ? ? ???|? ? 4 + dlm/tool/main.c? ? ? ? ? ? ? ? ? ? ? |???25 +++- doc/COPYRIGHT? ? ? ? ? ? ? ? ? ? ? ? |? ? 6 + fence/agents/lib/fencing_snmp.py.py? |???13 +- fence/agents/wti/fence_wti.py? ? ? ? |? ? 
2 +- gfs2/convert/gfs2_convert.c? ? ? ? ? |? 377 ++++++++++++++++++++++++++++------ gfs2/fsck/link.c? ? ? ? ? ? ? ? ? ???|? ? 2 - gfs2/fsck/pass1b.c? ? ? ? ? ? ? ? ???|???10 +- gfs2/fsck/pass3.c? ? ? ? ? ? ? ? ? ? |? ? 5 +- gfs2/fsck/pass4.c? ? ? ? ? ? ? ? ? ? |? ? 5 +- gfs2/libgfs2/device_geometry.c? ? ???|???38 ++++- gfs2/libgfs2/fs_ops.c? ? ? ? ? ? ? ? |? ? 1 + gfs2/libgfs2/libgfs2.h? ? ? ? ? ? ???|???10 + gfs2/mkfs/main_mkfs.c? ? ? ? ? ? ? ? |???69 ++++++- gfs2/quota/check.c? ? ? ? ? ? ? ? ???|? ? 3 +- gfs2/quota/main.c? ? ? ? ? ? ? ? ? ? |? 104 +++------- group/dlm_controld/cpg.c? ? ? ? ? ???|???62 ++++-- group/dlm_controld/dlm_controld.h? ? |? ? 1 + group/dlm_controld/dlm_daemon.h? ? ? |???44 +++- group/dlm_controld/main.c? ? ? ? ? ? |???60 +++++- group/dlm_controld/plock.c? ? ? ? ???|? 241 ++++++++++++++-------- group/gfs_controld/plock.c? ? ? ? ???|???12 +- group/gfs_controld/util.c? ? ? ? ? ? |? ? 3 +- rgmanager/src/daemons/event_config.c |? ? 2 + rgmanager/src/daemons/rg_state.c? ???|? ? 2 + rgmanager/src/daemons/rg_thread.c? ? |? ? 4 +- 37 files changed, 967 insertions(+), 306 deletions(-) -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.10 (MingW32) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iQIcBAEBAgAGBQJLvxgDAAoJEFA6oBJjVJ+OOzUP/3Wl1ChlhmcjVvClhDyZhI4q aPSTnChG1b40WB7sh7UQsVcD0mwAsPPsgDaZZUlybhWl2LylxZ5xEwu7VWoL8SwJ 8Q4aYT1Svp6jFfvqdmoRFmJfjp+vc3y7Gllx3NP6kLmf62TbTROgbc3X++72IFkf 14DPEonWao2FzKx7MaoZCSttc0djuILd+UNh7EEgqC2lyR2r3tatmCa1i/eT2Pfy fwISqy4ioNie5i5SMO7fS9y4NCLnognMgeuH5iS5EJDUViougWyQSSorI8SQq36f ZRyrrUwuUivT2ylXyz3TgfuojGpRuFy2AC1oBxRsiDOVyMrVFHX4NaP5E18J4qs1 0acYMULOpZYcwgKaLMy6haiYWwfvjFvI71zs4mKijmsWvuPbGTyVx7yxDJJco8SM OQBF5holEHqOo4FVekFa6De0GUMjfgmpGhfPTtuw04/ww5pbNp84Y4TzEOsRA9dd H6ak9yLwN4chjyDWRQxHsDnxCf67oqYDZJL5t1QlMauxruGYdXU3xIZRC9E4oYbW +vu+DTbkMGg70xg2MbXH3E7EkGHeJ9EWgiuEh5l4pavrEo14rf80O0dtf+myn8t7 HosKmXjjdnjaVfYNimUH7/0mnISxX2YOO9uzBD6A/X9bqxrxC1Ky6TdI6tFN80dz nH3IJrLomvkmnadhFRqg =Z/pM -----END PGP SIGNATURE----- ------------------------------ Message: 9 Date: Fri, 9 Apr 2010 10:00:51 -0300 From: Carlos Maiolino To: linux clustering Subject: Re: [Linux-cluster] Clustering and Cluster-Storage channels Message-ID: <20100409130051.GA31186 at andromeda.usersys.redhat.com> Content-Type: text/plain; charset=iso-8859-1 On Fri, Apr 09, 2010 at 01:39:21PM +0200, frank wrote: > Hi, > we have several machines with RH 5.4 and we use Cluster and > Cluster-Storage (because we use GFS). > We also have troubles in updates because we don't know how to > subscribe out machines to that channels. From RHN, in "Software > Channel Subscriptions" part, we see: > > Release Channels for Red Hat Enterprise Linux 5 for x86_64 >? ???RHEL FasTrack (v. 5 for 64-bit x86_64) (Channel Details) > Consumes a regular entitlement (13 available) >? ???RHEL Optional Productivity Apps (v. 5 for 64-bit x86_64) > (Channel Details)? ???Consumes a regular entitlement (13 available) >? ???RHEL Supplementary (v. 5 for 64-bit x86_64) (Channel Details) > Consumes a regular entitlement (13 available) >? ???RHEL Virtualization (v. 5 for 64-bit x86_64) (Channel Details) > Consumes a regular entitlement (9 available) >? ???Red Hat Network Tools for RHEL Server (v.5 64-bit x86_64) > (Channel Details)? ???Consumes a regular entitlement (10 available) > BETA Channels for Red Hat Enterprise Linux 5 for x86_64 >? ???RHEL Optional Productivity Apps (v. 5 for 64-bit x86_64) Beta > (Channel Details)? ???Consumes a regular entitlement (13 available) >? ???RHEL Supplementary (v. 
5 for 64-bit x86_64) Beta (Channel > Details)? ???Consumes a regular entitlement (13 available) >? ???RHEL Virtualization (v. 5 for 64-bit x86_64) Beta (Channel > Details)? ???Consumes a regular entitlement (13 available) >? ???Red Hat Enterprise Linux (v. 5 for 64-bit x86_64) Beta (Channel > Details)? ???Consumes a regular entitlement (13 available) > Additional Services Channels for Red Hat Enterprise Linux 5 for x86_64 >? ???RHEL Hardware Certification (v. 5 for 64-bit x86_64) (Channel > Details)? ???Consumes a regular entitlement (13 available) > Additional Services BETA Channels for Red Hat Enterprise Linux 5 for x86_64 >? ???RHEL Cluster-Storage (v. 5 for 64-bit x86_64) Beta (Channel > Details)? ???Consumes a regular entitlement (10 available) >? ???RHEL Clustering (v. 5 for 64-bit x86_64) Beta (Channel Details) > Consumes a regular entitlement (10 available) >? ???RHEL Hardware Certification (v. 5 for 64-bit x86_64) Beta > (Channel Details)? ???Consumes a regular entitlement (13 available) > > Thera are cluster beta channels, but not the release ones. How can > we subscribe systems to them? > > Thanks and Regards. > > Frank > > -- > Aquest missatge ha estat analitzat per MailScanner > a la cerca de virus i d'altres continguts perillosos, > i es considera que est? net. > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster Hello Frank. I guess is better you to contact Red Hat support, once this looks like a subscription problem than a thecnical problem. see you ;) -- --- Best Regards Carlos Eduardo Maiolino ------------------------------ -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster End of Linux-cluster Digest, Vol 72, Issue 9 ******************************************** -------------- next part -------------- An HTML attachment was scrubbed... URL: From jcasale at activenetwerx.com Sat Apr 10 17:19:08 2010 From: jcasale at activenetwerx.com (Joseph L. Casale) Date: Sat, 10 Apr 2010 17:19:08 +0000 Subject: [Linux-cluster] Service recovery policy question Message-ID: I have a service that I have changed the default from restart to relocate. Now I need to add one more script resource to it, but this script resource is not essential, and if it tanks, a restart will certainly be sufficient so I assume I have to create another service so I can tune the Restart Policy Extensions for it, whereas the Recovery Policy for the original remains as Relocate. I have looked all through the wiki and net, but I can't see how to group or tie services together so they always follow each other around? These two services can't be on distinct nodes, anyone have any ideas? Thanks, jlc From duplessis.jacques at gmail.com Sun Apr 11 22:51:16 2010 From: duplessis.jacques at gmail.com (Jacques Duplessis) Date: Sun, 11 Apr 2010 15:51:16 -0700 (PDT) Subject: [Linux-cluster] =?utf-8?q?Invitation_=C3=A0_se_connecter_sur_Link?= =?utf-8?q?edIn?= Message-ID: <1726321160.2908294.1271026276594.JavaMail.app@ech3-cdn11.prod> LinkedIn ------------Jacques Duplessis requested to add you as a connection on LinkedIn: ------------------------------------------ Marian, J'aimerais vous inviter ? rejoindre mon r?seau professionnel en ligne, sur le site LinkedIn. 
Jacques Accept invitation from Jacques Duplessis http://www.linkedin.com/e/ulDuieLaAX544oVCOYcgj_GaXIys4TuLMXGmOx/blk/I1956975689_2/1BpC5vrmRLoRZcjkkZt5YCpnlOt3RApnhMpmdzgmhxrSNBszYOnPAUdzkTejoRej59bSxbiB1gmD1VbPwUcj4QcPoNcz4LrCBxbOYWrSlI/EML_comm_afe/ View invitation from Jacques Duplessis http://www.linkedin.com/e/ulDuieLaAX544oVCOYcgj_GaXIys4TuLMXGmOx/blk/I1956975689_2/39vejwSdjsVdzkVckALqnpPbOYWrSlI/svi/ ------------------------------------------ Why might connecting with Jacques Duplessis be a good idea? People Jacques Duplessis knows can discover your profile: Connecting to Jacques Duplessis will attract the attention of LinkedIn users. See who's been viewing your profile: http://www.linkedin.com/e/wvp/inv18_wvmp/ ------ (c) 2010, LinkedIn Corporation -------------- next part -------------- An HTML attachment was scrubbed... URL:

From bahkha at gmail.com Mon Apr 12 11:09:22 2010
From: bahkha at gmail.com (Bachman Kharazmi)
Date: Mon, 12 Apr 2010 13:09:22 +0200
Subject: [Linux-cluster] ccsd not starting
In-Reply-To: References: Message-ID:

Hi
I'm running Debian Lenny, where the packages gfs2-tools and redhat-cluster-suite are installed. When I do /etc/init.d/cman start I get:

Starting cluster manager:
 Loading kernel modules: done
 Mounting config filesystem: done
 Starting cluster configuration system: done
 Joining cluster: cman_tool: ccsd is not running
 done
 Starting daemons: groupd fenced dlm_controld gfs_controld
 Joining fence domain: fence_tool: can't communicate with fenced -1
 done
 Starting Quorum Disk daemon: done

ccsd doesn't run; is that the reason why fence cannot communicate? My cluster.conf and /etc/default/cman look like:

web3:~# cat /etc/default/cman
CLUSTERNAME="cluster"
NODENAME="web3"
USE_CCS="yes"
CLUSTER_JOIN_TIMEOUT=300
CLUSTER_JOIN_OPTIONS=""
CLUSTER_SHUTDOWN_TIMEOUT=60

web3:~# cat /etc/cluster/cluster.conf
[cluster.conf XML stripped by the list's HTML filter; only blank indentation remains in the archive]

web3:~# /usr/sbin/ccsd
web3:~# ps ax | grep ccsd
11935 pts/0    S+     0:00 grep ccsd

strace /usr/sbin/ccsd output: http://pastebin.ca/1859435
The process seems to die after reading cluster.conf.

I have an iSCSI block device /dev/sda available to three initiators. The gfs2 fs is created using: mkfs.gfs2 -t cluster:share1 -p lock_dlm -j 4 /dev/sda1
The kernel is the default: 2.6.26-2-amd64
Have I missed anything to make the cman startup work properly?

From jose.neto at liber4e.com Mon Apr 12 13:34:54 2010
From: jose.neto at liber4e.com (jose nuno neto)
Date: Mon, 12 Apr 2010 13:34:54 -0000 (GMT)
Subject: [Linux-cluster] iscsi qdisk failure cause reboot
Message-ID:

Hi2All
I have the following setup: 2 nodes + qdisk (iSCSI w/ 2 network paths and multipath)

On qdisk I have allow_kill=0 and reboot=1, since I have some heuristics and want to force some switching on network events.

The issue I'm facing now is that on iSCSI problems on one node (network down, for example) I have no impact on the cluster (which is OK for me), but at recovery the node gets rebooted (not fenced by the other node).

If, when iSCSI goes down, I do a qdisk stop, then let iSCSI recover, then do a qdisk start, I get no reboot.

Is this proper qdisk behavior? Does it keep track of some error and force a reboot?

Thanks
Jose
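For reference, the kind of quorum-disk definition described above lives in cluster.conf and, with allow_kill=0, reboot=1 and a network heuristic, would look roughly like the sketch below. This is only an illustration: the label, timings, score and ping target are made-up placeholders, not values from this poster's setup.

<quorumd interval="2" tko="10" votes="1" label="qdisk" allow_kill="0" reboot="1">
        <!-- heuristic: lower the score when the default gateway stops answering -->
        <heuristic program="ping -c1 -w1 192.168.1.254" score="1" interval="2"/>
</quorumd>

The heuristic is evaluated every interval seconds; see qdisk(5) for what allow_kill and reboot control.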
From brem.belguebli at gmail.com Mon Apr 12 14:48:19 2010
From: brem.belguebli at gmail.com (brem belguebli)
Date: Mon, 12 Apr 2010 16:48:19 +0200
Subject: Re: [Linux-cluster] iscsi qdisk failure cause reboot
In-Reply-To: References: Message-ID:

Hi Jose,

Check out the logs of the other nodes (the ones that remained alive) to see if you don't have a message telling you that the node was killed "because it has rejoined the cluster with existing state".

Also, you could add max_error_cycles="your value" to your <quorumd device="..."/> definition in order to make qdisk exit after "your value" missed cycles. I posted a message a while ago about this 'max_error_cycles' feature not working, but I was wrong.... Thx Lon

If your quorum device is multipathed, make sure you don't queue (no_path_retry queue), as queueing won't generate an I/O error to the upper layer (qdisk), and make sure the number of retries isn't higher than your qdisk interval (in my setup, no_path_retry fail, which means an immediate I/O error).

Brem

2010/4/12 jose nuno neto : > Hi2All > > I have the following setup: > .2node + qdisk ( iscsi w/ 2network paths and multipath ) > > on qkdisk I have allow_kill=0 and reboot=1 since I have some heuristics > and want to force some switching on network events > > the issue I'm facing now is that on iscsi problems on 1node ( network down > for ex ) > I have no impact on cluster ( witch is ok for me ) but at recovery the > node gets rebooted ( not fenced by the other node ) > > If on iscsi going down, I do a qdisk stop, then iscsi recover, then qdisk > start I get no reboot > > Is this proper qdisk behavior? It keeps track of some error and forces > reboot? > > Thanks > Jose > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster >

From swap_project at yahoo.com Mon Apr 12 21:57:10 2010
From: swap_project at yahoo.com (Srija)
Date: Mon, 12 Apr 2010 14:57:10 -0700 (PDT)
Subject: [Linux-cluster] Configuration httpd service in linux cluster
In-Reply-To: Message-ID: <37531.56590.qm@web112802.mail.gq1.yahoo.com>

Hi,

I am trying to configure an httpd service in my 3-node cluster environment (RHEL 5.4 x86_64). I am new to cluster configuration.

I have followed this document: http://www.linuxtopia.org/online_books/linux_system_administration/redhat_cluster_configuration_and_management/s1-apache-inshttpd.html

But somehow it is not working. The node I am assigning is getting fenced. Sometimes the server hangs at the start of clvmd.

For this configuration I have kept my data in an LVM partition, and I am using this partition as the httpd content.

Also, I do not understand which IP I should assign for this service. I configured it first with the public IP of the server; it did not work. Then I configured it with the private IP; it did not work either.

If anybody can point me to documentation which I can understand and follow, it will be really appreciated.

Thanks in advance.

From pmdyer at ctgcentral2.com Mon Apr 12 22:22:57 2010
From: pmdyer at ctgcentral2.com (Paul M. Dyer)
Date: Mon, 12 Apr 2010 17:22:57 -0500 (CDT)
Subject: Re: [Linux-cluster] Configuration httpd service in linux cluster
In-Reply-To: <37531.56590.qm@web112802.mail.gq1.yahoo.com> Message-ID: <35054.21271110977480.JavaMail.root@athena>

Configure a unique IP address, different from the public or private addresses already used. Probably, you want the Apache IP to be on the public subnet. The Apache IP address will move to different nodes of the cluster along with the Apache service.
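As a rough illustration of this advice (this block is not from the original reply), a cluster.conf service with its own floating address could look something like the following; the failover domain, address, device and mount point are placeholders, and the stock httpd init script is used as the service script:

<service autostart="1" domain="httpd-domain" name="httpd-svc" recovery="relocate">
        <!-- floating IP on the public subnet; it moves with the service -->
        <ip address="192.168.1.100" monitor_link="1"/>
        <!-- shared volume that holds the DocumentRoot content -->
        <fs name="wwwdata" device="/dev/vg_web/lv_www" mountpoint="/var/www/html" fstype="ext3"/>
        <!-- start/stop httpd through its init script -->
        <script name="httpd" file="/etc/init.d/httpd"/>
</service>

Clients then connect to the floating 192.168.1.100 address rather than to any node's own IP, which is what lets the service relocate without reconfiguration.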
Paul ----- Original Message ----- From: "Srija" To: "linux clustering" Sent: Monday, April 12, 2010 4:57:10 PM (GMT-0600) America/Chicago Subject: [Linux-cluster] Configuration httpd service in linux cluster Hi, I am trying to configure httpd service in my 3 nodes cluster environment. (RHEL5.4 86_64). I am new to the cluster configuration. I have followed the document as follows: http://www.linuxtopia.org/online_books/linux_system_administration/redhat_cluster_configuration_and_management/s1-apache-inshttpd.html But somehow it is not working. The node i am assiging is getting fenced. Sometimes the sever getting hung at the starting of clvmd. For this configuration I have kept my data in a lvm partition , and this partition I am using as the httpd content . Also I am not understanding which IP i will assign for this service. I configured first with public ip of the server, it did not work. Then i configured with the private IP, it did not work too. If anybody guides me with a documentation which I can understand and follow, it will be really appreciated. Thanks in advance. -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From bahkha at gmail.com Mon Apr 12 22:12:14 2010 From: bahkha at gmail.com (Bachman Kharazmi) Date: Tue, 13 Apr 2010 00:12:14 +0200 Subject: [Linux-cluster] [solved] Re: ccsd not starting In-Reply-To: References: Message-ID: I had disabled the ipv6 module in Debian which caused that ccsd could not start. The default settings in Debian is no start argument, which means both ipv4 and ipv6 enabled, and the current stable ccsd in Lenny cannot start if ipv6 is disabled in OS and no start argument "-4" specified. From what I have heard the official Lenny packages are old (cman 2.20081102-1+lenny1). Unfortunately the ccsd did not log to messages about the missing support on start-up, but /usr/sbin/ccsd -n did print to stdout. On 12 April 2010 13:09, Bachman Kharazmi wrote: > Hi > I'm running Debail Lenny where packages: gfs2-tools and > redhat-cluster-suite are installed. > When I do /etc/init.d/cman start ? I get: > Starting cluster manager: > ?Loading kernel modules: done > ?Mounting config filesystem: done > ?Starting cluster configuration system: done > ?Joining cluster:cman_tool: ccsd is not running > > ?done > ?Starting daemons: groupd fenced dlm_controld gfs_controld > ?Joining fence domain:fence_tool: can't communicate with fenced -1 > ?done > ?Starting Quorum Disk daemon: done > > ccsd doesn't run, and that is the reason why fence cannot communicate? > > My cluster.conf and /etc/default/cman looks like: > > web3:~# cat /etc/default/cman > CLUSTERNAME="cluster" > NODENAME="web3" > USE_CCS="yes" > CLUSTER_JOIN_TIMEOUT=300 > CLUSTER_JOIN_OPTIONS="" > CLUSTER_SHUTDOWN_TIMEOUT=60 > > web3:~# cat /etc/cluster/cluster.conf > > > > > > > > ? ? ? ? > ? ? ? ? ? ? ? ? > ? ? ? ? ? ? ? ? ? ? ? ? > ? ? ? ? ? ? ? ? > ? ? ? ? > > > > ? ? ? ? > ? ? ? ? ? ? ? ? > ? ? ? ? ? ? ? ? ? ? ? ? > ? ? ? ? ? ? ? ? > ? ? ? ? > > > > ? ? ? ? > ? ? ? ? ? ? ? ? > ? ? ? ? ? ? ? ? ? ? ? ? > ? ? ? ? ? ? ? ? > ? ? ? ? > > > > > ? ? ? ? > > > > web3:~# /usr/sbin/ccsd > web3:~# ps ax | grep ccsd > 11935 pts/0 ? ?S+ ? ? 0:00 grep ccsd > > strace /usr/sbin/ccsd output: http://pastebin.ca/1859435 > the process seem to die after reading the cluster.conf > > I have a iscsi block device /dev/sda available at three initiators. 
> > gfs2 fs is created using: > mkfs.gfs2 -t cluster:share1 -p lock_dlm -j 4 /dev/sda1 > > kernel is default: 2.6.26-2-amd64 > Have I missed anything to make the cman startup work properly? > From jose.neto at liber4e.com Tue Apr 13 09:07:00 2010 From: jose.neto at liber4e.com (jose nuno neto) Date: Tue, 13 Apr 2010 09:07:00 -0000 (GMT) Subject: [Linux-cluster] iscsi qdisk failure cause reboot In-Reply-To: References: Message-ID: <9c3d680350c0de020103c8a6110e4e05.squirrel@fela.liber4e.com> Hi Brem I've tried the max_error_cycles setting and it fix this behavior. Thanks a bunch It seems we're on the same path here.... I'm almost finished :-) SeeYou Jose > Hi Jose, > > check out the logs of the other nodes (the ones that remained alive) > to see if you don't have a message telling you that the node was > killed "because it has rejoined the cluster with existing state" > > Also, you could add a max_error_cycles="your value" to your device..../> in order to make qdisk exit after "your value" missed > cycles. > > I have posted a message a few times ago about this feature > 'max_error_cycles' not working, but I was wrong....Thx Lon > > If your quorum device is multipathed, make sure you don't queue > (no_path_retry queue) as it won't generate an ioerror to the upper > layer (qdisk) and that the number of retries isn't higher than your > qdisk interval (in my setup, no_path_retry fail, which means immediate > ioerror). > > Brem > > > 2010/4/12 jose nuno neto : >> Hi2All >> >> I have the following setup: >> .2node + qdisk ( iscsi w/ 2network paths and multipath ) >> >> on qkdisk I have allow_kill=0 and reboot=1 since I have some heuristics >> and want to force some switching on network events >> >> the issue I'm facing now is that on iscsi problems on 1node ( network >> down >> for ex ) >> I have no impact on cluster ( witch is ok for me ) but at recovery the >> node gets rebooted ( not fenced by the other node ) >> >> If on iscsi going down, I do a qdisk stop, then iscsi recover, then >> qdisk >> start I get no reboot >> >> Is this proper qdisk behavior? It keeps track of some error and forces >> reboot? >> >> Thanks >> Jose >> >> >> >> >> >> >> >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From christoph at macht-blau.org Tue Apr 13 10:02:50 2010 From: christoph at macht-blau.org (C. Handel) Date: Tue, 13 Apr 2010 12:02:50 +0200 Subject: [Linux-cluster] ocf_log Message-ID: On Thu, Apr 1, 2010 at 6:00 PM, wrote: >>> i'm writing a custom resource agent. In the resource agent i try to >>> use the ocf_log funtions but they don't work as expected. When i run >>> the rgmanager in the foreground (clurgmgrd -df) i get all the message >>> i want. When running as a normal daemon i can't find my log entries. >> Have you defined a syslog.conf entry for your local4 facility ? > yes. Messages from logger (which uses the same facility as rm) and > debug messages from the ip resource agent show up. To complete this question for the archives. i missed sbin in my PATH environment variable. Resource Agents (when called from clumgrd) have a default path without sbin. At the beginning of my agent i also included sbin, but i didn't export it. When calling ocf_log, the actual log is done by a call to "clulog" which is in sbin. As the path of the included shellscript is not changed, it is not found. 
so the beginning of the resource agent is

LC_ALL=C
LANG=C
PATH=/bin:/sbin:/usr/bin:/usr/sbin
# remember to export PATH, so the ocf_log function can find its clulog binary
export LC_ALL LANG PATH
. $(dirname $0)/ocf-shellfuncs

Greetings
Christoph

From rajatjpatel at gmail.com Tue Apr 13 17:36:49 2010
From: rajatjpatel at gmail.com (rajatjpatel)
Date: Tue, 13 Apr 2010 23:06:49 +0530
Subject: Re: [Linux-cluster] Linux-cluster Digest, Vol 72, Issue 13
In-Reply-To: References: Message-ID:

Regards, Rajat J Patel
FIRST THEY IGNORE YOU... THEN THEY LAUGH AT YOU... THEN THEY FIGHT YOU... THEN YOU WIN...

1. Configuration httpd service in linux cluster (Srija)

> Hi Srija, which hardware are you using for the cluster? The following links will help you set up an HA cluster: http://studyhat.blogspot.com/2009/11/clustering-linux-ha.html http://studyhat.blogspot.com/2010/01/cluster-hp-ilo.html What I suggest is that you just follow the second link for setting up bond0 and bond1, and then set up your cluster.

2. Re: Configuration httpd service in linux cluster (Paul M. Dyer)
3. [solved] Re: ccsd not starting (Bachman Kharazmi)
4. Re: iscsi qdisk failure cause reboot (jose nuno neto)
5. Re: ocf_log (C. Handel)

-------------- next part -------------- An HTML attachment was scrubbed... URL:
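For the archive, a typical bonded-interface setup on RHEL 5 / CentOS 5 (the kind of bond0/bond1 configuration referred to above) looks roughly like the sketch below. This is not taken from those links; the addresses, interface names and the choice of mode=1 (active-backup) with miimon=100 are example values only.

# /etc/modprobe.conf
alias bond0 bonding
options bond0 miimon=100 mode=1

# /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
IPADDR=10.0.0.11
NETMASK=255.255.255.0
ONBOOT=yes
BOOTPROTO=none

# /etc/sysconfig/network-scripts/ifcfg-eth0  (and the same for eth1 as the second slave)
DEVICE=eth0
MASTER=bond0
SLAVE=yes
ONBOOT=yes
BOOTPROTO=none

A second bond (bond1) for the cluster interconnect is built the same way with its own slaves, after which "service network restart" brings the bonds up.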
From swap_project at yahoo.com Wed Apr 14 03:37:26 2010
From: swap_project at yahoo.com (Srija)
Date: Tue, 13 Apr 2010 20:37:26 -0700 (PDT)
Subject: Re: [Linux-cluster] Configuration httpd service in linux cluster
In-Reply-To: <35054.21271110977480.JavaMail.root@athena> Message-ID: <328337.81217.qm@web112810.mail.gq1.yahoo.com>

Thank you very much. It worked.

While going through the process I faced two issues; they were solved after rebooting the nodes, but I still need to know how to solve them without a reboot. The issues are as follows:

1. On one of the nodes the clvmd daemon timed out while starting up, and after that the server hung.
2. If a node is locked (a DLM lock), how can the lock be cleared without a reboot?

Thanks again

--- On Mon, 4/12/10, Paul M. Dyer wrote: > From: Paul M. Dyer > Subject: Re: [Linux-cluster] Configuration httpd service in linux cluster > To: "linux clustering" > Date: Monday, April 12, 2010, 6:22 PM > Configure a unique IP address, > different from the public or private addresses already > used. Probably, you want the Apache IP to > be on the public subnet. The Apache IP > address will move to different nodes of the cluster along > with the Apache service. > > Paul > > ----- Original Message ----- > From: "Srija" > To: "linux clustering" > Sent: Monday, April 12, 2010 4:57:10 PM (GMT-0600) > America/Chicago > Subject: [Linux-cluster] Configuration httpd service in > linux cluster > > Hi, > > I am trying to configure httpd service in my 3 > nodes cluster environment. (RHEL5.4 86_64). I am new to the > cluster configuration. > > I have followed the document as follows: > http://www.linuxtopia.org/online_books/linux_system_administration/redhat_cluster_configuration_and_management/s1-apache-inshttpd.html > > But somehow it is not working. The node i am assiging is > getting fenced. > Sometimes the sever getting hung at the starting of > clvmd. > > For this configuration I have kept my data in a lvm > partition , and > this partition I am using as the httpd content > . > > Also I am not understanding which IP i will assign for this > service. > > I configured first with public ip of the > server, it did not work. > Then i configured with the private IP, it did not work > too. > > If anybody guides me with a documentation which I can > understand and follow, it will be really appreciated. > > Thanks in advance. > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster >

From esggrupos at gmail.com Wed Apr 14 17:07:43 2010
From: esggrupos at gmail.com (ESGLinux)
Date: Wed, 14 Apr 2010 19:07:43 +0200
Subject: [Linux-cluster] fence_ilo halt instead reboot
Message-ID:

Hi All,

I'm configuring a two-node cluster (they are HP ProLiant DL380 G5) and I have configured the fence nodes this way:

[the cluster.conf fence configuration was stripped by the list's HTML filter]

The problem is that I have run the command fence_node NODE2 and I have seen the halt message, but it hasn't restarted.

I think it must do a reboot, not a halt. Do I have to configure something else (at the BIOS, iLO or OS level)?

Thanks in advance
ESG

-------------- next part -------------- An HTML attachment was scrubbed... URL:
From jcasale at activenetwerx.com Wed Apr 14 17:34:00 2010
From: jcasale at activenetwerx.com (Joseph L. Casale)
Date: Wed, 14 Apr 2010 17:34:00 +0000
Subject: Re: [Linux-cluster] fence_ilo halt instead reboot
In-Reply-To: References: Message-ID:

>I'm configuring a two node cluster (they are HP ProLiant DL380 G5) and I have configured the fence nodes this way:
>
>the problem is that I have run the command fence_node NODE2 and I have seen the halt message and it hasn't restarted
>
>I think it must do a reboot, not a halt. Have I to configure something else (at bios, ilo or SO level)?
>
>Thanks in advance

http://sources.redhat.com/cluster/wiki/FAQ/Fencing#fenced_ilo
http://sources.redhat.com/cluster/wiki/FenceAgentAPI

FWIW, and it is only my opinion, but if there is a problem and the other nodes believe they need to fence it, are you sure you want it back online w/o admin intervention?

Are you sure a reboot will *always* cure whatever caused it to be fenced?

From esggrupos at gmail.com Wed Apr 14 18:00:33 2010
From: esggrupos at gmail.com (ESGLinux)
Date: Wed, 14 Apr 2010 20:00:33 +0200
Subject: Re: [Linux-cluster] fence_ilo halt instead reboot
In-Reply-To: References: Message-ID:

Hi Joseph, thanks for your answer.

2010/4/14 Joseph L. Casale > >I'm configuring a two node cluster (they are HP ProLiant DL380 G5) and I > have configured the fence nodes this way: > > > > login="Administrator" name="ILONODE2" passwd="xxxx"/> > > > >the problem is that I have run the command fence_node NODE2 and I have > seen the halt message and it hasn't restarted > > > >I think it must do a reboot, not a halt. Have I to configure something > else (at bios, ilo or SO level)? > > > >Thanks in advance > > http://sources.redhat.com/cluster/wiki/FAQ/Fencing#fenced_ilo > http://sources.redhat.com/cluster/wiki/FenceAgentAPI > >

I'm going to check these docs.

> FWIW, and it its only my opinion, but if there is a problem and the other > nodes believe they need to fence it, are you sure you want it back online > w/o admin intervention? > > Are you sure a reboot will *always* cure whatever caused it to be fenced? > >

In my situation I prefer to always reboot because the machines are not accessible to me. Now I have a machine halted and I'm waiting for a person to push the power button. Perhaps in the final situation it would be better to halt the system, but now I need to make a reboot!

What a long time to push a button!!!! Perhaps they have to wait 108 minutes to push the button, as in "LOST" ;-))))

Greetings,
ESG

> -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster >

-------------- next part -------------- An HTML attachment was scrubbed... URL:

From mylinuxhalist at gmail.com Wed Apr 14 18:34:59 2010
From: mylinuxhalist at gmail.com (My LinuxHAList)
Date: Wed, 14 Apr 2010 14:34:59 -0400
Subject: [Linux-cluster] forcefully taking over a service from another node, kdump
Message-ID:

Setup: 2 nodes: node1, node2. IPMI fencing mechanism.

I'm trying to minimize downtime and to get kdump at the same time; while the fail-over process works fine w/o kdump'ing, I need to tweak post_fail_delay to be high enough to ensure that the panicking node won't get fenced.

To ensure that kdump works, I need to set post_fail_delay to 1200 secs (to ensure that the dumping process has completed; big memory), and have the post-kdump script sleep for another 1200 seconds. That way, say node1 panicked: it would kdump itself and then go to sleep for a while.
node2 then will fence node1 (reboot it via IPMI) and take over the service, most likely while node1 is sleeping in the post-kdump script.

This has the drawbacks of losing the service for 1,200 seconds (while kdumping) and of assuming that the kdump will finish within 1,200 seconds.

=== Working on a new solution ===

I'm working on a solution for this via a kdump_pre script. When node1 panics, before kdumping, it would contact node2 so that node2 will attempt to take over the service.

At node2, I found [the service] running at node1 and issued: clusvcadm -r [the service]. Because of node1's state (it is kdumping), the command just hangs, so it did not manage to cut down the service downtime.

What can I do at node2 to forcefully take over the service from node1 after node2 is contacted by node1 at the kdump_pre stage?

Thanks

-------------- next part -------------- An HTML attachment was scrubbed... URL:

From gordan at bobich.net Wed Apr 14 19:32:22 2010
From: gordan at bobich.net (Gordan Bobic)
Date: Wed, 14 Apr 2010 20:32:22 +0100
Subject: Re: [Linux-cluster] forcefully taking over a service from another node, kdump
In-Reply-To: References: Message-ID: <4BC61846.6010904@bobich.net>

My LinuxHAList wrote: > What can I do at node2 to forcefully take over the service from node1 > after node2 is contacted by node1 at kdump_pre stage ?

Sounds like you need to write your own fencing script that will return successfully fenced status when it knows node2 is down and dumping, and only reboot it some time later. How can you remotely check that the failed node is actually kdumping?

Gordan

From mylinuxhalist at gmail.com Wed Apr 14 20:22:48 2010
From: mylinuxhalist at gmail.com (My LinuxHAList)
Date: Wed, 14 Apr 2010 16:22:48 -0400
Subject: Re: [Linux-cluster] forcefully taking over a service from another node, kdump
In-Reply-To: <4BC61846.6010904@bobich.net> References: <4BC61846.6010904@bobich.net> Message-ID:

In this case, node1, which is kdumping, executes a kdump_pre script that tells node2 that it is kdumping. Upon this notification, node2 then tries to take over the service.

I hate my current work-around because:
1) It assumes how much time it takes to do the kdump
2) If the machine dies (say the power is completely unplugged), I would still have post_fail_delay to wait for (an unnecessary wait).

My guess is custom fencing will allow me to react differently:
1) While the other node is kdumping, return success (and thus rgmanager will go ahead and take over the resources)
2) If the node is not kdumping, do the usual fencing and return the code from the usual fencing

It sounds do-able; it does add another layer of complexity to what I already have (with pre_dump, post_dump scripts, and services listening on both ends to know the other node is kdumping).

Thanks again for the input.

On Wed, Apr 14, 2010 at 3:32 PM, Gordan Bobic wrote: > My LinuxHAList wrote: > > What can I do at node2 to forcefully take over the service from node1 >> after node2 is contacted by node1 at kdump_pre stage ? >> > > > Sounds like you need to write your own fencing script that will return > successfully fenced status when it knows node2 is down and dumping, and only > reboot it some time later. How can you remotely check that the failed node > is actually kdumping? > > Gordan > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster >

-------------- next part -------------- An HTML attachment was scrubbed... URL:
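A wrapper agent along the lines described in the message above might be sketched as follows. This is only a sketch, not a tested agent: the underlying agent (fence_ipmilan), the flag file that the kdump_pre notification is presumed to create on the surviving node, and the delay are all assumptions to adapt.

#!/bin/bash
# Sketch of a kdump-aware fence wrapper (untested).
# fenced passes the fencing arguments as key=value lines on stdin
# (see the FenceAgentAPI page linked earlier in this thread).

REAL_AGENT=/sbin/fence_ipmilan          # assumed real agent underneath
KDUMP_FLAG=/var/run/peer_is_kdumping    # assumed flag dropped by the kdump_pre notification
DELAY=1200                              # seconds to wait before really power-cycling

ARGS=$(cat)                             # capture the key=value lines from fenced

if [ -e "$KDUMP_FLAG" ]; then
    # The peer announced it is dumping: report success right away so the
    # service can be taken over, and power-cycle the node later in the background.
    ( sleep "$DELAY"; printf '%s\n' "$ARGS" | "$REAL_AGENT" ) >/dev/null 2>&1 &
    exit 0
fi

# Normal case: hand everything straight to the real agent.
printf '%s\n' "$ARGS" | "$REAL_AGENT"
exit $?

cluster.conf would then point the node's fence method at this wrapper instead of at fence_ipmilan directly.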
URL: From gordan at bobich.net Wed Apr 14 20:41:36 2010 From: gordan at bobich.net (Gordan Bobic) Date: Wed, 14 Apr 2010 21:41:36 +0100 Subject: [Linux-cluster] forcefully taking over a service from another node, kdump In-Reply-To: References: <4BC61846.6010904@bobich.net> Message-ID: <4BC62880.7080409@bobich.net> My LinuxHAList wrote: > In this case, node1 which is kdumping executes a kdump_pre script that > tells node2 that it is kdumping. > Upon this notification, node2 then tries to take over the service. Right. So you need to make a kdump notification aware fencing agent that delays fencing when it receives kdump notification but at the same time returns fenced status immediately. Gordan From glisha at gmail.com Wed Apr 14 21:34:32 2010 From: glisha at gmail.com (Georgi Stanojevski) Date: Wed, 14 Apr 2010 23:34:32 +0200 Subject: [Linux-cluster] fence_ilo halt instead reboot In-Reply-To: References: Message-ID: > > > In my situation I prefer to allways reboot because the machines are not > accesible to me. Now I have a machine halted and I?m waiting for a person to > push the power button. > > Don't need to wait for someone to push the button. You do have access (poweron/poweroff) from the ILO from the node that actually fenced the device, so you can power it back on your self. alive node# fence_ilo -a 192.168.1.2 -l Administrator -p xxx -o on Or if that doesn't work, open firefox on the alive node and access https://192.168.2.1 (the halted systems ilo). -- Glisha -------------- next part -------------- An HTML attachment was scrubbed... URL: From Chris.Jankowski at hp.com Wed Apr 14 22:35:18 2010 From: Chris.Jankowski at hp.com (Jankowski, Chris) Date: Wed, 14 Apr 2010 22:35:18 +0000 Subject: [Linux-cluster] fence_ilo halt instead reboot In-Reply-To: References: Message-ID: <036B68E61A28CA49AC2767596576CD596B7F266004@GVW1113EXC.americas.hpqcorp.net> ESG, Yes, there is a BIOS entry that you need to modify - "Boot on power on" or some such. I do not remember from the top of my head where it is in the RBSU menu structure, but you can certainly configure it to reboot after fence through iLO. I did that a few months ago for a customer on DL380 G6. Regards, Chris Jankowski ________________________________ From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of ESGLinux Sent: Thursday, 15 April 2010 03:08 To: linux clustering Subject: [Linux-cluster] fence_ilo halt instead reboot Hi All, I?m configuring a two node cluster (they are HP ProLiant DL380 G5) and I have configured the fence nodes this way: the problem is that I have run the command fence_node NODE2 and I have seen the halt message and it hasn?t restarted I think it must do a reboot, not a halt. Have I to configure something else (at bios, ilo or SO level)? Thanks in advance ESG -------------- next part -------------- An HTML attachment was scrubbed... URL: From brem.belguebli at gmail.com Thu Apr 15 01:07:13 2010 From: brem.belguebli at gmail.com (brem belguebli) Date: Thu, 15 Apr 2010 03:07:13 +0200 Subject: [Linux-cluster] fence_ilo halt instead reboot In-Reply-To: References: Message-ID: <1271293633.3885.4.camel@localhost> On Wed, 2010-04-14 at 23:34 +0200, Georgi Stanojevski wrote: > > In my situation I prefer to allways reboot because the > machines are not accesible to me. Now I have a machine halted > and I?m waiting for a person to push the power button. 
> > I do prefer reboot also as my cluster stack is not auto started, but under the admin control > > Don't need to wait for someone to push the button. You do have access > (poweron/poweroff) from the ILO from the node that actually fenced the > device, so you can power it back on your self. > You can still remotely power on the system via ILO. As long as power cords are not removed, ILO is reachable wether or not the system is powered off > alive node# fence_ilo -a 192.168.1.2 -l Administrator -p xxx -o on > > Or if that doesn't work, open firefox on the alive node and access > https://192.168.2.1 (the halted systems ilo). > > -- > Glisha > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From bergman at merctech.com Wed Apr 14 23:20:33 2010 From: bergman at merctech.com (bergman at merctech.com) Date: Wed, 14 Apr 2010 19:20:33 -0400 Subject: [Linux-cluster] request for enhancement: avoid or parallelize quotacheck in fs.sh Message-ID: <15954.1271287233@localhost> Right now all the users here and I are waiting for "quotacheck" to run on 4 filesystems, totalling about 14TB. It's bad enough that quotacheck itself is slow, worse that fs.sh runs quotacheck on each startup, but the really terrible part is that quotacheck is run on each filesystem serially. The best thing would be to check if the filesystem was umounted cleanly and skip quotacheck in that instance. If quotacheck must be run, would be a significant (and simple) enhancement to fs.sh to run all the quotachecks in parallel by running each in the background and using wait() to prevent fs.sh from moving to later stages before the quotachecks are finished. Environment: CentOS 5.4 (2.6.18-164.15.1.el5) RHCS (cman) 2.0.115-1.el5_4.9 Thanks, Mark From esggrupos at gmail.com Thu Apr 15 07:08:10 2010 From: esggrupos at gmail.com (ESGLinux) Date: Thu, 15 Apr 2010 09:08:10 +0200 Subject: [Linux-cluster] fence_ilo halt instead reboot In-Reply-To: <1271293633.3885.4.camel@localhost> References: <1271293633.3885.4.camel@localhost> Message-ID: 2010/4/15 brem belguebli > On Wed, 2010-04-14 at 23:34 +0200, Georgi Stanojevski wrote: > > > > In my situation I prefer to allways reboot because the > > machines are not accesible to me. Now I have a machine halted > > and I?m waiting for a person to push the power button. > > > > > I do prefer reboot also as my cluster stack is not auto started, but > under the admin control > > > > Don't need to wait for someone to push the button. You do have access > > (poweron/poweroff) from the ILO from the node that actually fenced the > > device, so you can power it back on your self. > > > You can still remotely power on the system via ILO. As long as power > cords are not removed, ILO is reachable wether or not the system is > powered off > > alive node# fence_ilo -a 192.168.1.2 -l Administrator -p xxx -o on > > > > Or if that doesn't work, open firefox on the alive node and access > > https://192.168.2.1 (the halted systems ilo). > > > I tried with firefox but I didn?t know how to do it with the iLO admin interface. I suposse I also could do it with the fence_ilo command but I failed again (so I decided to go home :-( ) Now I have the machine on. 
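(Looping back to the quotacheck enhancement request above: a minimal sketch of the parallel approach it asks for is shown below. The mount points are placeholders -- fs.sh would supply its own -- and skipping the check entirely after a clean unmount would be a separate change.)

    # Kick off one quotacheck per filesystem in the background, then
    # wait for all of them instead of checking each one serially.
    for mnt in /data1 /data2 /data3 /data4; do
        quotacheck -ug "$mnt" &
    done
    wait    # do not continue the start sequence until every check is done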
I?m going to try again, Stay tuned ;-) ESG > > -- > > Glisha > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From rajatjpatel at gmail.com Thu Apr 15 07:59:15 2010 From: rajatjpatel at gmail.com (rajatjpatel) Date: Thu, 15 Apr 2010 13:29:15 +0530 Subject: [Linux-cluster] Fwd: fence_ilo In-Reply-To: <82d055df1001300112s572a7021w6b8cedbe78466123@mail.gmail.com> References: <82d055df1001300112s572a7021w6b8cedbe78466123@mail.gmail.com> Message-ID: you can replace the fence_ilo file at /sbin/ bcoz it is version pro with firmware. It will work!! Regards, Rajat J Patel D 803 Royal Classic Link Road Andheri West Mumbai 53 +919920121211 www.taashee.com FIRST THEY IGNORE YOU... THEN THEY LAUGH AT YOU... THEN THEY FIGHT YOU... THEN YOU WIN... -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: fence_ilo Type: application/octet-stream Size: 3304 bytes Desc: not available URL: From esggrupos at gmail.com Thu Apr 15 08:15:55 2010 From: esggrupos at gmail.com (ESGLinux) Date: Thu, 15 Apr 2010 10:15:55 +0200 Subject: [Linux-cluster] fence_ilo halt instead reboot In-Reply-To: References: <1271293633.3885.4.camel@localhost> Message-ID: Hi, I have modified the cluster.conf with this: and I have halted de machine again :-( Perhaps the problem is in the bios as Chris said. I?m tryin to power up the machine with this command: fence_ilo -a 192.168.1.2 -l Administrator -p xxxx -o on I get this message Success: Already ON but I can?t access the machine. :-( I have to wait again!! ESG 2010/4/15 ESGLinux > > > 2010/4/15 brem belguebli > > On Wed, 2010-04-14 at 23:34 +0200, Georgi Stanojevski wrote: >> > >> > In my situation I prefer to allways reboot because the >> > machines are not accesible to me. Now I have a machine halted >> > and I?m waiting for a person to push the power button. >> > >> > >> I do prefer reboot also as my cluster stack is not auto started, but >> under the admin control >> > >> > Don't need to wait for someone to push the button. You do have access >> > (poweron/poweroff) from the ILO from the node that actually fenced the >> > device, so you can power it back on your self. >> > >> You can still remotely power on the system via ILO. As long as power >> cords are not removed, ILO is reachable wether or not the system is >> powered off > > >> > alive node# fence_ilo -a 192.168.1.2 -l Administrator -p xxx -o on >> > >> > Or if that doesn't work, open firefox on the alive node and access >> > https://192.168.2.1 (the halted systems ilo). >> > >> > > I tried with firefox but I didn?t know how to do it with the iLO admin > interface. I suposse I also could do it with the fence_ilo command but I > failed again (so I decided to go home :-( ) > > Now I have the machine on. I?m going to try again, > > Stay tuned ;-) > > ESG > > > > >> > -- >> > Glisha >> > -- >> > Linux-cluster mailing list >> > Linux-cluster at redhat.com >> > https://www.redhat.com/mailman/listinfo/linux-cluster >> >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> > > -------------- next part -------------- An HTML attachment was scrubbed... 
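(The cluster.conf fragment in the message above was stripped by the list's HTML scrubbing. As a purely hypothetical reconstruction of the kind of fragment under discussion -- the device name, login, option="reboot" and the 192.168.1.2 iLO address are reused from text quoted elsewhere in this thread, everything else is a placeholder -- it would look roughly like this.)

    <clusternode name="NODE2" nodeid="2" votes="1">
      <fence>
        <method name="1">
          <device name="ILONODE2" option="reboot"/>
        </method>
      </fence>
    </clusternode>
    <!-- ...rest of <clusternodes> omitted... -->
    <fencedevices>
      <fencedevice agent="fence_ilo" ipaddr="192.168.1.2"
                   login="Administrator" name="ILONODE2" passwd="xxxx"/>
    </fencedevices>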
URL: From keisuke.mori+ha at gmail.com Thu Apr 15 11:05:45 2010 From: keisuke.mori+ha at gmail.com (Keisuke MORI) Date: Thu, 15 Apr 2010 20:05:45 +0900 Subject: [Linux-cluster] forcefully taking over a service from another node, kdump In-Reply-To: References: <4BC61846.6010904@bobich.net> Message-ID: Hi, My LinuxHAList writes: > My guess is custom-fencing will allow me to react differently: > 1) While the other node is kdumping, return success (and thus, the rgmanager > will go ahead and take over the rsources) > 2) If the node is not kdumping, do the usual fencing and return the code for the > usual fencing > > It sounds do-able; it does add another layer of complexity to what I already > have (with pre_dump, post_dump scripts, services listening on both ends to know > the other node is kdumping) I have once developed the exactly same function before, although it was on Pacemaker/Heartbeat cluster. Please find the tool and the discussion at the archive here: http://www.gossamer-threads.com/lists/linuxha/dev/51968 The tool consists of two parts: 1) a fencing agent which checks whether if the other node is kdumping or not, and if it is, return success to continue the fail-over. 2) a customization for the 2nd kernel to allow to be checked remotely from the other node. The former one might be Pacemaker/Heartbeat specific, but the latter one should be usable regardless of the cluster stack. I would be glad if it helps you, and also it would be great if the customization of the 2nd kernel would be incorporated into the standard RedHat distribution so that everybody can easily use this feature. Regards, Keisuke MORI From esggrupos at gmail.com Thu Apr 15 15:07:25 2010 From: esggrupos at gmail.com (ESGLinux) Date: Thu, 15 Apr 2010 17:07:25 +0200 Subject: [Linux-cluster] fence_ilo halt instead reboot In-Reply-To: References: <1271293633.3885.4.camel@localhost> Message-ID: Hi all, Now it works, The problem was simply. Someone had left the installation CD in the drive, so the system didn?t start. what a silly thing!!! Thank you all for your help, ESG 2010/4/15 ESGLinux > Hi, > > I have modified the cluster.conf with this: > > > > > option="reboot"/> > > > > > and I have halted de machine again :-( > > Perhaps the problem is in the bios as Chris said. > > I?m tryin to power up the machine with this command: > > fence_ilo -a 192.168.1.2 -l Administrator -p xxxx -o on > I get this message > Success: Already ON > > but I can?t access the machine. :-( I have to wait again!! > > ESG > > > > > 2010/4/15 ESGLinux > > >> >> 2010/4/15 brem belguebli >> >> On Wed, 2010-04-14 at 23:34 +0200, Georgi Stanojevski wrote: >>> > >>> > In my situation I prefer to allways reboot because the >>> > machines are not accesible to me. Now I have a machine halted >>> > and I?m waiting for a person to push the power button. >>> > >>> > >>> I do prefer reboot also as my cluster stack is not auto started, but >>> under the admin control >>> > >>> > Don't need to wait for someone to push the button. You do have access >>> > (poweron/poweroff) from the ILO from the node that actually fenced the >>> > device, so you can power it back on your self. >>> > >>> You can still remotely power on the system via ILO. As long as power >>> cords are not removed, ILO is reachable wether or not the system is >>> powered off >> >> >>> > alive node# fence_ilo -a 192.168.1.2 -l Administrator -p xxx -o on >>> > >>> > Or if that doesn't work, open firefox on the alive node and access >>> > https://192.168.2.1 (the halted systems ilo). 
>>> > >>> >> >> I tried with firefox but I didn?t know how to do it with the iLO admin >> interface. I suposse I also could do it with the fence_ilo command but I >> failed again (so I decided to go home :-( ) >> >> Now I have the machine on. I?m going to try again, >> >> Stay tuned ;-) >> >> ESG >> >> >> >> >>> > -- >>> > Glisha >>> > -- >>> > Linux-cluster mailing list >>> > Linux-cluster at redhat.com >>> > https://www.redhat.com/mailman/listinfo/linux-cluster >>> >>> >>> -- >>> Linux-cluster mailing list >>> Linux-cluster at redhat.com >>> https://www.redhat.com/mailman/listinfo/linux-cluster >>> >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mylinuxhalist at gmail.com Thu Apr 15 14:11:26 2010 From: mylinuxhalist at gmail.com (My LinuxHAList) Date: Thu, 15 Apr 2010 10:11:26 -0400 Subject: [Linux-cluster] forcefully taking over a service from another node, kdump In-Reply-To: References: <4BC61846.6010904@bobich.net> Message-ID: HI Keisuke, Thanks for the input and ideas. It would be nice to have the features incorporated into "standard" distros. It seems that your solution will obviate the needs for kdump_pre script and kdump_post script. My initial reaction is to use kdump_pre scripts and post scripts to broadcast out which node is kdumping and when they are finished kdump'ing. More things to consider... Thanks On Thu, Apr 15, 2010 at 7:05 AM, Keisuke MORI > wrote: > Hi, > > My LinuxHAList writes: > > My guess is custom-fencing will allow me to react differently: > > 1) While the other node is kdumping, return success (and thus, the > rgmanager > > will go ahead and take over the rsources) > > 2) If the node is not kdumping, do the usual fencing and return the code > for the > > usual fencing > > > > It sounds do-able; it does add another layer of complexity to what I > already > > have (with pre_dump, post_dump scripts, services listening on both ends > to know > > the other node is kdumping) > > I have once developed the exactly same function before, > although it was on Pacemaker/Heartbeat cluster. > > Please find the tool and the discussion at the archive here: > http://www.gossamer-threads.com/lists/linuxha/dev/51968 > > The tool consists of two parts: > 1) a fencing agent which checks whether if the other node is > kdumping or not, and if it is, return success to continue the fail-over. > 2) a customization for the 2nd kernel to allow to be checked remotely > from the other node. > > The former one might be Pacemaker/Heartbeat specific, but the > latter one should be usable regardless of the cluster stack. > > I would be glad if it helps you, and also it would be great if > the customization of the 2nd kernel would be incorporated into > the standard RedHat distribution so that everybody can easily > use this feature. > > Regards, > > Keisuke MORI > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From are at gmx.es Thu Apr 15 17:05:01 2010 From: are at gmx.es (Alex Re) Date: Thu, 15 Apr 2010 19:05:01 +0200 Subject: [Linux-cluster] Two node cluster, start CMAN fence the other node Message-ID: <4BC7473D.9060809@gmx.es> Good afternoon, I'm trying to form my first cluster of two nodes, using iLO fence devices. I need some help because I can't find what I've missed. My main problem is that the "service cman start" reboots the other node and I can't form the two nodes cluster. 
I'm using (at both nodea and nodeb, they are on the same VLAN and pings each other ok): [root at nodea ~]# uname -a Linux nodea 2.6.18-164.15.1.el5 #1 SMP Wed Mar 17 11:30:06 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux [root at nodea ~]# rpm -qa |grep cman cman-2.0.115-1.el5_4.9 [root at nodea ~]# cat /etc/cluster/cluster.conf (nodeb has the same file) When I start the cman service, it hangs up for some time at the "Starting fencing..." step and after those configured 25secs it fences nodeb and reboots it. [root at nodea ~]# service cman start Starting cluster: Loading modules... done Mounting configfs... done Starting ccsd... done Starting cman... done Starting daemons... done Starting fencing... done [ OK ] "nodeb" gets rebooted: [root at nodeb ~]# Broadcast message from root (Thu Apr 15 18:42:24 2010): The system is going down for system halt NOW! At the syslog I just can find: Apr 15 18:40:59 nodea ccsd[16930]: Initial status:: Quorate Apr 15 18:40:59 nodea openais[16936]: [CLM ] Members Left: Apr 15 18:40:59 nodea openais[16936]: [CLM ] Members Joined: Apr 15 18:40:59 nodea openais[16936]: [CLM ] CLM CONFIGURATION CHANGE Apr 15 18:41:00 nodea openais[16936]: [CLM ] New Configuration: Apr 15 18:41:00 nodea openais[16936]: [CLM ] r(0) ip(10.192.16.42) Apr 15 18:41:00 nodea openais[16936]: [CLM ] Members Left: Apr 15 18:41:00 nodea openais[16936]: [CLM ] Members Joined: Apr 15 18:41:00 nodea openais[16936]: [CLM ] r(0) ip(10.192.16.42) Apr 15 18:41:00 nodea openais[16936]: [SYNC ] This node is within the primary component and will provide service. Apr 15 18:41:00 nodea openais[16936]: [TOTEM] entering OPERATIONAL state. Apr 15 18:41:00 nodea openais[16936]: [CMAN ] quorum regained, resuming activity Apr 15 18:41:00 nodea openais[16936]: [CLM ] got nodejoin message 10.192.16.42 Apr 15 18:42:11 nodea fenced[16955]: nodeb not a cluster member after 25 sec post_join_delay Apr 15 18:42:11 nodea fenced[16955]: fencing node "nodeb" Apr 15 18:42:23 nodea fenced[16955]: fence "nodeb" success [root at nodea ~]# clustat Cluster Status for VCluster @ Thu Apr 15 18:55:23 2010 Member Status: Quorate Member Name ID Status ------ ---- ---- ------ nodea 1 Online, Local nodeb 2 Offline Then when nodeb starts again, I try to start cman there to join the cluster... but it again fences "nodea": [root at nodeb ~]# clustat Could not connect to CMAN: No such file or directory [root at nodeb ~]# service cman start Starting cluster: Loading modules... done Mounting configfs... done Starting ccsd... done Starting cman... done Starting qdiskd... done Starting daemons... done Starting fencing... (wait for 25secs again) done [ OK ] "nodea" gets rebooted: [root at nodea ~]# Broadcast message from root (Thu Apr 15 18:58:40 2010): The system is going down for system halt NOW! Apr 15 18:57:31 nodeb openais[11789]: [CLM ] Members Joined: Apr 15 18:57:31 nodeb openais[11789]: [CLM ] r(0) ip(10.192.16.44) Apr 15 18:57:31 nodeb openais[11789]: [SYNC ] This node is within the primary component and will provide service. Apr 15 18:57:31 nodeb openais[11789]: [TOTEM] entering OPERATIONAL state. 
Apr 15 18:57:31 nodeb openais[11789]: [CMAN ] quorum regained, resuming activity Apr 15 18:57:31 nodeb openais[11789]: [CLM ] got nodejoin message 10.192.16.44 Apr 15 18:57:34 nodeb qdiskd[10323]: Quorum Daemon Initializing Apr 15 18:57:34 nodeb qdiskd[10323]: Initialization failed Apr 15 18:58:42 nodeb fenced[11816]: nodea not a cluster member after 25 sec post_join_delay Apr 15 18:58:42 nodeb fenced[11816]: fencing node "nodea" Apr 15 18:58:54 nodeb fenced[11816]: fence "nodea" success And I can't get the two nodes, joining the cluster... I guess I'm missing something at the cluster.conf file??? I can't find what I'm making wrong. Thanks for any help! Alex Re -------------- next part -------------- An HTML attachment was scrubbed... URL: From Jason_Henderson at Mitel.com Thu Apr 15 17:44:13 2010 From: Jason_Henderson at Mitel.com (Jason_Henderson at Mitel.com) Date: Thu, 15 Apr 2010 13:44:13 -0400 Subject: [Linux-cluster] Two node cluster, start CMAN fence the other node In-Reply-To: <4BC7473D.9060809@gmx.es> Message-ID: Most likely the multicast packet communication between the 2 nodes is not getting through your network. linux-cluster-bounces at redhat.com wrote on 04/15/2010 01:05:01 PM: > Good afternoon, > I'm trying to form my first cluster of two nodes, using iLO fence > devices. I need some help because I can't find what I've missed. > My main problem is that the "service cman start" reboots the other > node and I can't form the two nodes cluster. > I'm using (at both nodea and nodeb, they are on the same VLAN and > pings each other ok): > > [root at nodea ~]# uname -a > Linux nodea 2.6.18-164.15.1.el5 #1 SMP Wed Mar 17 11:30:06 EDT 2010 > x86_64 x86_64 x86_64 GNU/Linux > [root at nodea ~]# rpm -qa |grep cman > cman-2.0.115-1.el5_4.9 > > [root at nodea ~]# cat /etc/cluster/cluster.conf (nodeb has the same file) > > > > > > > > > > > > > > > > > > > > > > login="user" name="nodeaILO" passwd="hp"/> > login="user" name="nodebILO" passwd="hp"/> > > > > > > > > When I start the cman service, it hangs up for some time at the > "Starting fencing..." step and after those configured 25secs it > fences nodeb and reboots it. > [root at nodea ~]# service cman start > Starting cluster: > Loading modules... done > Mounting configfs... done > Starting ccsd... done > Starting cman... done > Starting daemons... done > Starting fencing... done > [ OK ] > > "nodeb" gets rebooted: > [root at nodeb ~]# > Broadcast message from root (Thu Apr 15 18:42:24 2010): > > The system is going down for system halt NOW! > > At the syslog I just can find: > Apr 15 18:40:59 nodea ccsd[16930]: Initial status:: Quorate > Apr 15 18:40:59 nodea openais[16936]: [CLM ] Members Left: > Apr 15 18:40:59 nodea openais[16936]: [CLM ] Members Joined: > Apr 15 18:40:59 nodea openais[16936]: [CLM ] CLM CONFIGURATION CHANGE > Apr 15 18:41:00 nodea openais[16936]: [CLM ] New Configuration: > Apr 15 18:41:00 nodea openais[16936]: [CLM ] r(0) ip(10.192.16.42) > Apr 15 18:41:00 nodea openais[16936]: [CLM ] Members Left: > Apr 15 18:41:00 nodea openais[16936]: [CLM ] Members Joined: > Apr 15 18:41:00 nodea openais[16936]: [CLM ] r(0) ip(10.192.16.42) > Apr 15 18:41:00 nodea openais[16936]: [SYNC ] This node is within > the primary component and will provide service. > Apr 15 18:41:00 nodea openais[16936]: [TOTEM] entering OPERATIONAL state. 
> Apr 15 18:41:00 nodea openais[16936]: [CMAN ] quorum regained, > resuming activity > Apr 15 18:41:00 nodea openais[16936]: [CLM ] got nodejoin message > 10.192.16.42 > Apr 15 18:42:11 nodea fenced[16955]: nodeb not a cluster member > after 25 sec post_join_delay > Apr 15 18:42:11 nodea fenced[16955]: fencing node "nodeb" > Apr 15 18:42:23 nodea fenced[16955]: fence "nodeb" success > > [root at nodea ~]# clustat > Cluster Status for VCluster @ Thu Apr 15 18:55:23 2010 > Member Status: Quorate > > Member Name ID Status > ------ ---- ---- ------ > nodea > 1 Online, Local > nodeb 2 Offline > > Then when nodeb starts again, I try to start cman there to join the > cluster... but it again fences "nodea": > [root at nodeb ~]# clustat > Could not connect to CMAN: No such file or directory > [root at nodeb ~]# service cman start > Starting cluster: > Loading modules... done > Mounting configfs... done > Starting ccsd... done > Starting cman... done > Starting qdiskd... done > Starting daemons... done > Starting fencing... (wait for 25secs again) done > [ OK ] > "nodea" gets rebooted: > [root at nodea ~]# > Broadcast message from root (Thu Apr 15 18:58:40 2010): > > The system is going down for system halt NOW! > > Apr 15 18:57:31 nodeb openais[11789]: [CLM ] Members Joined: > Apr 15 18:57:31 nodeb openais[11789]: [CLM ] r(0) ip(10.192.16.44) > Apr 15 18:57:31 nodeb openais[11789]: [SYNC ] This node is within > the primary component and will provide service. > Apr 15 18:57:31 nodeb openais[11789]: [TOTEM] entering OPERATIONAL state. > Apr 15 18:57:31 nodeb openais[11789]: [CMAN ] quorum regained, > resuming activity > Apr 15 18:57:31 nodeb openais[11789]: [CLM ] got nodejoin message > 10.192.16.44 > Apr 15 18:57:34 nodeb qdiskd[10323]: Quorum Daemon Initializing > Apr 15 18:57:34 nodeb qdiskd[10323]: Initialization failed > Apr 15 18:58:42 nodeb fenced[11816]: nodea not a cluster member > after 25 sec post_join_delay > Apr 15 18:58:42 nodeb fenced[11816]: fencing node "nodea" > Apr 15 18:58:54 nodeb fenced[11816]: fence "nodea" success > > And I can't get the two nodes, joining the cluster... > I guess I'm missing something at the cluster.conf file??? I can't > find what I'm making wrong. > > Thanks for any help! > > Alex Re-- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: From jeff.sturm at eprize.com Thu Apr 15 18:34:52 2010 From: jeff.sturm at eprize.com (Jeff Sturm) Date: Thu, 15 Apr 2010 14:34:52 -0400 Subject: [Linux-cluster] Two node cluster, start CMAN fence the other node In-Reply-To: References: <4BC7473D.9060809@gmx.es> Message-ID: <64D0546C5EBBD147B75DE133D798665F055D8F80@hugo.eprize.local> For two node clusters there's a convenient workaround: crossover cable. You'll need a spare Ethernet port but that's easier than getting certain switches to do multicast correctly. (At least in my experience.) From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jason_Henderson at Mitel.com Sent: Thursday, April 15, 2010 1:44 PM To: linux clustering Cc: linux-cluster at redhat.com; linux-cluster-bounces at redhat.com Subject: Re: [Linux-cluster] Two node cluster,start CMAN fence the other node Most likely the multicast packet communication between the 2 nodes is not getting through your network. 
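(On the crossover-cable workaround above: with the RHEL5 stack, cman generally binds the totem/heartbeat traffic to whichever interface the cluster node names resolve to, so the usual trick is to point those names at the back-to-back link in /etc/hosts on both nodes. The 10.0.0.x addresses below are made up for illustration.)

    # /etc/hosts on both nodes -- the names must match the <clusternode name="...">
    # entries so that cluster traffic stays on the crossover cable
    10.0.0.1    nodea
    10.0.0.2    nodeb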
linux-cluster-bounces at redhat.com wrote on 04/15/2010 01:05:01 PM: > Good afternoon, > I'm trying to form my first cluster of two nodes, using iLO fence > devices. I need some help because I can't find what I've missed. > My main problem is that the "service cman start" reboots the other > node and I can't form the two nodes cluster. > I'm using (at both nodea and nodeb, they are on the same VLAN and > pings each other ok): > > [root at nodea ~]# uname -a > Linux nodea 2.6.18-164.15.1.el5 #1 SMP Wed Mar 17 11:30:06 EDT 2010 > x86_64 x86_64 x86_64 GNU/Linux > [root at nodea ~]# rpm -qa |grep cman > cman-2.0.115-1.el5_4.9 > > [root at nodea ~]# cat /etc/cluster/cluster.conf (nodeb has the same file) > > > > > > > > > > > > > > > > > > > > > > login="user" name="nodeaILO" passwd="hp"/> > login="user" name="nodebILO" passwd="hp"/> > > > > > > > > When I start the cman service, it hangs up for some time at the > "Starting fencing..." step and after those configured 25secs it > fences nodeb and reboots it. > [root at nodea ~]# service cman start > Starting cluster: > Loading modules... done > Mounting configfs... done > Starting ccsd... done > Starting cman... done > Starting daemons... done > Starting fencing... done > [ OK ] > > "nodeb" gets rebooted: > [root at nodeb ~]# > Broadcast message from root (Thu Apr 15 18:42:24 2010): > > The system is going down for system halt NOW! > > At the syslog I just can find: > Apr 15 18:40:59 nodea ccsd[16930]: Initial status:: Quorate > Apr 15 18:40:59 nodea openais[16936]: [CLM ] Members Left: > Apr 15 18:40:59 nodea openais[16936]: [CLM ] Members Joined: > Apr 15 18:40:59 nodea openais[16936]: [CLM ] CLM CONFIGURATION CHANGE > Apr 15 18:41:00 nodea openais[16936]: [CLM ] New Configuration: > Apr 15 18:41:00 nodea openais[16936]: [CLM ] r(0) ip(10.192.16.42) > Apr 15 18:41:00 nodea openais[16936]: [CLM ] Members Left: > Apr 15 18:41:00 nodea openais[16936]: [CLM ] Members Joined: > Apr 15 18:41:00 nodea openais[16936]: [CLM ] r(0) ip(10.192.16.42) > Apr 15 18:41:00 nodea openais[16936]: [SYNC ] This node is within > the primary component and will provide service. > Apr 15 18:41:00 nodea openais[16936]: [TOTEM] entering OPERATIONAL state. > Apr 15 18:41:00 nodea openais[16936]: [CMAN ] quorum regained, > resuming activity > Apr 15 18:41:00 nodea openais[16936]: [CLM ] got nodejoin message > 10.192.16.42 > Apr 15 18:42:11 nodea fenced[16955]: nodeb not a cluster member > after 25 sec post_join_delay > Apr 15 18:42:11 nodea fenced[16955]: fencing node "nodeb" > Apr 15 18:42:23 nodea fenced[16955]: fence "nodeb" success > > [root at nodea ~]# clustat > Cluster Status for VCluster @ Thu Apr 15 18:55:23 2010 > Member Status: Quorate > > Member Name ID Status > ------ ---- ---- ------ > nodea > 1 Online, Local > nodeb 2 Offline > > Then when nodeb starts again, I try to start cman there to join the > cluster... but it again fences "nodea": > [root at nodeb ~]# clustat > Could not connect to CMAN: No such file or directory > [root at nodeb ~]# service cman start > Starting cluster: > Loading modules... done > Mounting configfs... done > Starting ccsd... done > Starting cman... done > Starting qdiskd... done > Starting daemons... done > Starting fencing... (wait for 25secs again) done > [ OK ] > "nodea" gets rebooted: > [root at nodea ~]# > Broadcast message from root (Thu Apr 15 18:58:40 2010): > > The system is going down for system halt NOW! 
> > Apr 15 18:57:31 nodeb openais[11789]: [CLM ] Members Joined: > Apr 15 18:57:31 nodeb openais[11789]: [CLM ] r(0) ip(10.192.16.44) > Apr 15 18:57:31 nodeb openais[11789]: [SYNC ] This node is within > the primary component and will provide service. > Apr 15 18:57:31 nodeb openais[11789]: [TOTEM] entering OPERATIONAL state. > Apr 15 18:57:31 nodeb openais[11789]: [CMAN ] quorum regained, > resuming activity > Apr 15 18:57:31 nodeb openais[11789]: [CLM ] got nodejoin message > 10.192.16.44 > Apr 15 18:57:34 nodeb qdiskd[10323]: Quorum Daemon Initializing > Apr 15 18:57:34 nodeb qdiskd[10323]: Initialization failed > Apr 15 18:58:42 nodeb fenced[11816]: nodea not a cluster member > after 25 sec post_join_delay > Apr 15 18:58:42 nodeb fenced[11816]: fencing node "nodea" > Apr 15 18:58:54 nodeb fenced[11816]: fence "nodea" success > > And I can't get the two nodes, joining the cluster... > I guess I'm missing something at the cluster.conf file??? I can't > find what I'm making wrong. > > Thanks for any help! > > Alex Re-- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: From are at gmx.es Fri Apr 16 11:25:03 2010 From: are at gmx.es (Alex Re) Date: Fri, 16 Apr 2010 13:25:03 +0200 Subject: [Linux-cluster] Two node cluster, start CMAN fence the other node In-Reply-To: <64D0546C5EBBD147B75DE133D798665F055D8F80@hugo.eprize.local> References: <4BC7473D.9060809@gmx.es> <64D0546C5EBBD147B75DE133D798665F055D8F80@hugo.eprize.local> Message-ID: <4BC8490F.4090704@gmx.es> Good morning, thanks for your replies! Multicast was definetively my problem. I couldn't use a crossed cable as suggested by Jeff, because these servers are blades, but after checking/configuring the IGMP properties at the switches ports, the cluster started working fine! Thanks again! Alex. On 04/15/2010 08:34 PM, Jeff Sturm wrote: > > For two node clusters there's a convenient workaround: crossover cable. > > You'll need a spare Ethernet port but that's easier than getting > certain switches to do multicast correctly. (At least in my experience.) > > *From:* linux-cluster-bounces at redhat.com > [mailto:linux-cluster-bounces at redhat.com] *On Behalf Of > *Jason_Henderson at Mitel.com > *Sent:* Thursday, April 15, 2010 1:44 PM > *To:* linux clustering > *Cc:* linux-cluster at redhat.com; linux-cluster-bounces at redhat.com > *Subject:* Re: [Linux-cluster] Two node cluster,start CMAN fence the > other node > > > Most likely the multicast packet communication between the 2 nodes is > not getting through your network. > > linux-cluster-bounces at redhat.com wrote on 04/15/2010 01:05:01 PM: > > > Good afternoon, > > I'm trying to form my first cluster of two nodes, using iLO fence > > devices. I need some help because I can't find what I've missed. > > My main problem is that the "service cman start" reboots the other > > node and I can't form the two nodes cluster. 
> > I'm using (at both nodea and nodeb, they are on the same VLAN and > > pings each other ok): > > > > [root at nodea ~]# uname -a > > Linux nodea 2.6.18-164.15.1.el5 #1 SMP Wed Mar 17 11:30:06 EDT 2010 > > x86_64 x86_64 x86_64 GNU/Linux > > [root at nodea ~]# rpm -qa |grep cman > > cman-2.0.115-1.el5_4.9 > > > > [root at nodea ~]# cat /etc/cluster/cluster.conf (nodeb has the same file) > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > login="user" name="nodeaILO" passwd="hp"/> > > > login="user" name="nodebILO" passwd="hp"/> > > > > > > > > > > > > > > > > When I start the cman service, it hangs up for some time at the > > "Starting fencing..." step and after those configured 25secs it > > fences nodeb and reboots it. > > [root at nodea ~]# service cman start > > Starting cluster: > > Loading modules... done > > Mounting configfs... done > > Starting ccsd... done > > Starting cman... done > > Starting daemons... done > > Starting fencing... done > > [ OK ] > > > > "nodeb" gets rebooted: > > [root at nodeb ~]# > > Broadcast message from root (Thu Apr 15 18:42:24 2010): > > > > The system is going down for system halt NOW! > > > > At the syslog I just can find: > > Apr 15 18:40:59 nodea ccsd[16930]: Initial status:: Quorate > > Apr 15 18:40:59 nodea openais[16936]: [CLM ] Members Left: > > Apr 15 18:40:59 nodea openais[16936]: [CLM ] Members Joined: > > Apr 15 18:40:59 nodea openais[16936]: [CLM ] CLM CONFIGURATION CHANGE > > Apr 15 18:41:00 nodea openais[16936]: [CLM ] New Configuration: > > Apr 15 18:41:00 nodea openais[16936]: [CLM ] r(0) ip(10.192.16.42) > > Apr 15 18:41:00 nodea openais[16936]: [CLM ] Members Left: > > Apr 15 18:41:00 nodea openais[16936]: [CLM ] Members Joined: > > Apr 15 18:41:00 nodea openais[16936]: [CLM ] r(0) ip(10.192.16.42) > > Apr 15 18:41:00 nodea openais[16936]: [SYNC ] This node is within > > the primary component and will provide service. > > Apr 15 18:41:00 nodea openais[16936]: [TOTEM] entering OPERATIONAL > state. > > Apr 15 18:41:00 nodea openais[16936]: [CMAN ] quorum regained, > > resuming activity > > Apr 15 18:41:00 nodea openais[16936]: [CLM ] got nodejoin message > > 10.192.16.42 > > Apr 15 18:42:11 nodea fenced[16955]: nodeb not a cluster member > > after 25 sec post_join_delay > > Apr 15 18:42:11 nodea fenced[16955]: fencing node "nodeb" > > Apr 15 18:42:23 nodea fenced[16955]: fence "nodeb" success > > > > [root at nodea ~]# clustat > > Cluster Status for VCluster @ Thu Apr 15 18:55:23 2010 > > Member Status: Quorate > > > > Member Name ID > Status > > ------ ---- ---- > ------ > > nodea > > 1 Online, Local > > nodeb 2 > Offline > > > > Then when nodeb starts again, I try to start cman there to join the > > cluster... but it again fences "nodea": > > [root at nodeb ~]# clustat > > Could not connect to CMAN: No such file or directory > > [root at nodeb ~]# service cman start > > Starting cluster: > > Loading modules... done > > Mounting configfs... done > > Starting ccsd... done > > Starting cman... done > > Starting qdiskd... done > > Starting daemons... done > > Starting fencing... (wait for 25secs again) done > > [ OK ] > > "nodea" gets rebooted: > > [root at nodea ~]# > > Broadcast message from root (Thu Apr 15 18:58:40 2010): > > > > The system is going down for system halt NOW! 
> > > > Apr 15 18:57:31 nodeb openais[11789]: [CLM ] Members Joined: > > Apr 15 18:57:31 nodeb openais[11789]: [CLM ] r(0) ip(10.192.16.44) > > Apr 15 18:57:31 nodeb openais[11789]: [SYNC ] This node is within > > the primary component and will provide service. > > Apr 15 18:57:31 nodeb openais[11789]: [TOTEM] entering OPERATIONAL > state. > > Apr 15 18:57:31 nodeb openais[11789]: [CMAN ] quorum regained, > > resuming activity > > Apr 15 18:57:31 nodeb openais[11789]: [CLM ] got nodejoin message > > 10.192.16.44 > > Apr 15 18:57:34 nodeb qdiskd[10323]: Quorum Daemon Initializing > > Apr 15 18:57:34 nodeb qdiskd[10323]: Initialization failed > > Apr 15 18:58:42 nodeb fenced[11816]: nodea not a cluster member > > after 25 sec post_join_delay > > Apr 15 18:58:42 nodeb fenced[11816]: fencing node "nodea" > > Apr 15 18:58:54 nodeb fenced[11816]: fence "nodea" success > > > > And I can't get the two nodes, joining the cluster... > > I guess I'm missing something at the cluster.conf file??? I can't > > find what I'm making wrong. > > > > Thanks for any help! > > > > Alex Re-- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: From pamadio at redhat.com Fri Apr 16 11:52:55 2010 From: pamadio at redhat.com (Pierre Amadio) Date: Fri, 16 Apr 2010 13:52:55 +0200 Subject: [Linux-cluster] does fencing require a dedicated network ? Message-ID: <20100416115255.GA21735@x60s.localdomain> Hi there ! A customer is planning to use our cluster suite (most probably with 2 nodes cluster running RHEL5 cluster suite). The fencing mechanism will be some rsaII cards. Right now, the rsaII cards are on a another network than the one used for heartbeat. The problem is, it is dedicated to an admin team wich does not want the cluster suite solution to have access to this specific network. The customer does not know yet if he will accept to create a new network and put the rsaII cards (and only them) on it. Would it be a problem as far as the cluster suite and the support coverage is concerned if those devices were available on the same network as the one used for heatbeat ? I would be tempted to say yes to both, because i dont know how one node could fence the other in case there was a communication problem between the 2 node on the heartbeat. The following article seems to states it can be done (even without any quorum disk): http://kbase.redhat.com/faq/docs/DOC-16712 The article is still unverified though. Is it ok to give the customer a green flag using a two node cluster (without quorum disk) where heartbeat and fencing use the same network ? -- Pierre Amadio Technical Account Manager mobile: +33 685 774 477 Red Hat France SARL, 1 rue du G?n?ral Leclerc, 92047 Paris La D?fense Cedex, France. Siret n? 421 199 464 00064 From jose.neto at liber4e.com Fri Apr 16 12:33:48 2010 From: jose.neto at liber4e.com (jose nuno neto) Date: Fri, 16 Apr 2010 12:33:48 -0000 (GMT) Subject: [Linux-cluster] fs.sh status weird timings Message-ID: Hellos I'm doing some tests with SAN/LVM/FS and noticed that the status on the FS resources aren't much accurate, the interval varies a lot. 
I placed a logger command on the fs.sh just before the status workings and this is what I get on /var/log/messages: ( in the meantime I noticed there's a new rmanager release on rhel 5.5 and will try it out ) Apr 16 12:23:17 dc2-x6250-a logger: FS Check neto Apr 16 12:27:47 dc2-x6250-a logger: FS Check neto Apr 16 12:34:07 dc2-x6250-a logger: FS Check neto Apr 16 12:44:07 dc2-x6250-a logger: FS Check neto Apr 16 13:03:07 dc2-x6250-a logger: FS Check neto Apr 16 13:06:17 dc2-x6250-a logger: FS Check neto Apr 16 13:20:17 dc2-x6250-a logger: FS Check neto Apr 16 13:20:47 dc2-x6250-a logger: FS Check neto Apr 16 13:21:07 dc2-x6250-a logger: FS Check neto Apr 16 13:21:17 dc2-x6250-a logger: FS Check neto Apr 16 13:21:37 dc2-x6250-a logger: FS Check neto Apr 16 13:22:27 dc2-x6250-a logger: FS Check neto Apr 16 13:22:37 dc2-x6250-a logger: FS Check neto Apr 16 13:23:07 dc2-x6250-a logger: FS Check neto Apr 16 13:26:46 dc2-x6250-a logger: FS Check neto Apr 16 13:28:27 dc2-x6250-a logger: FS Check neto Apr 16 13:29:00 dc2-x6250-a logger: FS Check neto Apr 16 13:29:17 dc2-x6250-a logger: FS Check neto Apr 16 13:29:39 dc2-x6250-a logger: FS Check neto Apr 16 13:30:16 dc2-x6250-a logger: FS Check neto Apr 16 13:30:27 dc2-x6250-a logger: FS Check neto Apr 16 13:30:36 dc2-x6250-a logger: FS Check neto Apr 16 13:30:37 dc2-x6250-a logger: FS Check neto Apr 16 13:30:57 dc2-x6250-a logger: FS Check neto Apr 16 13:31:59 dc2-x6250-a logger: FS Check neto Apr 16 13:33:00 dc2-x6250-a logger: FS Check neto Apr 16 13:33:06 dc2-x6250-a logger: FS Check neto Apr 16 13:33:16 dc2-x6250-a logger: FS Check neto ################################################# fs.sh status|monitor) logger "FS Check neto" ################################################# These are the resource/service been used: ########################################################################### Timings from fs.sh script Agent: fs.sh Flags: init_on_add destroy_on_delete Attributes: name [ primary ] mountpoint [ unique required ] device [ unique required ] fstype force_unmount quick_status self_fence nfslock [ inherit ] default="nfslock" fsid force_fsck options Actions: start Timeout (hint): 900 seconds stop Timeout (hint): 30 seconds status Timeout (hint): 10 seconds Check Interval: 60 seconds monitor Timeout (hint): 10 seconds Check Interval: 60 seconds status Timeout (hint): 30 seconds OCF Check Depth (status/monitor): 10 seconds Check Interval: 30 seconds monitor Timeout (hint): 30 seconds OCF Check Depth (status/monitor): 10 seconds Check Interval: 30 seconds status Timeout (hint): 30 seconds OCF Check Depth (status/monitor): 20 seconds Check Interval: 60 seconds monitor Timeout (hint): 30 seconds OCF Check Depth (status/monitor): 20 seconds Check Interval: 60 seconds meta-data Timeout (hint): 5 seconds verify-all Timeout (hint): 5 seconds Explicitly defined child resource types: fs [ startlevel = 1 stoplevel = 3 ] clusterfs [ startlevel = 1 stoplevel = 3 ] nfsexport [ startlevel = 3 stoplevel = 1 ] From cmaiolino at redhat.com Fri Apr 16 14:56:00 2010 From: cmaiolino at redhat.com (Carlos Maiolino) Date: Fri, 16 Apr 2010 11:56:00 -0300 Subject: [Linux-cluster] does fencing require a dedicated network ? In-Reply-To: <20100416115255.GA21735@x60s.localdomain> References: <20100416115255.GA21735@x60s.localdomain> Message-ID: <20100416145559.GA2407@andromeda.usersys.redhat.com> Hi Pierre On Fri, Apr 16, 2010 at 01:52:55PM +0200, Pierre Amadio wrote: > Hi there ! 
> > A customer is planning to use our cluster suite (most probably with 2 > nodes cluster running RHEL5 cluster suite). > > The fencing mechanism will be some rsaII cards. > > Right now, the rsaII cards are on a another network than the one used > for heartbeat. The problem is, it is dedicated to an admin team wich > does not want the cluster suite solution to have access to this specific > network. The customer does not know yet if he will accept to create a > new network and put the rsaII cards (and only them) on it. > > Would it be a problem as far as the cluster suite and the support > coverage is concerned if those devices were available on the same > network as the one used for heatbeat ? > Yes and no, depends on the network design > I would be tempted to say yes to both, because i dont know how one node > could fence the other in case there was a communication problem between > the 2 node on the heartbeat. Depends on the problem occurrs, if both nodes (heartbeat and fence) are connected on the same switch without redundancy, and the switch fails, the cluster nodes cannot fence each other (and we still have a split-brain), but if the problem occurs in a node communication (cable problem, NIC problem, switch-port, etc) the healthy node still can send a fence signal to the failed node. So, is a good idea to have or a redundant design (e.g. bonding connected a two separate switchs), or, separate the fence network and use a quorum disk system. > > The following article seems to states it can be done (even without any > quorum disk): > http://kbase.redhat.com/faq/docs/DOC-16712 > > The article is still unverified though. > > Is it ok to give the customer a green flag using a two node cluster > (without quorum disk) where heartbeat and fencing use the same network ? > > > -- > Pierre Amadio > Technical Account Manager mobile: +33 685 774 477 > Red Hat France SARL, 1 rue du G?n?ral Leclerc, 92047 Paris La D?fense > Cedex, France. Siret n? 421 199 464 00064 > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -- --- Best Regards Carlos Eduardo Maiolino Software Maintenance Engineer Red Hat - Global Support Services From Chris.Jankowski at hp.com Fri Apr 16 15:00:57 2010 From: Chris.Jankowski at hp.com (Jankowski, Chris) Date: Fri, 16 Apr 2010 15:00:57 +0000 Subject: [Linux-cluster] Two node cluster, start CMAN fence the other node In-Reply-To: <4BC8490F.4090704@gmx.es> References: <4BC7473D.9060809@gmx.es> <64D0546C5EBBD147B75DE133D798665F055D8F80@hugo.eprize.local> <4BC8490F.4090704@gmx.es> Message-ID: <036B68E61A28CA49AC2767596576CD596B7F2669AC@GVW1113EXC.americas.hpqcorp.net> Alex, What exactly did you configure for IGMP? Did you also separate the cluster interconnect traffic in its own VLAN? Thanks and regards, Chris ________________________________ From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Alex Re Sent: Friday, 16 April 2010 21:25 To: linux clustering Subject: Re: [Linux-cluster] Two node cluster, start CMAN fence the other node Good morning, thanks for your replies! Multicast was definetively my problem. I couldn't use a crossed cable as suggested by Jeff, because these servers are blades, but after checking/configuring the IGMP properties at the switches ports, the cluster started working fine! Thanks again! Alex. On 04/15/2010 08:34 PM, Jeff Sturm wrote: For two node clusters there's a convenient workaround: crossover cable. 
You'll need a spare Ethernet port but that's easier than getting certain switches to do multicast correctly. (At least in my experience.) From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jason_Henderson at Mitel.com Sent: Thursday, April 15, 2010 1:44 PM To: linux clustering Cc: linux-cluster at redhat.com; linux-cluster-bounces at redhat.com Subject: Re: [Linux-cluster] Two node cluster,start CMAN fence the other node Most likely the multicast packet communication between the 2 nodes is not getting through your network. linux-cluster-bounces at redhat.com wrote on 04/15/2010 01:05:01 PM: > Good afternoon, > I'm trying to form my first cluster of two nodes, using iLO fence > devices. I need some help because I can't find what I've missed. > My main problem is that the "service cman start" reboots the other > node and I can't form the two nodes cluster. > I'm using (at both nodea and nodeb, they are on the same VLAN and > pings each other ok): > > [root at nodea ~]# uname -a > Linux nodea 2.6.18-164.15.1.el5 #1 SMP Wed Mar 17 11:30:06 EDT 2010 > x86_64 x86_64 x86_64 GNU/Linux > [root at nodea ~]# rpm -qa |grep cman > cman-2.0.115-1.el5_4.9 > > [root at nodea ~]# cat /etc/cluster/cluster.conf (nodeb has the same file) > > > > > > > > > > > > > > > > > > > > > > login="user" name="nodeaILO" passwd="hp"/> > login="user" name="nodebILO" passwd="hp"/> > > > > > > > > When I start the cman service, it hangs up for some time at the > "Starting fencing..." step and after those configured 25secs it > fences nodeb and reboots it. > [root at nodea ~]# service cman start > Starting cluster: > Loading modules... done > Mounting configfs... done > Starting ccsd... done > Starting cman... done > Starting daemons... done > Starting fencing... done > [ OK ] > > "nodeb" gets rebooted: > [root at nodeb ~]# > Broadcast message from root (Thu Apr 15 18:42:24 2010): > > The system is going down for system halt NOW! > > At the syslog I just can find: > Apr 15 18:40:59 nodea ccsd[16930]: Initial status:: Quorate > Apr 15 18:40:59 nodea openais[16936]: [CLM ] Members Left: > Apr 15 18:40:59 nodea openais[16936]: [CLM ] Members Joined: > Apr 15 18:40:59 nodea openais[16936]: [CLM ] CLM CONFIGURATION CHANGE > Apr 15 18:41:00 nodea openais[16936]: [CLM ] New Configuration: > Apr 15 18:41:00 nodea openais[16936]: [CLM ] r(0) ip(10.192.16.42) > Apr 15 18:41:00 nodea openais[16936]: [CLM ] Members Left: > Apr 15 18:41:00 nodea openais[16936]: [CLM ] Members Joined: > Apr 15 18:41:00 nodea openais[16936]: [CLM ] r(0) ip(10.192.16.42) > Apr 15 18:41:00 nodea openais[16936]: [SYNC ] This node is within > the primary component and will provide service. > Apr 15 18:41:00 nodea openais[16936]: [TOTEM] entering OPERATIONAL state. > Apr 15 18:41:00 nodea openais[16936]: [CMAN ] quorum regained, > resuming activity > Apr 15 18:41:00 nodea openais[16936]: [CLM ] got nodejoin message > 10.192.16.42 > Apr 15 18:42:11 nodea fenced[16955]: nodeb not a cluster member > after 25 sec post_join_delay > Apr 15 18:42:11 nodea fenced[16955]: fencing node "nodeb" > Apr 15 18:42:23 nodea fenced[16955]: fence "nodeb" success > > [root at nodea ~]# clustat > Cluster Status for VCluster @ Thu Apr 15 18:55:23 2010 > Member Status: Quorate > > Member Name ID Status > ------ ---- ---- ------ > nodea > 1 Online, Local > nodeb 2 Offline > > Then when nodeb starts again, I try to start cman there to join the > cluster... 
but it again fences "nodea": > [root at nodeb ~]# clustat > Could not connect to CMAN: No such file or directory > [root at nodeb ~]# service cman start > Starting cluster: > Loading modules... done > Mounting configfs... done > Starting ccsd... done > Starting cman... done > Starting qdiskd... done > Starting daemons... done > Starting fencing... (wait for 25secs again) done > [ OK ] > "nodea" gets rebooted: > [root at nodea ~]# > Broadcast message from root (Thu Apr 15 18:58:40 2010): > > The system is going down for system halt NOW! > > Apr 15 18:57:31 nodeb openais[11789]: [CLM ] Members Joined: > Apr 15 18:57:31 nodeb openais[11789]: [CLM ] r(0) ip(10.192.16.44) > Apr 15 18:57:31 nodeb openais[11789]: [SYNC ] This node is within > the primary component and will provide service. > Apr 15 18:57:31 nodeb openais[11789]: [TOTEM] entering OPERATIONAL state. > Apr 15 18:57:31 nodeb openais[11789]: [CMAN ] quorum regained, > resuming activity > Apr 15 18:57:31 nodeb openais[11789]: [CLM ] got nodejoin message > 10.192.16.44 > Apr 15 18:57:34 nodeb qdiskd[10323]: Quorum Daemon Initializing > Apr 15 18:57:34 nodeb qdiskd[10323]: Initialization failed > Apr 15 18:58:42 nodeb fenced[11816]: nodea not a cluster member > after 25 sec post_join_delay > Apr 15 18:58:42 nodeb fenced[11816]: fencing node "nodea" > Apr 15 18:58:54 nodeb fenced[11816]: fence "nodea" success > > And I can't get the two nodes, joining the cluster... > I guess I'm missing something at the cluster.conf file??? I can't > find what I'm making wrong. > > Thanks for any help! > > Alex Re-- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: From are at gmx.es Fri Apr 16 15:35:35 2010 From: are at gmx.es (Alex Re) Date: Fri, 16 Apr 2010 17:35:35 +0200 Subject: [Linux-cluster] Two node cluster, start CMAN fence the other node In-Reply-To: <036B68E61A28CA49AC2767596576CD596B7F2669AC@GVW1113EXC.americas.hpqcorp.net> References: <4BC7473D.9060809@gmx.es> <64D0546C5EBBD147B75DE133D798665F055D8F80@hugo.eprize.local> <4BC8490F.4090704@gmx.es> <036B68E61A28CA49AC2767596576CD596B7F2669AC@GVW1113EXC.americas.hpqcorp.net> Message-ID: <4BC883C7.7060302@gmx.es> Hi Chris, for the switches ports stuff check out this url: http://www.openais.org/doku.php?id=faq:cisco_switches We have finally configured an internal (private) VLAN joining there one NIC of each blade server. Now all cluster related traffic goes through those interfaces (eth2 at both servers in our case), including the traffic generated by the lock_dlm of the GFS2 filesystem, just created. To check multicast connectivity, these are two very useful commands, "nc -u -vvn -z 5405" to generate some multicast udp traffic and "tcpdump -i eth2 ether multicast" to check it from the other node. (eth2 in my particular case, of course). I have been playing a little with the lock_dlm, but here is how my cluster.conf looks now: Next thing to add... I'm going to play a little with the quorum devices. Hope it helps! Alex On 04/16/2010 05:00 PM, Jankowski, Chris wrote: > eparate the cluster interconne -------------- next part -------------- An HTML attachment was scrubbed... 
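(The cluster.conf the poster refers to was stripped by the archive. The sketch below is stitched together from the attribute fragments that survive in the quoted replies further on -- two_node, the 239.0.0.1 multicast address, the eth2 interface binding, post_join_delay of 25 and a plock_rate_limit of 500 -- with placeholder iLO addresses; treat it as illustrative, not a verbatim restoration.)

    <?xml version="1.0"?>
    <cluster config_version="1" name="VCluster">
      <cman expected_votes="1" two_node="1">
        <multicast addr="239.0.0.1"/>
      </cman>
      <fence_daemon post_join_delay="25"/>
      <clusternodes>
        <clusternode name="nodea" nodeid="1" votes="1">
          <multicast addr="239.0.0.1" interface="eth2"/>
          <fence>
            <method name="1"><device name="nodeaILO"/></method>
          </fence>
        </clusternode>
        <clusternode name="nodeb" nodeid="2" votes="1">
          <multicast addr="239.0.0.1" interface="eth2"/>
          <fence>
            <method name="1"><device name="nodebILO"/></method>
          </fence>
        </clusternode>
      </clusternodes>
      <fencedevices>
        <!-- iLO addresses below are placeholders -->
        <fencedevice agent="fence_ilo" ipaddr="10.0.0.101" login="user" name="nodeaILO" passwd="hp"/>
        <fencedevice agent="fence_ilo" ipaddr="10.0.0.102" login="user" name="nodebILO" passwd="hp"/>
      </fencedevices>
      <dlm plock_rate_limit="500"/>
      <gfs_controld plock_rate_limit="500"/>
      <rm/>
    </cluster>

The nc/tcpdump pair mentioned above is then enough to confirm that traffic to 239.0.0.1 actually arrives on eth2 of the peer node.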
URL: From jose.neto at liber4e.com Fri Apr 16 16:15:10 2010 From: jose.neto at liber4e.com (jose nuno neto) Date: Fri, 16 Apr 2010 16:15:10 -0000 (GMT) Subject: [Linux-cluster] fs.sh status weird timings( disregard ) In-Reply-To: References: Message-ID: ignore this previous request I was monitoring only the tag on messages and there's a repeated info about the tag timings are ok for this script > Hellos > > I'm doing some tests with SAN/LVM/FS and noticed that the status on the FS > resources aren't much accurate, the interval varies a lot. > I placed a logger command on the fs.sh just before the status workings and > this is what I get on /var/log/messages: ( in the meantime I noticed > there's a new rmanager release on rhel 5.5 and will try it out ) > > Apr 16 12:23:17 dc2-x6250-a logger: FS Check neto > Apr 16 12:27:47 dc2-x6250-a logger: FS Check neto > Apr 16 12:34:07 dc2-x6250-a logger: FS Check neto > Apr 16 12:44:07 dc2-x6250-a logger: FS Check neto > Apr 16 13:03:07 dc2-x6250-a logger: FS Check neto > Apr 16 13:06:17 dc2-x6250-a logger: FS Check neto > Apr 16 13:20:17 dc2-x6250-a logger: FS Check neto > Apr 16 13:20:47 dc2-x6250-a logger: FS Check neto > Apr 16 13:21:07 dc2-x6250-a logger: FS Check neto > Apr 16 13:21:17 dc2-x6250-a logger: FS Check neto > Apr 16 13:21:37 dc2-x6250-a logger: FS Check neto > Apr 16 13:22:27 dc2-x6250-a logger: FS Check neto > Apr 16 13:22:37 dc2-x6250-a logger: FS Check neto > Apr 16 13:23:07 dc2-x6250-a logger: FS Check neto > Apr 16 13:26:46 dc2-x6250-a logger: FS Check neto > Apr 16 13:28:27 dc2-x6250-a logger: FS Check neto > Apr 16 13:29:00 dc2-x6250-a logger: FS Check neto > Apr 16 13:29:17 dc2-x6250-a logger: FS Check neto > Apr 16 13:29:39 dc2-x6250-a logger: FS Check neto > Apr 16 13:30:16 dc2-x6250-a logger: FS Check neto > Apr 16 13:30:27 dc2-x6250-a logger: FS Check neto > Apr 16 13:30:36 dc2-x6250-a logger: FS Check neto > Apr 16 13:30:37 dc2-x6250-a logger: FS Check neto > Apr 16 13:30:57 dc2-x6250-a logger: FS Check neto > Apr 16 13:31:59 dc2-x6250-a logger: FS Check neto > Apr 16 13:33:00 dc2-x6250-a logger: FS Check neto > Apr 16 13:33:06 dc2-x6250-a logger: FS Check neto > Apr 16 13:33:16 dc2-x6250-a logger: FS Check neto > > > > ################################################# > fs.sh > > status|monitor) > logger "FS Check neto" > > > ################################################# > These are the resource/service been used: > > self_fence="1"/> > force_unmount="1" fsid="2073" fstype="ext3" > mountpoint="/app/oracle/jura/archive" name="ora_jura_arch" > self_fence="1"/> > force_unmount="1" fsid="2074" fstype="ext3" > mountpoint="/app/oracle/jura/redo" name="ora_jura_redo" self_fence="1"/> > force_fsck="0" force_unmount="1" fsid="2075" fstype="ext3" > mountpoint="/app/oracle/jura/data" name="ora_jura_data" self_fence="1"/> > force_unmount="1" fsid="2076" fstype="ext3" > mountpoint="/app/oracle/jura/export" name="ora_jura_export" > self_fence="1"/> > > recovery="relocate"> > > > > > > > > > > > > ########################################################################### > Timings from fs.sh script > > Agent: fs.sh > Flags: init_on_add destroy_on_delete > Attributes: > name [ primary ] > mountpoint [ unique required ] > device [ unique required ] > fstype > force_unmount > quick_status > self_fence > nfslock [ inherit ] default="nfslock" > fsid > force_fsck > options > Actions: > start > Timeout (hint): 900 seconds > stop > Timeout (hint): 30 seconds > status > Timeout (hint): 10 seconds > Check Interval: 60 seconds > monitor > 
Timeout (hint): 10 seconds > Check Interval: 60 seconds > status > Timeout (hint): 30 seconds > OCF Check Depth (status/monitor): 10 seconds > Check Interval: 30 seconds > monitor > Timeout (hint): 30 seconds > OCF Check Depth (status/monitor): 10 seconds > Check Interval: 30 seconds > status > Timeout (hint): 30 seconds > OCF Check Depth (status/monitor): 20 seconds > Check Interval: 60 seconds > monitor > Timeout (hint): 30 seconds > OCF Check Depth (status/monitor): 20 seconds > Check Interval: 60 seconds > meta-data > Timeout (hint): 5 seconds > verify-all > Timeout (hint): 5 seconds > Explicitly defined child resource types: > fs [ startlevel = 1 stoplevel = 3 ] > clusterfs [ startlevel = 1 stoplevel = 3 ] > nfsexport [ startlevel = 3 stoplevel = 1 ] > From Chris.Jankowski at hp.com Sat Apr 17 00:39:54 2010 From: Chris.Jankowski at hp.com (Jankowski, Chris) Date: Sat, 17 Apr 2010 00:39:54 +0000 Subject: [Linux-cluster] Two node cluster, start CMAN fence the other node In-Reply-To: <4BC883C7.7060302@gmx.es> References: <4BC7473D.9060809@gmx.es> <64D0546C5EBBD147B75DE133D798665F055D8F80@hugo.eprize.local> <4BC8490F.4090704@gmx.es> <036B68E61A28CA49AC2767596576CD596B7F2669AC@GVW1113EXC.americas.hpqcorp.net> <4BC883C7.7060302@gmx.es> Message-ID: <036B68E61A28CA49AC2767596576CD596B7F266A0D@GVW1113EXC.americas.hpqcorp.net> Alex, 1. Thank you very much. The Cisco setup is very useful and the commands for testing multicast as well. 2 Loking at your cluster.conf, I would have thought that any limits on dlm and gfs lock rates are counterproductive in the days of multicore CPUs and GbE. They should be unlimited in my opinion. Under high load the limiting factor will be saturation of one core by gfs control daemon. Regards, Chris Jankowski ________________________________ From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Alex Re Sent: Saturday, 17 April 2010 01:36 To: linux clustering Subject: Re: [Linux-cluster] Two node cluster, start CMAN fence the other node Hi Chris, for the switches ports stuff check out this url: http://www.openais.org/doku.php?id=faq:cisco_switches We have finally configured an internal (private) VLAN joining there one NIC of each blade server. Now all cluster related traffic goes through those interfaces (eth2 at both servers in our case), including the traffic generated by the lock_dlm of the GFS2 filesystem, just created. To check multicast connectivity, these are two very useful commands, "nc -u -vvn -z 5405" to generate some multicast udp traffic and "tcpdump -i eth2 ether multicast" to check it from the other node. (eth2 in my particular case, of course). I have been playing a little with the lock_dlm, but here is how my cluster.conf looks now: Next thing to add... I'm going to play a little with the quorum devices. Hope it helps! Alex On 04/16/2010 05:00 PM, Jankowski, Chris wrote: eparate the cluster interconne -------------- next part -------------- An HTML attachment was scrubbed... URL: From celsowebber at yahoo.com Mon Apr 19 12:44:12 2010 From: celsowebber at yahoo.com (Celso K. 
Webber) Date: Mon, 19 Apr 2010 05:44:12 -0700 (PDT) Subject: [Linux-cluster] Two node cluster, start CMAN fence the other node In-Reply-To: <036B68E61A28CA49AC2767596576CD596B7F266A0D@GVW1113EXC.americas.hpqcorp.net> References: <4BC7473D.9060809@gmx.es> <64D0546C5EBBD147B75DE133D798665F055D8F80@hugo.eprize.local> <4BC8490F.4090704@gmx.es> <036B68E61A28CA49AC2767596576CD596B7F2669AC@GVW1113EXC.americas.hpqcorp.net> <4BC883C7.7060302@gmx.es> <036B68E61A28CA49AC2767596576CD596B7F266A0D@GVW1113EXC.americas.hpqcorp.net> Message-ID: <474171.4931.qm@web111718.mail.gq1.yahoo.com> Hello Chris, Regarding number 2 below, in fact the default limits are "100", so Alex's configuration of "500" should increase the performance. Of couse there can be configured for unlimited, but this is something one should decide by him(her)self according to the environment. Please see this link: http://www.linuxdynasty.org/howto-increase-gfs2-performance-in-a-cluster.html Regards, Celso. ________________________________ From: "Jankowski, Chris" To: linux clustering Sent: Fri, April 16, 2010 9:39:54 PM Subject: Re: [Linux-cluster] Two node cluster, start CMAN fence the other node Alex, 1. Thank you very much. The Cisco setup is very useful and the commands for testing multicast as well. 2 Loking at your cluster.conf, I would have thought that any limits on dlm and gfs lock rates are counterproductive in the days of multicore CPUs and GbE. They should be unlimited in my opinion. Under high load the limiting factor will be saturation of one core by gfs control daemon. Regards, Chris Jankowski ________________________________ From: linux-cluster-bounces at redhat.com > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Alex > Re >Sent: Saturday, 17 April 2010 01:36 >To: linux > clustering >Subject: Re: [Linux-cluster] Two node cluster, start CMAN > fence the other node > >Hi Chris, > >for the switches ports stuff check out this > url: >http://www.openais.org/doku.php?id=faq:cisco_switches > >We > have finally configured an internal (private) VLAN joining there one NIC of > each blade server. Now all cluster related traffic goes through those > interfaces (eth2 at both servers in our case), including the traffic generated > by the lock_dlm of the GFS2 filesystem, just created. > >To check > multicast connectivity, these are two very useful commands, "nc -u -vvn -z > 5405" to generate some multicast udp traffic and "tcpdump > -i eth2 ether multicast" to check it from the other node. (eth2 in my > particular case, of course). > >I have been playing a little with the > lock_dlm, but here is how my cluster.conf looks now: > > version="1.0"?> > name="VCluster"> > post_join_delay="25"/> > > > name="nodeaint" nodeid="1" votes="1"> > > interface="eth2"/> > > > > name="1"> > > name="nodeaiLO"/> > > > > > > > > votes="1"> > > > > > > name="1"> > > name="nodebiLO"/> > > > > > > > > > two_node="1"> > addr="239.0.0.1"/> > > > > > login="user" name="nodeaiLO" passwd="hp"/> > > login="user" name="nodebiLO" passwd="hp"/> > > > > > > > > > > plock_rate_limit="500"/> > plock_rate_limit="500"/> > > >Next > thing to add... I'm going to play a little with the quorum > devices. > >Hope it helps! > >Alex > >On 04/16/2010 05:00 PM, > Jankowski, Chris wrote: > >eparate the cluster >>interconne -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From jose.neto at liber4e.com Mon Apr 19 14:07:39 2010 From: jose.neto at liber4e.com (jose nuno neto) Date: Mon, 19 Apr 2010 14:07:39 -0000 (GMT) Subject: [Linux-cluster] fs.sh status hangs after device failures Message-ID: Hellos Im testing SAN under Multipath failures and founding a behavior on the fs.sh that is not what I wanned. On simulating a SAN failure either with portdown on the san switch or on the OS ( echo offline > /sys/block/$DEVICE/device/state ) the fs.sh status script doesn't give back an error. I looked at the script and think it hangs on the ls or touch test (depends on timings ) In fact if I issue an ls/touch on the failed mountpoints it hangs forever. If I fail the devices with /sys/block/$DEVICE/device/delete then the touch test returns an error and service switches. I found on redhat doc a reference for a parameter: remove_on_dev_loss http://www.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/html/Online_Storage_Reconfiguration_Guide/modifying-link-loss-behavior.html I set it echo 1 > /sys/module/scsi_transport_fc/parameters/remove_on_dev_loss but didn't notice any changes Any nice suggestions ? Thanks Jose From Chris.Jankowski at hp.com Mon Apr 19 14:26:23 2010 From: Chris.Jankowski at hp.com (Jankowski, Chris) Date: Mon, 19 Apr 2010 14:26:23 +0000 Subject: [Linux-cluster] Two node cluster, start CMAN fence the other node In-Reply-To: <474171.4931.qm@web111718.mail.gq1.yahoo.com> References: <4BC7473D.9060809@gmx.es> <64D0546C5EBBD147B75DE133D798665F055D8F80@hugo.eprize.local> <4BC8490F.4090704@gmx.es> <036B68E61A28CA49AC2767596576CD596B7F2669AC@GVW1113EXC.americas.hpqcorp.net> <4BC883C7.7060302@gmx.es> <036B68E61A28CA49AC2767596576CD596B7F266A0D@GVW1113EXC.americas.hpqcorp.net> <474171.4931.qm@web111718.mail.gq1.yahoo.com> Message-ID: <036B68E61A28CA49AC2767596576CD596B7F446518@GVW1113EXC.americas.hpqcorp.net> Celso, What would this limit buy you? Honestly, I cannot see any rationale for this limit on a modern multicore, multi CPU server. Note that there is a real *physical* limit that will activate as the load increases and that is one core being fully utilised by gfs control daemon. Even with this core saturated by the gfs control daemon the cluster works just fine and processes > 5,000 IO/s. The limit made sense when you had 300 MHz single core, single CPU servers. You could kill a server, if you were not careful. Regards, Chris Jankowski ________________________________ From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Celso K. Webber Sent: Monday, 19 April 2010 22:44 To: linux clustering Subject: Re: [Linux-cluster] Two node cluster, start CMAN fence the other node Hello Chris, Regarding number 2 below, in fact the default limits are "100", so Alex's configuration of "500" should increase the performance. Of couse there can be configured for unlimited, but this is something one should decide by him(her)self according to the environment. Please see this link: http://www.linuxdynasty.org/howto-increase-gfs2-performance-in-a-cluster.html Regards, Celso. ________________________________ From: "Jankowski, Chris" To: linux clustering Sent: Fri, April 16, 2010 9:39:54 PM Subject: Re: [Linux-cluster] Two node cluster, start CMAN fence the other node Alex, 1. Thank you very much. The Cisco setup is very useful and the commands for testing multicast as well. 2 Loking at your cluster.conf, I would have thought that any limits on dlm and gfs lock rates are counterproductive in the days of multicore CPUs and GbE. 
They should be unlimited in my opinion. Under high load the limiting factor will be saturation of one core by gfs control daemon. Regards, Chris Jankowski ________________________________ From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Alex Re Sent: Saturday, 17 April 2010 01:36 To: linux clustering Subject: Re: [Linux-cluster] Two node cluster, start CMAN fence the other node Hi Chris, for the switches ports stuff check out this url: http://www.openais.org/doku.php?id=faq:cisco_switches We have finally configured an internal (private) VLAN joining there one NIC of each blade server. Now all cluster related traffic goes through those interfaces (eth2 at both servers in our case), including the traffic generated by the lock_dlm of the GFS2 filesystem, just created. To check multicast connectivity, these are two very useful commands, "nc -u -vvn -z 5405" to generate some multicast udp traffic and "tcpdump -i eth2 ether multicast" to check it from the other node. (eth2 in my particular case, of course). I have been playing a little with the lock_dlm, but here is how my cluster.conf looks now: Next thing to add... I'm going to play a little with the quorum devices. Hope it helps! Alex On 04/16/2010 05:00 PM, Jankowski, Chris wrote: eparate the cluster interconne -------------- next part -------------- An HTML attachment was scrubbed... URL: From lhh at redhat.com Mon Apr 19 14:52:30 2010 From: lhh at redhat.com (Lon Hohberger) Date: Mon, 19 Apr 2010 10:52:30 -0400 Subject: [Linux-cluster] fs.sh status hangs after device failures In-Reply-To: References: Message-ID: <1271688750.11980.6.camel@localhost.localdomain> On Mon, 2010-04-19 at 14:07 +0000, jose nuno neto wrote: > Hellos > > Im testing SAN under Multipath failures and founding a behavior on the > fs.sh that is not what I wanned. > > On simulating a SAN failure either with portdown on the san switch or on > the OS ( echo offline > /sys/block/$DEVICE/device/state ) the fs.sh status > script doesn't give back an error. > > I looked at the script and think it hangs on the ls or touch test (depends > on timings ) > In fact if I issue an ls/touch on the failed mountpoints it hangs forever. Set multipath configuration to "no_path_retry fail" > If I fail the devices with /sys/block/$DEVICE/device/delete then the touch > test returns an error and service switches. Right. -- Lon From lhh at redhat.com Mon Apr 19 14:54:58 2010 From: lhh at redhat.com (Lon Hohberger) Date: Mon, 19 Apr 2010 14:54:58 +0000 Subject: [Linux-cluster] fence_ilo halt instead reboot In-Reply-To: References: Message-ID: <1271688898.11980.8.camel@localhost.localdomain> On Wed, 2010-04-14 at 19:07 +0200, ESGLinux wrote: > Hi All, > > > I?m configuring a two node cluster (they are HP ProLiant DL380 G5) and > I have configured the fence nodes this way: > > > login="Administrator" name="ILONODE2" passwd="xxxx"/> > > > the problem is that I have run the command fence_node NODE2 and I have > seen the halt message and it hasn?t restarted > It should be 'immediate power off'; sounds like you need to disable acpid. 
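A minimal sketch of what disabling acpid usually involves on RHEL 5-era nodes (assuming the stock acpid init script is what is turning the iLO power request into a graceful shutdown; run on every cluster node):

    # stop acpid so a fence request becomes an immediate power off,
    # not an ACPI power-button event that the OS handles as a clean halt
    service acpid stop
    # keep it disabled across reboots
    chkconfig acpid off
    # confirm it is off in all runlevels
    chkconfig --list acpid

With acpid out of the way, the fence action should result in an immediate power off rather than the halt described above.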
Also, take a look at: https://bugzilla.redhat.com/show_bug.cgi?id=507514 -- Lon From lhh at redhat.com Mon Apr 19 15:04:08 2010 From: lhh at redhat.com (Lon Hohberger) Date: Mon, 19 Apr 2010 11:04:08 -0400 Subject: [Linux-cluster] forcefully taking over a service from another node, kdump In-Reply-To: References: Message-ID: <1271689448.11980.15.camel@localhost.localdomain> On Wed, 2010-04-14 at 14:34 -0400, My LinuxHAList wrote: > > === Working on a new solution === > > > I'm working on a solution for this by a kdump_pre script. > When node1 panic'ed, before kdumping, it would contact node2 so that > node2 will attempt to take over the service. > > > At node2, I found running at node1 and issue: > clusvcadm -r > > > Because of node1's state (it is kdumping), the command just hangs and > it did not manage to cut down the service down time. > > > What can I do at node2 to forcefully take over the service from node1 > after node2 is contacted by node1 at kdump_pre stage ? > There's a bugzilla open about this -- you should check out https://bugzilla.redhat.com/show_bug.cgi?id=461948 There's even a design; just no code at this point. -- Lon From jose.neto at liber4e.com Mon Apr 19 15:30:39 2010 From: jose.neto at liber4e.com (jose nuno neto) Date: Mon, 19 Apr 2010 15:30:39 -0000 (GMT) Subject: [Linux-cluster] fs.sh status hangs after device failures In-Reply-To: <1271688750.11980.6.camel@localhost.localdomain> References: <1271688750.11980.6.camel@localhost.localdomain> Message-ID: <15d7b0af5e33505f5f9100f36a10fa5f.squirrel@fela.liber4e.com> > On Mon, 2010-04-19 at 14:07 +0000, jose nuno neto wrote: >> Hellos >> >> Im testing SAN under Multipath failures and founding a behavior on the >> fs.sh that is not what I wanned. >> >> On simulating a SAN failure either with portdown on the san switch or on >> the OS ( echo offline > /sys/block/$DEVICE/device/state ) the fs.sh >> status >> script doesn't give back an error. >> >> I looked at the script and think it hangs on the ls or touch test >> (depends >> on timings ) >> In fact if I issue an ls/touch on the failed mountpoints it hangs >> forever. > > Set multipath configuration to "no_path_retry fail" I have it: blacklist { wwid SSun_VOL0_266DCF4A wwid SSun_VOL0_5875CF4A devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*" devnode "^hd[a-z]" } defaults { user_friendly_names yes bindings_file /etc/multipath/bindings } devices { device { vendor "HITACHI" product "OPEN-V" path_grouping_policy multibus failback immediate no_path_retry fail } device { vendor "IET" product "VIRTUAL-DISK" path_checker tur path_grouping_policy failover failback immediate no_path_retry fail } } > > >> If I fail the devices with /sys/block/$DEVICE/device/delete then the >> touch >> test returns an error and service switches. > > Right. > > -- Lon > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From antoine.samson at etiam.com Mon Apr 19 16:06:35 2010 From: antoine.samson at etiam.com (Antoine Samson) Date: Mon, 19 Apr 2010 18:06:35 +0200 Subject: [Linux-cluster] NFS timeout once service has failover Message-ID: <4BCC7F8B.2050805@etiam.com> I have a NFS clustered two nodes service using a VIP. NFS clients are setup with following options: rw,timeo=10,retrans=3,retry=1,soft,intr When NFS service triggers from one node to another, NFS clients can no longer acces NFS mount (sometimes it just comes back after a long period of time, much more than 90s NFS gracefull period, sometimes not). 
NFS clients reports: xxxxx kernel: nfs: server xxxxxxxxx not responding, timed out tcpdump shows that NFS server is responding (so there should not be any arp problem, ping comes back up as soon as NFS service has been started on new node) Clients and servers are 2.6.18-164.el5 #1 SMP Tue Aug 18 15:51:48 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux Thanks for your help, -- Antoine From deJongm at TEOCO.com Mon Apr 19 18:21:35 2010 From: deJongm at TEOCO.com (de Jong, Mark-Jan) Date: Mon, 19 Apr 2010 14:21:35 -0400 Subject: [Linux-cluster] Ondisk and fsck bitmaps differ at block XXXXXX Message-ID: <5E3DCAE61C95FA4397679425D7275D26382C8B18@HQ-MX03.us.teo.earth> Hello, After running our two node cluster concurrently for a number of days, I wanted to take a look at the health of the underlying GFS2 file system. And although we didn't run into any problems during or testing, I was surprised to still see hundreds of thousands, if not millions, of the following message when running and gfs2_fsck on the file system: Ondisk and fsck bitmaps differ at block 86432030 (0x526d91e) Ondisk status is 1 (Data) but FSCK thinks it should be 0 (Free) Metadata type is 0 (free) Succeeded. I assume this is not good, although processes reading and writing to/from the filesystem seem to be running smoothly. I ran the fsck days after the last write operation on the cluster. I'm currently running the following on Centos 5.4: kernel-2.6.18-164.15.1.el5 cman-2.0.115-1.el5_4.9 lvm2-cluster-2.02.46-8.el5_4.1 The GFS2 file system is running on top of a clustered LVM partition. Any input would be greatly appreciated. Thanks, Mark de Jong -------------- next part -------------- An HTML attachment was scrubbed... URL: From linux at alteeve.com Mon Apr 19 18:48:21 2010 From: linux at alteeve.com (Digimer) Date: Mon, 19 Apr 2010 14:48:21 -0400 Subject: [Linux-cluster] Announce: Node Assassin - Open hardware cluster fence device Message-ID: <4BCCA575.60806@alteeve.com> Hi Clustering folks! I wanted to announce a new open hardware, open source cluster fence device: Node Assassin - http://nodeassassin.org After four months and a lot of help from friends at http://hacklab.to, the first version is done and ready for the lime light! (warts and all) It fully implements the FenceAgentAPI, including independent sensing of a Node's on/off status. There is a simple installer and uninstaller that has been tested on CentOS 5.x and should work on any system supporting Red Hat style clustering. It is built as a shield for the Arduino development platform and uses entirely "off the shelf" parts that can be ordered online or bought at most electronic parts shops. The software and build instructions are fully documented. The goal of this project is to let people build very inexpensive clusters on commodity hardware. The current version supports four nodes. If you want more nodes, simply build a second Node Assassin! The fence agent supports multiple simultaneous Node Assassins. It would also work wonderfully as a secondary fence device. This is my first "official" open source project, so I would love to hear some feedback even if you don't plan to use it yourself. :) If you don't want to build your own, I am working with an embedded systems engineer in the hope of having pre-built units available in the next few months. They will support 8 to 64 nodes each depending on the model. If you think you would be interested in this, please let me know. Whether these see the light of day or not largely depends on the feedback I get. 
Thanks for your time reading this! -- Digimer E-Mail: linux at alteeve.com AN!Whitepapers: http://alteeve.com Node Assassin: http://nodeassassin.org From rpeterso at redhat.com Mon Apr 19 18:52:36 2010 From: rpeterso at redhat.com (Bob Peterson) Date: Mon, 19 Apr 2010 14:52:36 -0400 (EDT) Subject: [Linux-cluster] Ondisk and fsck bitmaps differ at block XXXXXX In-Reply-To: <878532044.660811271702883444.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> Message-ID: <2060378290.661521271703156141.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> ----- "de Jong, Mark-Jan" wrote: | Hello, | | After running our two node cluster concurrently for a number of days, | I wanted to take a look at the health of the underlying GFS2 file | system. | | | | And although we didn?t run into any problems during or testing, I was | surprised to still see hundreds of thousands, if not millions, of the | following message when running and gfs2_fsck on the file system: | | | | Ondisk and fsck bitmaps differ at block 86432030 (0x526d91e) | | Ondisk status is 1 (Data) but FSCK thinks it should be 0 (Free) | | Metadata type is 0 (free) | | Succeeded. | | | | I assume this is not good, although processes reading and writing | to/from the filesystem seem to be running smoothly. I ran the fsck | days after the last write operation on the cluster. | | | | I?m currently running the following on Centos 5.4: | | | | kernel-2.6.18-164.15.1.el5 | | cman-2.0.115-1.el5_4.9 | | lvm2-cluster-2.02.46-8.el5_4.1 | | | | The GFS2 file system is running on top of a clustered LVM partition. | | | | Any input would be greatly appreciated. | | | | Thanks, | | Mark de Jong Hi Mark, These messages might be due to (1) bugs in fsck.gfs2, (2) the gfs2 kernel module, or (3) corruption leftover from older versions of gfs2. I have a few recommendations: First, try running my latest and greatest "experimental" fsck.gfs2. It can be found on my people page at this location: http://people.redhat.com/rpeterso/Experimental/RHEL5.x/gfs2/fsck.gfs2 If you want to wait a day or two, I'll be posting another version there then because I've got an even better version I'm testing now. As for gfs2, we've fixed several bugs in 5.5 so you might want to look into moving up to 5.5 as well. Regards, Bob Peterson Red Hat File Systems From deJongm at TEOCO.com Mon Apr 19 19:48:12 2010 From: deJongm at TEOCO.com (de Jong, Mark-Jan) Date: Mon, 19 Apr 2010 15:48:12 -0400 Subject: [Linux-cluster] Ondisk and fsck bitmaps differ at block XXXXXX In-Reply-To: <2060378290.661521271703156141.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> References: <878532044.660811271702883444.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> <2060378290.661521271703156141.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com> Message-ID: <5E3DCAE61C95FA4397679425D7275D26382C8B2A@HQ-MX03.us.teo.earth> |-----Original Message----- |From: linux-cluster-bounces at redhat.com [mailto:linux-cluster- |bounces at redhat.com] On Behalf Of Bob Peterson |Sent: Monday, April 19, 2010 2:53 PM |To: linux clustering |Subject: Re: [Linux-cluster] Ondisk and fsck bitmaps differ at block |XXXXXX | |----- "de Jong, Mark-Jan" wrote: || Hello, || || After running our two node cluster concurrently for a number of days, || I wanted to take a look at the health of the underlying GFS2 file || system. 
|| || || || And although we didn?t run into any problems during or testing, I was || surprised to still see hundreds of thousands, if not millions, of the || following message when running and gfs2_fsck on the file system: || || || || Ondisk and fsck bitmaps differ at block 86432030 (0x526d91e) || || Ondisk status is 1 (Data) but FSCK thinks it should be 0 (Free) || || Metadata type is 0 (free) || || Succeeded. || || || || I assume this is not good, although processes reading and writing || to/from the filesystem seem to be running smoothly. I ran the fsck || days after the last write operation on the cluster. || || || || I?m currently running the following on Centos 5.4: || || || || kernel-2.6.18-164.15.1.el5 || || cman-2.0.115-1.el5_4.9 || || lvm2-cluster-2.02.46-8.el5_4.1 || || || || The GFS2 file system is running on top of a clustered LVM partition. || || || || Any input would be greatly appreciated. || || || || Thanks, || || Mark de Jong | |Hi Mark, | |These messages might be due to (1) bugs in fsck.gfs2, (2) the gfs2 |kernel |module, or (3) corruption leftover from older versions of gfs2. I have |a few recommendations: As for point 2, are you saying it may be a bug in the latest gfs2 kernel module shipped with 5.4? And point 3, this was a GFS2 file system created with, and used only by, the latest gfs2 utils/kernel module. |First, try running my latest and greatest "experimental" fsck.gfs2. |It can be found on my people page at this location: I just tried this and although I let the last fsck.gfs2 from gfs2-utils-0.1.62 run to completion, I got the following output from your latest: ./fsck.gfs2 -y /dev/store01/data01_shared Initializing fsck Validating Resource Group index. Level 1 RG check. (level 1 passed) RGs: Consistent: 9444 Inconsistent: 239 Fixed: 239 Total: 9683 Starting pass1 Pass1 complete Starting pass1b Pass1b complete Starting pass1c Pass1c completelete. Starting pass2 Pass2 complete Starting pass3 Pass3 complete Starting pass4 Pass4 complete Starting pass5 Pass5 complete The statfs file is wrong: Current statfs values: blocks: 3172545020 (0xbd1931fc) free: 3146073083 (0xbb8543fb) dinodes: 27575 (0x6bb7) Calculated statfs values: blocks: 3172545020 (0xbd1931fc) free: 3163737148 (0xbc92cc3c) dinodes: 9985 (0x2701) The statfs file was fixed. Writing changes to disk gfs2_fsck complete |http://people.redhat.com/rpeterso/Experimental/RHEL5.x/gfs2/fsck.gfs2 | |If you want to wait a day or two, I'll be posting another version there |then because I've got an even better version I'm testing now. | |As for gfs2, we've fixed several bugs in 5.5 so you might want to |look into moving up to 5.5 as well. | |Regards, | |Bob Peterson |Red Hat File Systems | Thanks, Mark From swap_project at yahoo.com Mon Apr 19 20:31:59 2010 From: swap_project at yahoo.com (Srija) Date: Mon, 19 Apr 2010 13:31:59 -0700 (PDT) Subject: [Linux-cluster] GFS in cluster In-Reply-To: <5E3DCAE61C95FA4397679425D7275D26382C8B2A@HQ-MX03.us.teo.earth> Message-ID: <750817.90372.qm@web112809.mail.gq1.yahoo.com> Hi, I have created a GFS filey system as shared between three nodes clusters. The file system is being mounted in the three nodes and I set the mount points in the /etc/fstab, Want to know how the cluster will keep the track of the GFS file system. How the fence/lock_dlm will work? Do i need to set the GFS in a service? If yes , what will be the resources under the service? Will be really appreciated if I get some document to proceed further. Thanks. 
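For reference, a GFS filesystem that is simply kept mounted on every node usually appears in /etc/fstab like this (the device path, mount point and options below are placeholders, not taken from the poster's setup):

    # /etc/fstab on each of the three nodes
    /dev/vg_shared/lv_gfs01  /data/gfs01  gfs  defaults,noatime,_netdev  0 0

On RHEL 5-style clusters the gfs (or gfs2) init script is what mounts and unmounts these fstab entries at boot and shutdown, after cman and clvmd are up, so it should be enabled on every node (chkconfig gfs on); an rgmanager service is only needed if you want the cluster to actively manage or relocate the mount.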
From pradhanparas at gmail.com Mon Apr 19 23:08:39 2010 From: pradhanparas at gmail.com (Paras pradhan) Date: Mon, 19 Apr 2010 18:08:39 -0500 Subject: [Linux-cluster] GFS in cluster In-Reply-To: <750817.90372.qm@web112809.mail.gq1.yahoo.com> References: <5E3DCAE61C95FA4397679425D7275D26382C8B2A@HQ-MX03.us.teo.earth> <750817.90372.qm@web112809.mail.gq1.yahoo.com> Message-ID: On Mon, Apr 19, 2010 at 3:31 PM, Srija wrote: > Hi, > > I have created a GFS filey system as shared between three nodes clusters. > The file system is being mounted in > the three nodes and I set the mount points in the /etc/fstab, > > Want to know how the cluster will keep the track of the GFS file system. > How the fence/lock_dlm will work? > Say If the GFS file system on any one node is unresponsive, cluster will fence the node. > > Do i need to set the GFS in a service? If yes , what will be the > resources under the service? > No you don't need to . /etc/fstab is fine. > > Will be really appreciated if I get some document to proceed further. > > Thanks. > > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From fdinitto at redhat.com Tue Apr 20 03:26:08 2010 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Tue, 20 Apr 2010 05:26:08 +0200 Subject: [Linux-cluster] Announce: Node Assassin - Open hardware cluster fence device In-Reply-To: <4BCCA575.60806@alteeve.com> References: <4BCCA575.60806@alteeve.com> Message-ID: <4BCD1ED0.1020106@redhat.com> On 4/19/2010 8:48 PM, Digimer wrote: > Hi Clustering folks! > > I wanted to announce a new open hardware, open source cluster fence > device: > > Node Assassin - http://nodeassassin.org > > After four months and a lot of help from friends at http://hacklab.to, > the first version is done and ready for the lime light! (warts and all) well.. congratulation!!! I have been looking forward to build something similar myself, but never found the time. > > It fully implements the FenceAgentAPI, including independent sensing > of a Node's on/off status. There is a simple installer and uninstaller > that has been tested on CentOS 5.x and should work on any system > supporting Red Hat style clustering. Do you think we can work out a merge strategy to be able to ship fence_na directly withing cluster releases? I?d love to see it integrated right away. > This is my first "official" open source project, so I would love to > hear some feedback even if you don't plan to use it yourself. :) I am no hw expert (not any more anyway..) but it looks nice. > > If you don't want to build your own, I am working with an embedded > systems engineer in the hope of having pre-built units available in the > next few months. They will support 8 to 64 nodes each depending on the > model. If you think you would be interested in this, please let me know. > Whether these see the light of day or not largely depends on the > feedback I get. Well if you get around to build them, please let me know. I am interested in a bunch of them ;) Cheers Fabio From linux at alteeve.com Tue Apr 20 03:35:12 2010 From: linux at alteeve.com (Digimer) Date: Mon, 19 Apr 2010 23:35:12 -0400 Subject: [Linux-cluster] Announce: Node Assassin - Open hardware cluster fence device In-Reply-To: <4BCD1ED0.1020106@redhat.com> References: <4BCCA575.60806@alteeve.com> <4BCD1ED0.1020106@redhat.com> Message-ID: <4BCD20F0.5050103@alteeve.com> On 10-04-19 11:26 PM, Fabio M. 
Di Nitto wrote: > On 4/19/2010 8:48 PM, Digimer wrote: >> Hi Clustering folks! >> >> I wanted to announce a new open hardware, open source cluster fence >> device: >> >> Node Assassin - http://nodeassassin.org >> >> After four months and a lot of help from friends at http://hacklab.to, >> the first version is done and ready for the lime light! (warts and all) > > well.. congratulation!!! I have been looking forward to build something > similar myself, but never found the time. Thank you! :) >> It fully implements the FenceAgentAPI, including independent sensing >> of a Node's on/off status. There is a simple installer and uninstaller >> that has been tested on CentOS 5.x and should work on any system >> supporting Red Hat style clustering. > > Do you think we can work out a merge strategy to be able to ship > fence_na directly withing cluster releases? I?d love to see it > integrated right away. I've been trying to get in touch with someone at Red Hat do accomplish this very thing. With luck, I will find the right person soon. I designed the software with Red Hat in mind, so hopefully it will be a simple task. >> This is my first "official" open source project, so I would love to >> hear some feedback even if you don't plan to use it yourself. :) > > I am no hw expert (not any more anyway..) but it looks nice. > >> >> If you don't want to build your own, I am working with an embedded >> systems engineer in the hope of having pre-built units available in the >> next few months. They will support 8 to 64 nodes each depending on the >> model. If you think you would be interested in this, please let me know. >> Whether these see the light of day or not largely depends on the >> feedback I get. > > Well if you get around to build them, please let me know. I am > interested in a bunch of them ;) > > Cheers > Fabio Thank you for the interest! We're hoping to have them ready before too long. If I know someone wants a bunch, it will motivate me to get it done that much sooner! -- Digimer E-Mail: linux at alteeve.com AN!Whitepapers: http://alteeve.com Node Assassin: http://nodeassassin.org From fdinitto at redhat.com Tue Apr 20 04:08:11 2010 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Tue, 20 Apr 2010 06:08:11 +0200 Subject: [Linux-cluster] Announce: Node Assassin - Open hardware cluster fence device In-Reply-To: <4BCD20F0.5050103@alteeve.com> References: <4BCCA575.60806@alteeve.com> <4BCD1ED0.1020106@redhat.com> <4BCD20F0.5050103@alteeve.com> Message-ID: <4BCD28AB.4010600@redhat.com> On 4/20/2010 5:35 AM, Digimer wrote: > On 10-04-19 11:26 PM, Fabio M. Di Nitto wrote: >>> It fully implements the FenceAgentAPI, including independent sensing >>> of a Node's on/off status. There is a simple installer and uninstaller >>> that has been tested on CentOS 5.x and should work on any system >>> supporting Red Hat style clustering. >> >> Do you think we can work out a merge strategy to be able to ship >> fence_na directly withing cluster releases? I?d love to see it >> integrated right away. > > I've been trying to get in touch with someone at Red Hat do accomplish > this very thing. With luck, I will find the right person soon. I > designed the software with Red Hat in mind, so hopefully it will be a > simple task. All relevant people were already in CC to my reply :) So there we are.. >>> If you don't want to build your own, I am working with an embedded >>> systems engineer in the hope of having pre-built units available in the >>> next few months. 
They will support 8 to 64 nodes each depending on the >>> model. If you think you would be interested in this, please let me know. >>> Whether these see the light of day or not largely depends on the >>> feedback I get. >> >> Well if you get around to build them, please let me know. I am >> interested in a bunch of them ;) >> >> Cheers >> Fabio > > Thank you for the interest! We're hoping to have them ready before too > long. If I know someone wants a bunch, it will motivate me to get it > done that much sooner! I?d be happy to get a 32/64 ports setup (if the price is reasonable). Mostly.. I?ll never have the time to build it myself. Nevermind my super-rusty skills to assemble electronic components ;) Cheers Fabio From bernardchew at gmail.com Tue Apr 20 04:14:08 2010 From: bernardchew at gmail.com (Bernard Chew) Date: Tue, 20 Apr 2010 12:14:08 +0800 Subject: [Linux-cluster] "openais[XXXX]" [TOTEM] Retransmit List: XXXXX" in /var/log/messages In-Reply-To: References: <1270659483.2550.809.camel@localhost.localdomain> Message-ID: > On Fri, Apr 9, 2010 at 4:51 PM, Bernard Chew wrote: >> On Thu, Apr 8, 2010 at 12:58 AM, Steven Dake wrote: >> On Wed, 2010-04-07 at 18:52 +0800, Bernard Chew wrote: >>> Hi all, >>> >>> I noticed "openais[XXXX]" [TOTEM] Retransmit List: XXXXX" repeated >>> every few hours in /var/log/messages. What does the message mean and >>> is it normal? Will this cause fencing to take place eventually? >>> >> This means your network environment dropped packets and totem is >> recovering them. ?This is normal operation, and in future versions such >> as corosync no notification is printed when recovery takes place. >> >> There is a bug, however, fixed in revision 2122 where if the last packet >> in the order is lost, and no new packets are unlost after it, the >> processor will enter a failed to receive state and trigger fencing. >> >> Regards >> -steve >>> Thank you in advance. >>> >>> Regards, >>> Bernard Chew >>> >>> -- >>> Linux-cluster mailing list >>> Linux-cluster at redhat.com >>> https://www.redhat.com/mailman/listinfo/linux-cluster >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> > > Thank you for the reply Steve! > > The cluster was running fine until last week where 3 nodes restarted > suddenly. I suspect fencing took place since all 3 servers restarted > at the same time but I couldn't find any fence related entries in the > log. I am guessing we hit the bug you mentioned? Will the log indicate > fencing has taken place with regards to the bug you mentioned? > > Also I noticed the message "kernel: clustat[28328]: segfault at > 0000000000000024 rip 0000003b31c75bc0 rsp 00007fff955cb098 error 4" > occasionally; is this related to the TOTEM message or they indicate > another problem? > > Regards, > Bernard Chew > Hi Steve. Just wondering if you can point me to the bug mentioned? Thank you. Regards, Bernard From linux at alteeve.com Tue Apr 20 04:30:16 2010 From: linux at alteeve.com (Digimer) Date: Tue, 20 Apr 2010 00:30:16 -0400 Subject: [Linux-cluster] Announce: Node Assassin - Open hardware cluster fence device In-Reply-To: <4BCD28AB.4010600@redhat.com> References: <4BCCA575.60806@alteeve.com> <4BCD1ED0.1020106@redhat.com> <4BCD20F0.5050103@alteeve.com> <4BCD28AB.4010600@redhat.com> Message-ID: <4BCD2DD8.6040904@alteeve.com> On 10-04-20 12:08 AM, Fabio M. Di Nitto wrote: > On 4/20/2010 5:35 AM, Digimer wrote: >> On 10-04-19 11:26 PM, Fabio M. 
Di Nitto wrote: > >>>> It fully implements the FenceAgentAPI, including independent sensing >>>> of a Node's on/off status. There is a simple installer and uninstaller >>>> that has been tested on CentOS 5.x and should work on any system >>>> supporting Red Hat style clustering. >>> >>> Do you think we can work out a merge strategy to be able to ship >>> fence_na directly withing cluster releases? I?d love to see it >>> integrated right away. >> >> I've been trying to get in touch with someone at Red Hat do accomplish >> this very thing. With luck, I will find the right person soon. I >> designed the software with Red Hat in mind, so hopefully it will be a >> simple task. > > All relevant people were already in CC to my reply :) So there we are.. Wonderful, thank you! (Hello guys! *waves*) >>>> If you don't want to build your own, I am working with an embedded >>>> systems engineer in the hope of having pre-built units available in the >>>> next few months. They will support 8 to 64 nodes each depending on the >>>> model. If you think you would be interested in this, please let me know. >>>> Whether these see the light of day or not largely depends on the >>>> feedback I get. >>> >>> Well if you get around to build them, please let me know. I am >>> interested in a bunch of them ;) >>> >>> Cheers >>> Fabio >> >> Thank you for the interest! We're hoping to have them ready before too >> long. If I know someone wants a bunch, it will motivate me to get it >> done that much sooner! > > I?d be happy to get a 32/64 ports setup (if the price is reasonable). > Mostly.. I?ll never have the time to build it myself. Nevermind my > super-rusty skills to assemble electronic components ;) > > Cheers > Fabio The biggest slow down will be getting safety certifications so that they are legal to sell. The first version we'll get out the door will be the 8 and 16 port versions. I am reluctant to invest in the 24+ until I know there is a market for it. Mainly because I am bank-rolling this myself and I want to make sure there is decent interest first. However, you can use multiple Node Assassins at the same time, so you would be able to scale out very easily. :) -- Digimer E-Mail: linux at alteeve.com AN!Whitepapers: http://alteeve.com Node Assassin: http://nodeassassin.org From swhiteho at redhat.com Tue Apr 20 08:48:40 2010 From: swhiteho at redhat.com (Steven Whitehouse) Date: Tue, 20 Apr 2010 09:48:40 +0100 Subject: [Linux-cluster] GFS in cluster In-Reply-To: <750817.90372.qm@web112809.mail.gq1.yahoo.com> References: <750817.90372.qm@web112809.mail.gq1.yahoo.com> Message-ID: <1271753320.2451.22.camel@localhost> Hi, On Mon, 2010-04-19 at 13:31 -0700, Srija wrote: > Hi, > > I have created a GFS filey system as shared between three nodes clusters. > The file system is being mounted in > the three nodes and I set the mount points in the /etc/fstab, > > Want to know how the cluster will keep the track of the GFS file system. > How the fence/lock_dlm will work? > > Do i need to set the GFS in a service? If yes , what will be the > resources under the service? > > Will be really appreciated if I get some document to proceed further. > > Thanks. > > You don't need to set GFS up as a service. Its automatically available on each node its mounted on. Have you seen the docs here?: http://www.redhat.com/docs/manuals/enterprise/ That should be enough to get you started, Steve. 
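If you want to watch how the cluster is tracking a mounted GFS filesystem, the cluster2 tools shipped with RHEL/CentOS 5 can show the fence, DLM and GFS groups directly (command names assume that tool set; output varies by version):

    cman_tool services   # lists the fence domain, DLM lockspaces and GFS mount groups
    group_tool ls        # similar view from groupd
    mount -t gfs         # shows the lock protocol (lock_dlm) and lock table per mount

When a node that has the filesystem mounted stops responding, fenced fences it and the surviving nodes replay its journal before continuing, which is where the fencing and lock_dlm behaviour described earlier in the thread comes in.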
From cmaiolino at redhat.com Tue Apr 20 12:26:24 2010 From: cmaiolino at redhat.com (Carlos Maiolino) Date: Tue, 20 Apr 2010 09:26:24 -0300 Subject: [Linux-cluster] Announce: Node Assassin - Open hardware cluster fence device In-Reply-To: <4BCD2DD8.6040904@alteeve.com> References: <4BCCA575.60806@alteeve.com> <4BCD1ED0.1020106@redhat.com> <4BCD20F0.5050103@alteeve.com> <4BCD28AB.4010600@redhat.com> <4BCD2DD8.6040904@alteeve.com> Message-ID: <20100420122623.GB2540@andromeda.usersys.redhat.com> On Tue, Apr 20, 2010 at 12:30:16AM -0400, Digimer wrote: > On 10-04-20 12:08 AM, Fabio M. Di Nitto wrote: > >On 4/20/2010 5:35 AM, Digimer wrote: > >>On 10-04-19 11:26 PM, Fabio M. Di Nitto wrote: > > > >>>> It fully implements the FenceAgentAPI, including independent sensing > >>>>of a Node's on/off status. There is a simple installer and uninstaller > >>>>that has been tested on CentOS 5.x and should work on any system > >>>>supporting Red Hat style clustering. > >>> > >>>Do you think we can work out a merge strategy to be able to ship > >>>fence_na directly withing cluster releases? I?d love to see it > >>>integrated right away. > >> > >>I've been trying to get in touch with someone at Red Hat do accomplish > >>this very thing. With luck, I will find the right person soon. I > >>designed the software with Red Hat in mind, so hopefully it will be a > >>simple task. > > > >All relevant people were already in CC to my reply :) So there we are.. > > Wonderful, thank you! (Hello guys! *waves*) > > >>>> If you don't want to build your own, I am working with an embedded > >>>>systems engineer in the hope of having pre-built units available in the > >>>>next few months. They will support 8 to 64 nodes each depending on the > >>>>model. If you think you would be interested in this, please let me know. > >>>>Whether these see the light of day or not largely depends on the > >>>>feedback I get. > >>> > >>>Well if you get around to build them, please let me know. I am > >>>interested in a bunch of them ;) > >>> > >>>Cheers > >>>Fabio > >> > >>Thank you for the interest! We're hoping to have them ready before too > >>long. If I know someone wants a bunch, it will motivate me to get it > >>done that much sooner! > > That's a nice project, and I'll want one too. Depends if you can ship it to Brazil :) > >I?d be happy to get a 32/64 ports setup (if the price is reasonable). > >Mostly.. I?ll never have the time to build it myself. Nevermind my > >super-rusty skills to assemble electronic components ;) > > > >Cheers > >Fabio > > The biggest slow down will be getting safety certifications so that > they are legal to sell. The first version we'll get out the door > will be the 8 and 16 port versions. I am reluctant to invest in the > 24+ until I know there is a market for it. Mainly because I am > bank-rolling this myself and I want to make sure there is decent > interest first. However, you can use multiple Node Assassins at the > same time, so you would be able to scale out very easily. 
:) > > -- > Digimer > E-Mail: linux at alteeve.com > AN!Whitepapers: http://alteeve.com > Node Assassin: http://nodeassassin.org > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -- --- Best Regards Carlos Eduardo Maiolino Software Maintenance Engineer Red Hat - Global Support Services From linux at alteeve.com Tue Apr 20 16:38:58 2010 From: linux at alteeve.com (Digimer) Date: Tue, 20 Apr 2010 12:38:58 -0400 Subject: [Linux-cluster] Announce: Node Assassin - Open hardware cluster fence device In-Reply-To: <20100420122623.GB2540@andromeda.usersys.redhat.com> References: <4BCCA575.60806@alteeve.com> <4BCD1ED0.1020106@redhat.com> <4BCD20F0.5050103@alteeve.com> <4BCD28AB.4010600@redhat.com> <4BCD2DD8.6040904@alteeve.com> <20100420122623.GB2540@andromeda.usersys.redhat.com> Message-ID: <4BCDD8A2.4030103@alteeve.com> On 10-04-20 08:26 AM, Carlos Maiolino wrote: > That's a nice project, and I'll want one too. Depends if you can ship it to Brazil :) Providing there is no export restrictions, I'd have no problem shipping anywhere in the world. I'll be shipping from Canada, but if you don't mind a slower shipment, the cost shouldn't be too high, either. So in short, yup! :) -- Digimer E-Mail: linux at alteeve.com AN!Whitepapers: http://alteeve.com Node Assassin: http://nodeassassin.org From lhh at redhat.com Tue Apr 20 17:27:15 2010 From: lhh at redhat.com (Lon Hohberger) Date: Tue, 20 Apr 2010 13:27:15 -0400 Subject: [Linux-cluster] Announce: Node Assassin - Open hardware cluster fence device In-Reply-To: <4BCCA575.60806@alteeve.com> References: <4BCCA575.60806@alteeve.com> Message-ID: <1271784435.378.23.camel@localhost.localdomain> On Mon, 2010-04-19 at 14:48 -0400, Digimer wrote: > Hi Clustering folks! > > I wanted to announce a new open hardware, open source cluster fence > device: > > Node Assassin - http://nodeassassin.org > > After four months and a lot of help from friends at > http://hacklab.to, the first version is done and ready for the lime > light! (warts and all) Ok, that's totally awesome. > This is my first "official" open source project, so I would love to > hear some feedback even if you don't plan to use it yourself. :) One of the problems with clustering is the fencing barrier to entry when shared data is at stake -- often it's high cost and not all hardware vendors resell them. -- Lon From jcasale at activenetwerx.com Tue Apr 20 17:41:42 2010 From: jcasale at activenetwerx.com (Joseph L. Casale) Date: Tue, 20 Apr 2010 17:41:42 +0000 Subject: [Linux-cluster] Announce: Node Assassin - Open hardware cluster fence device In-Reply-To: <1271784435.378.23.camel@localhost.localdomain> References: <4BCCA575.60806@alteeve.com> <1271784435.378.23.camel@localhost.localdomain> Message-ID: >One of the problems with clustering is the fencing barrier to entry when >shared data is at stake -- often it's high cost and not all hardware >vendors resell them. I'm surprised more people don't just fence with a managed switch, simple and cheap. 
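For anyone wondering what switch-based (fabric) fencing looks like in practice, a rough cluster.conf sketch using the SNMP IF-MIB agent might be as follows -- the device name, addresses and attribute names here are assumptions, so check the fence_ifmib man page or fence_ifmib -h for the exact option names in your fence-agents build:

    <fencedevice agent="fence_ifmib" name="switch1"
                 ipaddr="10.0.0.250" community="private" snmp_version="2c"/>
    ...
    <clusternode name="node1" nodeid="1">
        <fence>
            <method name="1">
                <!-- port = the switch interface this node uses -->
                <device name="switch1" port="Gi0/1"/>
            </method>
        </fence>
    </clusternode>

Note that this style of fencing only cuts the node off from the network or SAN; it still has to be power-cycled (or otherwise unfenced) by hand or by a second fence method before it can rejoin.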
From linux at alteeve.com Tue Apr 20 18:47:36 2010 From: linux at alteeve.com (Digimer) Date: Tue, 20 Apr 2010 14:47:36 -0400 Subject: [Linux-cluster] Announce: Node Assassin - Open hardware cluster fence device In-Reply-To: <1271784435.378.23.camel@localhost.localdomain> References: <4BCCA575.60806@alteeve.com> <1271784435.378.23.camel@localhost.localdomain> Message-ID: <4BCDF6C8.5060909@alteeve.com> On 10-04-20 01:27 PM, Lon Hohberger wrote: > On Mon, 2010-04-19 at 14:48 -0400, Digimer wrote: >> Hi Clustering folks! >> >> I wanted to announce a new open hardware, open source cluster fence >> device: >> >> Node Assassin - http://nodeassassin.org >> >> After four months and a lot of help from friends at >> http://hacklab.to, the first version is done and ready for the lime >> light! (warts and all) > > Ok, that's totally awesome. Thank you! :) >> This is my first "official" open source project, so I would love to >> hear some feedback even if you don't plan to use it yourself. :) > > One of the problems with clustering is the fencing barrier to entry when > shared data is at stake -- often it's high cost and not all hardware > vendors resell them. > > -- Lon That was exactly my barrier when I wanted to learn about clustering. I wasn't backed by a company, so I had to build my cluster using my own money. Fancy servers were just not an option. :) -- Digimer E-Mail: linux at alteeve.com AN!Whitepapers: http://alteeve.com Node Assassin: http://nodeassassin.org From kkovachev at varna.net Wed Apr 21 07:03:32 2010 From: kkovachev at varna.net (Kaloyan Kovachev) Date: Wed, 21 Apr 2010 10:03:32 +0300 Subject: [Linux-cluster] Announce: Node Assassin - Open hardware cluster fence device In-Reply-To: References: <4BCCA575.60806@alteeve.com> <1271784435.378.23.camel@localhost.localdomain> Message-ID: On Tue, 20 Apr 2010 17:41:42 +0000, "Joseph L. Casale" wrote: >>One of the problems with clustering is the fencing barrier to entry when >>shared data is at stake -- often it's high cost and not all hardware >>vendors resell them. > > I'm surprised more people don't just fence with a managed switch, simple > and > cheap. simple, but requires manual intervention to actually reboot the node, as the managed switch can only disable the fenced node's access to the cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From fdinitto at redhat.com Wed Apr 21 07:11:09 2010 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Wed, 21 Apr 2010 09:11:09 +0200 Subject: [Linux-cluster] Cluster 3.0.11 stable release Message-ID: <4BCEA50D.8050604@redhat.com> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 The cluster team and its community are proud to announce the 3.0.11 stable release from the STABLE3 branch. This release contains a few major bug fixes. We strongly recommend people to update their clusters. In order to build/run the 3.0.11 release you will need: - - corosync 1.2.1 - - openais 1.1.2 - - linux kernel 2.6.31 (only for GFS1 users) The new source tarball can be downloaded here: https://fedorahosted.org/releases/c/l/cluster/cluster-3.0.11.tar.bz2 To report bugs or issues: https://bugzilla.redhat.com/ Would you like to meet the cluster team or members of its community? Join us on IRC (irc.freenode.net #linux-cluster) and share your experience with other sysadministrators or power users. Thanks/congratulations to all people that contributed to achieve this great milestone. 
Happy clustering, Fabio Under the hood (from 3.0.10): Abhijith Das (2): libgfs2: fix build break caused by patch to bz 455300 gfs2_convert: Does not convert full gfs1 filesystems Bob Peterson (1): gfs2_fsck segfault when statfs system file is missing Christine Caulfield (1): cman: make libcman /dev/zero fd close-on-exec David Teigland (1): dlm_controld: don't log errors after disabling plocks Fabio M. Di Nitto (2): fence-agents: fix build with locales other than C fence_ilo_mp: fix release version Lon Hohberger (4): Revert "resource-agents: Kill correct PIDs during force_unmount" rgmanager: Kill processes correctly w/ force_unmount rgmanager: Allow spaces in fs.sh mount points rgmanager: Minor cleanups for file system agents Marek 'marx' Grac (3): fence_egenera: log file path should be absolute not relative fence_ilo_mp: Proper error message instead of python traceback fencing: Creating manual pages fails when default value is a list cman/lib/libcman.c | 1 + fence/agents/egenera/fence_egenera.pl | 6 +- fence/agents/ilo_mp/fence_ilo_mp.py | 50 ++- fence/agents/lib/fencing.py.py | 4 +- gfs2/convert/gfs2_convert.c | 106 ++++++ gfs2/edit/hexedit.c | 4 +- gfs2/edit/savemeta.c | 4 +- gfs2/fsck/fs_recovery.c | 2 - gfs2/fsck/fsck.h | 9 +- gfs2/fsck/initialize.c | 661 +++++++++++++++++++++++++++++++-- gfs2/fsck/main.c | 114 ------ gfs2/fsck/metawalk.c | 3 +- gfs2/fsck/metawalk.h | 2 + gfs2/fsck/pass1.c | 410 ++++++++++++++++----- gfs2/fsck/rgrepair.c | 59 +++- gfs2/libgfs2/fs_ops.c | 2 +- gfs2/libgfs2/libgfs2.h | 9 +- gfs2/libgfs2/misc.c | 45 +++ gfs2/libgfs2/ondisk.c | 6 +- gfs2/libgfs2/rgrp.c | 4 +- gfs2/libgfs2/structures.c | 26 +- gfs2/libgfs2/super.c | 27 ++- gfs2/mkfs/main_grow.c | 4 +- gfs2/mkfs/main_mkfs.c | 44 --- group/dlm_controld/cpg.c | 20 +- rgmanager/src/resources/clusterfs.sh | 135 +------ rgmanager/src/resources/fs.sh.in | 174 ++-------- rgmanager/src/resources/netfs.sh | 136 +------ scripts/fenceparse | 2 + 29 files changed, 1332 insertions(+), 737 deletions(-) -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.10 (MingW32) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iQIcBAEBAgAGBQJLzqULAAoJEFA6oBJjVJ+OpsQP/iEuX3geiumf4XcXSFK1qY0Q 66qOjalct/EbIAHCHhx0JsHi9w8kkJAcwsLSNFzS8agV/8mOC5QQyCj6xK23FzFe YwHgiWjDhwZ3gqgQYDCNqH+8TonM/+5dx34Xph0qyvSipu9KnsTsT68vXsl1G2hI eVjYprRBRjgtfsY/DhZFt1k/KI/u8g/RqD+Auprh+KYAajweFVMdE7zCA1SYxO5c s4dMMkRlY3KMLFl6OpxqQ1YQcKusm9Dx8Iz2vofLsUaoNH0l08r1gxH6KdW2PZ3v bt7Ne3+reIQmdSVf5nkRO0Y+aazm5QpArJ5ny/76ZYohhT0LT8FA0vHK0zQWZjgw THiT4wKa2cBWAUf2MBNVRS0sMZW1dyhl4oXnRqPMGNWADrWA7t/6lpZylTXxM43n vDvOHOpfGE4wxXoOzhBYQ5AwXomib7WGIFNzUS4saQp0MpOLBahllV/WCkuEa+nb XQ7hbtpu6BsV3Y4gkrMTa/qSlryycsHq9tElVuPr9QCcw7M+70H5AukVtEiM9uhT a8eslIp5RproDAephIgfzzbGfOinDzTcP/zDX3Nbhnso7hYaHn+wAQ+CxuWre3bI PwGQjIwb3ZmE+UOYIGfUDewp2TFbAU5fHgQhJi326Emlh66oqgyLO00+X8yNOl0e G0fKG4Pu4L2CjW7wYboJ =C0PA -----END PGP SIGNATURE----- From esggrupos at gmail.com Wed Apr 21 12:22:24 2010 From: esggrupos at gmail.com (ESGLinux) Date: Wed, 21 Apr 2010 14:22:24 +0200 Subject: [Linux-cluster] gfs+nfs+lucene, anyone had tried? Message-ID: Hi All, I?m mounting a cluster using NFS over GFS and I?m going to store a lucene index on it. There are two nodes that write in this index, and I?m worried about the index corruption. So anyone have implemented something like this? any problem I can find? Thanks in advance, ESG -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From jeff.sturm at eprize.com Wed Apr 21 13:02:52 2010 From: jeff.sturm at eprize.com (Jeff Sturm) Date: Wed, 21 Apr 2010 09:02:52 -0400 Subject: [Linux-cluster] gfs+nfs+lucene, anyone had tried? In-Reply-To: References: Message-ID: <64D0546C5EBBD147B75DE133D798665F055D903E@hugo.eprize.local> We use Lucene over GFS (no NFS), but the design of our application updates Lucene from only one node at a time. In general applications that utilize POSIX locking can handle concurrent updates safely on GFS even with multiple nodes. It wasn't clear to us whether Lucene supports this, however, and in your case NFS adds a layer to the mix. From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of ESGLinux Sent: Wednesday, April 21, 2010 8:22 AM To: linux clustering Subject: [Linux-cluster] gfs+nfs+lucene, anyone had tried? Hi All, I?m mounting a cluster using NFS over GFS and I?m going to store a lucene index on it. There are two nodes that write in this index, and I?m worried about the index corruption. So anyone have implemented something like this? any problem I can find? Thanks in advance, ESG -------------- next part -------------- An HTML attachment was scrubbed... URL: From swap_project at yahoo.com Wed Apr 21 13:42:01 2010 From: swap_project at yahoo.com (Srija) Date: Wed, 21 Apr 2010 06:42:01 -0700 (PDT) Subject: [Linux-cluster] GFS in cluster In-Reply-To: <1271753320.2451.22.camel@localhost> Message-ID: <402159.45856.qm@web112812.mail.gq1.yahoo.com> Thanks Paras and Steve. Yes the document you have mentioned , I already gone through it. I experiented on cluster the httpd service and few other serverices. And now , I am trying to build a GFS filesystem on which I will build the xen guests. This file system is connected to three nodes. So my question was... Anyway I am following whatever you both suggested and proceeding furter. Thanks again. --- On Tue, 4/20/10, Steven Whitehouse wrote: > From: Steven Whitehouse > Subject: Re: [Linux-cluster] GFS in cluster > To: "linux clustering" > Date: Tuesday, April 20, 2010, 4:48 AM > Hi, > > On Mon, 2010-04-19 at 13:31 -0700, Srija wrote: > > Hi, > > > >? I have created a GFS filey system? as > shared between three nodes clusters. > >? The file system is being mounted in > >? the three nodes and I set the mount points in > the /etc/fstab, > > > >? Want to know how the cluster will keep the track > of the GFS file system. > >? How the fence/lock_dlm will work? > > > >? Do i need to set the GFS in a service? If yes , > what will be the > >? resources under the service? > > > >? Will be really appreciated if I get some > document to proceed further. > > > >? Thanks. > > > > > You don't need to set GFS up as a service. Its > automatically available > on each node its mounted on. Have you seen the docs here?: > > http://www.redhat.com/docs/manuals/enterprise/ > > That should be enough to get you started, > > Steve. > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From esggrupos at gmail.com Wed Apr 21 15:00:53 2010 From: esggrupos at gmail.com (ESGLinux) Date: Wed, 21 Apr 2010 17:00:53 +0200 Subject: [Linux-cluster] gfs+nfs+lucene, anyone had tried? 
In-Reply-To: <64D0546C5EBBD147B75DE133D798665F055D903E@hugo.eprize.local> References: <64D0546C5EBBD147B75DE133D798665F055D903E@hugo.eprize.local> Message-ID: HI, look at the error that happens when two nodes are writing to the index: java.io.IOException: Stale NFS file handle at java.io.RandomAccessFile.writeBytes(Native Method) at java.io.RandomAccessFile.write(RandomAccessFile.java:466) at org.apache.lucene.store.FSIndexOutput.flushBuffer(FSDirectory.java:503) at org.apache.lucene.store.BufferedIndexOutput.flush(BufferedIndexOutput.java:84) at org.apache.lucene.store.BufferedIndexOutput.close(BufferedIndexOutput.java:98) at org.apache.lucene.store.FSIndexOutput.close(FSDirectory.java:506) at org.apache.lucene.index.FieldsWriter.close(FieldsWriter.java:48) at org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:191) at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:88) at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:709) at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:686) at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:543) I think in this case one node has writen to the index and the other don?t. Could it be a problem? does GFS anything with this situation? Thanks, ESG 2010/4/21 Jeff Sturm > We use Lucene over GFS (no NFS), but the design of our application > updates Lucene from only one node at a time. > > > > In general applications that utilize POSIX locking can handle concurrent > updates safely on GFS even with multiple nodes. It wasn't clear to us > whether Lucene supports this, however, and in your case NFS adds a layer to > the mix. > > > > *From:* linux-cluster-bounces at redhat.com [mailto: > linux-cluster-bounces at redhat.com] *On Behalf Of *ESGLinux > *Sent:* Wednesday, April 21, 2010 8:22 AM > *To:* linux clustering > *Subject:* [Linux-cluster] gfs+nfs+lucene, anyone had tried? > > > > Hi All, > > > > I?m mounting a cluster using NFS over GFS and I?m going to store a lucene > index on it. > > > > There are two nodes that write in this index, and I?m worried about the > index corruption. > > > > So anyone have implemented something like this? any problem I can find? > > > > Thanks in advance, > > > > ESG > > > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From swhiteho at redhat.com Wed Apr 21 15:14:25 2010 From: swhiteho at redhat.com (Steven Whitehouse) Date: Wed, 21 Apr 2010 16:14:25 +0100 Subject: [Linux-cluster] gfs+nfs+lucene, anyone had tried? In-Reply-To: References: <64D0546C5EBBD147B75DE133D798665F055D903E@hugo.eprize.local> Message-ID: <1271862865.2530.42.camel@localhost> Hi, Did you set fsid= on the export? Which NFS version are you using? Steve. 
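For context on why the fsid question matters: NFS file handles are built from the export's fsid, so when the same GFS filesystem is exported by more than one cluster node the fsid has to be set explicitly and identically on every node; otherwise a client whose requests move to the other server presents handles that server does not recognise and gets ESTALE ("Stale NFS file handle"). A minimal sketch of an /etc/exports entry, reusing the path, hostname and fsid value that show up in the etab output later in this thread:

    # identical on every node that exports the shared GFS filesystem
    /nfsdata   nodo1(rw,sync,fsid=45793)   nodo2(rw,sync,fsid=45793)

The number itself is arbitrary; what matters is that it is set explicitly and is the same on every exporting node.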
On Wed, 2010-04-21 at 17:00 +0200, ESGLinux wrote: > HI, > > > look at the error that happens when two nodes are writing to the > index: > > > java.io.IOException: Stale NFS file handle > at java.io.RandomAccessFile.writeBytes(Native Method) > at java.io.RandomAccessFile.write(RandomAccessFile.java:466) > at > org.apache.lucene.store.FSIndexOutput.flushBuffer(FSDirectory.java:503) > at > org.apache.lucene.store.BufferedIndexOutput.flush(BufferedIndexOutput.java:84) > at > org.apache.lucene.store.BufferedIndexOutput.close(BufferedIndexOutput.java:98) > at > org.apache.lucene.store.FSIndexOutput.close(FSDirectory.java:506) > at > org.apache.lucene.index.FieldsWriter.close(FieldsWriter.java:48) > at > org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:191) > at > org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:88) > at > org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:709) > at > org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:686) > at > org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:543) > > I think in this case one node has writen to the index and the other > don?t. Could it be a problem? does GFS anything with this situation? > > > Thanks, > > > ESG > > > > 2010/4/21 Jeff Sturm > We use Lucene over GFS (no NFS), but the design of our > application updates Lucene from only one node at a time. > > > > In general applications that utilize POSIX locking can handle > concurrent updates safely on GFS even with multiple nodes. It > wasn't clear to us whether Lucene supports this, however, and > in your case NFS adds a layer to the mix. > > > > From:linux-cluster-bounces at redhat.com > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of > ESGLinux > Sent: Wednesday, April 21, 2010 8:22 AM > To: linux clustering > Subject: [Linux-cluster] gfs+nfs+lucene, anyone had tried? > > > > > > Hi All, > > > > > I?m mounting a cluster using NFS over GFS and I?m going to > store a lucene index on it. > > > > > > There are two nodes that write in this index, and I?m worried > about the index corruption. > > > > > > So anyone have implemented something like this? any problem I > can find? > > > > > > Thanks in advance, > > > > > > ESG > > > > > > > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From esggrupos at gmail.com Wed Apr 21 16:17:16 2010 From: esggrupos at gmail.com (ESGLinux) Date: Wed, 21 Apr 2010 18:17:16 +0200 Subject: [Linux-cluster] gfs+nfs+lucene, anyone had tried? In-Reply-To: <1271862865.2530.42.camel@localhost> References: <64D0546C5EBBD147B75DE133D798665F055D903E@hugo.eprize.local> <1271862865.2530.42.camel@localhost> Message-ID: Hi, in the file /var/lib/nfs/etabI get this: /nfsdata nodo1(rw,sync,wdelay,hide,nocrossmnt,secure,no_root_squash,no_all_squash,no_subtree_check,secure_locks,acl,fsid=45793,mapping=identity,anonuid=65534,anongid=65534) the version of nfs are this: nfs-utils-1.0.9-42.el5 nfs-utils-lib-1.0.8-7.6.el5 thanks ESG 2010/4/21 Steven Whitehouse > Hi, > > Did you set fsid= on the export? Which NFS version are you using? > > Steve. 
> > On Wed, 2010-04-21 at 17:00 +0200, ESGLinux wrote: > > HI, > > > > > > look at the error that happens when two nodes are writing to the > > index: > > > > > > java.io.IOException: Stale NFS file handle > > at java.io.RandomAccessFile.writeBytes(Native Method) > > at java.io.RandomAccessFile.write(RandomAccessFile.java:466) > > at > > org.apache.lucene.store.FSIndexOutput.flushBuffer(FSDirectory.java:503) > > at > > > org.apache.lucene.store.BufferedIndexOutput.flush(BufferedIndexOutput.java:84) > > at > > > org.apache.lucene.store.BufferedIndexOutput.close(BufferedIndexOutput.java:98) > > at > > org.apache.lucene.store.FSIndexOutput.close(FSDirectory.java:506) > > at > > org.apache.lucene.index.FieldsWriter.close(FieldsWriter.java:48) > > at > > org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:191) > > at > > org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:88) > > at > > org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:709) > > at > > org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:686) > > at > > org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:543) > > > > I think in this case one node has writen to the index and the other > > don?t. Could it be a problem? does GFS anything with this situation? > > > > > > Thanks, > > > > > > ESG > > > > > > > > 2010/4/21 Jeff Sturm > > We use Lucene over GFS (no NFS), but the design of our > > application updates Lucene from only one node at a time. > > > > > > > > In general applications that utilize POSIX locking can handle > > concurrent updates safely on GFS even with multiple nodes. It > > wasn't clear to us whether Lucene supports this, however, and > > in your case NFS adds a layer to the mix. > > > > > > > > From:linux-cluster-bounces at redhat.com > > [mailto:linux-cluster-bounces at redhat.com] On Behalf Of > > ESGLinux > > Sent: Wednesday, April 21, 2010 8:22 AM > > To: linux clustering > > Subject: [Linux-cluster] gfs+nfs+lucene, anyone had tried? > > > > > > > > > > > > Hi All, > > > > > > > > > > I?m mounting a cluster using NFS over GFS and I?m going to > > store a lucene index on it. > > > > > > > > > > > > There are two nodes that write in this index, and I?m worried > > about the index corruption. > > > > > > > > > > > > So anyone have implemented something like this? any problem I > > can find? > > > > > > > > > > > > Thanks in advance, > > > > > > > > > > > > ESG > > > > > > > > > > > > > > > > > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From somsaks at gmail.com Wed Apr 21 19:27:55 2010 From: somsaks at gmail.com (Somsak Sriprayoonsakul) Date: Thu, 22 Apr 2010 02:27:55 +0700 Subject: [Linux-cluster] File system slow & crash Message-ID: Hello, We are using GFS2 on 3 nodes cluster, kernel 2.6.18-164.6.1.el5, RHEL/CentOS5, x86_64 with 8-12GB memory in each node. The underlying storage is HP 2312fc smart array equipped with 12 SAS 15K rpm, configured as RAID10 using 10 HDDs + 2 spares. The array has about 4GB cache. Communication is 4Gbps FC, through HP StorageWorks 8/8 Base e-port SAN Switch. 
Our application is apache version 1.3.41, mostly serving static HTML file + few PHP. Note that, we have to downgrade to 1.3.41 due to application requirement. Apache was configured with 500 MaxClients. Each HTML file is placed in different directory. The PHP script modify HTML file and do some locking prior to HTML modification. We use round-robin DNS to load balance between each web server. The GFS2 storage was formatted with 4 journals, which is run over a LVM volume. We have configured CMAN, QDiskd, Fencing as appropriate and everything works just fine. We used QDiskd since the cluster initially only has 2 nodes. We used manual_fence temporarily since no fencing hardware was configured yet. GFS2 is mounted with noatime,nodiratime option. Initially, the application was running fine. The problem we encountered is that, over time, load average on some nodes would gradually reach about 300-500, where in normal workload the machine should have about 10. When the load piled up, HTML modification will mostly fail. We suspected that this might be plock_rate issue, so we modified cluster.conf configuration as well as adding some more mount options, such as num_glockd=16 and data=writeback to increase the performance. After we successfully reboot the system and mount the volume. We tried ping_pong ( http://wiki.samba.org/index.php/Ping_pong) test to see how fast the lock can perform. The lock speed greatly increase from 100 to 3-5k/sec. However, after running ping_pong on all 3 nodes simultaneously, the ping_pong program hang with D state and we could not kill the process even with SIGKILL. Due to the time constraint, we decided to leave the system as is, letting ping_pong stuck on all nodes while serving web request. After runing for hours, the httpd process got stuck in D state and couldn't be killed. All web serving was not possible at all. We have to reset all machine (unmount was not possible). The machines were back and GFS volume was back to normal. Since we have to reset all machines, I decided to run gfs2_fsck on the volume. So I unmounted GFS2 on all nodes, run gfs2_fsck, answer "y" to many question about freeing block, and I got the volume back. However, the process stuck up occurred again very quickly. More seriously, trying to kill a running process in GFS or unmount it yield kernel panic and suspend the volume. After this, the volume was never back to normal again. The volume will crash (kernel panic) almost immediately when we try to write something to it. This happened even if I removed mount option and just leave noatime and nodiratime. I didn't run gfs2_fsck again yet, since we decided to leave it as is and trying to backup as much data as possible. Sorry for such a long story. In summary, my question is - What could be the cause of load average pile up? Note that sometimes happened only on some nodes, although DNS round robin should fairly distribute workload to all nodes. At the least the load different shouldn't be that much. - Should we run gfs2_fsck again? Why the lock up occur? I have attached our cluster.conf as well as kernel panic log with this e-mail. Thank you very much in advance Best Regards, =========================================== Somsak Sriprayoonsakul INOX -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: cluster.conf Type: application/octet-stream Size: 1635 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: panic.log Type: text/x-log Size: 3044 bytes Desc: not available URL: From somsaks at gmail.com Wed Apr 21 19:29:33 2010 From: somsaks at gmail.com (Somsak Sriprayoonsakul) Date: Thu, 22 Apr 2010 02:29:33 +0700 Subject: [Linux-cluster] File system slow & crash In-Reply-To: References: Message-ID: Just notice that, on a node it is using kernel version 2.6.18-164.15.1.el5. Don't sure if the difference has any effect. On Thu, Apr 22, 2010 at 2:27 AM, Somsak Sriprayoonsakul wrote: > Hello, > > We are using GFS2 on 3 nodes cluster, kernel 2.6.18-164.6.1.el5, > RHEL/CentOS5, x86_64 with 8-12GB memory in each node. The underlying storage > is HP 2312fc smart array equipped with 12 SAS 15K rpm, configured as RAID10 > using 10 HDDs + 2 spares. The array has about 4GB cache. Communication is > 4Gbps FC, through HP StorageWorks 8/8 Base e-port SAN Switch. > > Our application is apache version 1.3.41, mostly serving static HTML file + > few PHP. Note that, we have to downgrade to 1.3.41 due to application > requirement. Apache was configured with 500 MaxClients. Each HTML file is > placed in different directory. The PHP script modify HTML file and do some > locking prior to HTML modification. We use round-robin DNS to load balance > between each web server. > > The GFS2 storage was formatted with 4 journals, which is run over a LVM > volume. We have configured CMAN, QDiskd, Fencing as appropriate and > everything works just fine. We used QDiskd since the cluster initially only > has 2 nodes. We used manual_fence temporarily since no fencing hardware was > configured yet. GFS2 is mounted with noatime,nodiratime option. > > Initially, the application was running fine. The problem we encountered is > that, over time, load average on some nodes would gradually reach about > 300-500, where in normal workload the machine should have about 10. When the > load piled up, HTML modification will mostly fail. > > We suspected that this might be plock_rate issue, so we modified > cluster.conf configuration as well as adding some more mount options, such > as num_glockd=16 and data=writeback to increase the performance. After we > successfully reboot the system and mount the volume. We tried ping_pong ( > http://wiki.samba.org/index.php/Ping_pong) test to see how fast the lock > can perform. The lock speed greatly increase from 100 to 3-5k/sec. However, > after running ping_pong on all 3 nodes simultaneously, the ping_pong program > hang with D state and we could not kill the process even with SIGKILL. > > Due to the time constraint, we decided to leave the system as is, letting > ping_pong stuck on all nodes while serving web request. After runing for > hours, the httpd process got stuck in D state and couldn't be killed. All > web serving was not possible at all. We have to reset all machine (unmount > was not possible). The machines were back and GFS volume was back to normal. > > > Since we have to reset all machines, I decided to run gfs2_fsck on the > volume. So I unmounted GFS2 on all nodes, run gfs2_fsck, answer "y" to many > question about freeing block, and I got the volume back. However, the > process stuck up occurred again very quickly. More seriously, trying to kill > a running process in GFS or unmount it yield kernel panic and suspend the > volume. > > After this, the volume was never back to normal again. 
The volume will > crash (kernel panic) almost immediately when we try to write something to > it. This happened even if I removed mount option and just leave noatime and > nodiratime. I didn't run gfs2_fsck again yet, since we decided to leave it > as is and trying to backup as much data as possible. > > Sorry for such a long story. In summary, my question is > > > - What could be the cause of load average pile up? Note that sometimes > happened only on some nodes, although DNS round robin should fairly > distribute workload to all nodes. At the least the load different shouldn't > be that much. > - Should we run gfs2_fsck again? Why the lock up occur? > > > I have attached our cluster.conf as well as kernel panic log with this > e-mail. > > > Thank you very much in advance > > Best Regards, > > =========================================== > Somsak Sriprayoonsakul > > INOX > -------------- next part -------------- An HTML attachment was scrubbed... URL: From swhiteho at redhat.com Thu Apr 22 09:56:25 2010 From: swhiteho at redhat.com (Steven Whitehouse) Date: Thu, 22 Apr 2010 10:56:25 +0100 Subject: [Linux-cluster] File system slow & crash In-Reply-To: References: Message-ID: <1271930185.2748.19.camel@localhost> Hi, On Thu, 2010-04-22 at 02:29 +0700, Somsak Sriprayoonsakul wrote: > Just notice that, on a node it is using kernel version > 2.6.18-164.15.1.el5. Don't sure if the difference has any effect. > > On Thu, Apr 22, 2010 at 2:27 AM, Somsak Sriprayoonsakul > wrote: > Hello, > > We are using GFS2 on 3 nodes cluster, kernel > 2.6.18-164.6.1.el5, RHEL/CentOS5, x86_64 with 8-12GB memory in > each node. The underlying storage is HP 2312fc smart array > equipped with 12 SAS 15K rpm, configured as RAID10 using 10 > HDDs + 2 spares. The array has about 4GB cache. Communication > is 4Gbps FC, through HP StorageWorks 8/8 Base e-port SAN > Switch. > > Our application is apache version 1.3.41, mostly serving > static HTML file + few PHP. Note that, we have to downgrade to > 1.3.41 due to application requirement. Apache was configured > with 500 MaxClients. Each HTML file is placed in different > directory. The PHP script modify HTML file and do some locking > prior to HTML modification. We use round-robin DNS to load > balance between each web server. > Is the PHP script creating new html files (and therefore also new directories) or just modifying existing ones? Ideally you'd set up the system so that all accesses to a particular html file all go to the same node under normal circumstances and only fail over to a different node in the case of that particular node failing. That way you will ensure locality of access under normal conditions and thus get the maximum benefit from the cluster filesystem. >From your description I suspect that its the I/O pattern across nodes which is causing the main problem which you describe. I suspect that the DNS round robin is making the situation worse since it will be effectively randomly assigning requests to nodes. Having said that, killing processes using GFS2 or trying to umount it should not cause an oops. The kill maybe ignored for processes in 'D' (uninterruptible sleep) and likewise the umount may fail with -EBUSY, but any oops is a bug. Please report it via Red Hat's bugzilla. 
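(For reference, the ping_pong test mentioned earlier in this thread does nothing more than take and drop POSIX byte-range locks on a file in the shared filesystem as fast as it can, which is why it is a handy probe of plock throughput. A rough single-node sketch of the same idea, not the actual tool, which bounces the locks between nodes; the file path is just a placeholder:

    import fcntl, os, time

    # hammer fcntl byte-range locks on a file in the GFS2 mount
    fd = os.open("/mnt/gfs2/locktest", os.O_RDWR | os.O_CREAT)
    count = 0
    start = time.time()
    while time.time() - start < 10:
        fcntl.lockf(fd, fcntl.LOCK_EX, 1, count % 3)  # lock one byte
        fcntl.lockf(fd, fcntl.LOCK_UN, 1, count % 3)  # and release it
        count += 1
    print("approx locks/sec: %d" % (count // 10))
    os.close(fd)

Comparing the rate one node reports with the rate seen while other nodes run the same loop on the same file gives a feel for the plock behaviour the poster was trying to tune.)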
Using the num_glockd= command line parameter is not recommended with GFS2 (in fact it doesn't exist/is ignored in more recent versions) and setting data=writeback may or may not actually improve performance (it depends upon the individual workload) but it does increase the possibility of seeing corrupt data if there is a crash. I would generally caution against using data=writeback except in very special cases.

Steve.

From mylinuxhalist at gmail.com Thu Apr 22 13:48:37 2010
From: mylinuxhalist at gmail.com (My LinuxHAList)
Date: Thu, 22 Apr 2010 09:48:37 -0400
Subject: [Linux-cluster] fence_ilo halt instead reboot
In-Reply-To: <1271688898.11980.8.camel@localhost.localdomain>
References: <1271688898.11980.8.camel@localhost.localdomain>
Message-ID:

Neat; hopefully it will be available soon.

On Mon, Apr 19, 2010 at 10:54 AM, Lon Hohberger wrote: > On Wed, 2010-04-14 at 19:07 +0200, ESGLinux wrote: > > Hi All, > > > > > > I'm configuring a two node cluster (they are HP ProLiant DL380 G5) and > > I have configured the fence nodes this way: > > > > > > > login="Administrator" name="ILONODE2" passwd="xxxx"/> > > > > > > the problem is that I have run the command fence_node NODE2 and I have > > seen the halt message and it hasn't restarted > > > > It should be 'immediate power off'; sounds like you need to disable > acpid. > > Also, take a look at: > > https://bugzilla.redhat.com/show_bug.cgi?id=507514 > > -- Lon > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL:

From fdinitto at redhat.com Thu Apr 22 14:31:32 2010
From: fdinitto at redhat.com (Fabio M. Di Nitto)
Date: Thu, 22 Apr 2010 16:31:32 +0200
Subject: [Linux-cluster] [PATCH] cman/init.d/cman.in: don't assume chkconfig exists
In-Reply-To: <4BA865D0.8070305@redhat.com>
References: <20100310212707.GA6057@bogon.sigxcpu.org> <4BA865D0.8070305@redhat.com>
Message-ID: <4BD05DC4.6000806@redhat.com>

Hi Guido, I haven't seen any reply to my request for info. Did I lose emails or miss one? Thanks Fabio

On 3/23/2010 7:55 AM, Fabio M. Di Nitto wrote: > Hi Guido, > > in future, can you please send patches to cluster-devel at redhat.com? It's > easier for me to spot them. > > On 3/10/2010 10:27 PM, Guido Günther wrote: >> Hi, >> attached patch makes sure we don't rely on chkconfig (which doesn't >> exist on Debian based distros). It also checks additionally for >> network-manager since this is the name of the service on Debian/Ubuntu. >> Cheers, >> -- Guido > > I am not entirely sure why we need this patch. > >> # deb based distros >> if [ -d /etc/default ]; then >> [ -f /etc/default/cluster ] && . /etc/default/cluster >> [ -f /etc/default/cman ] && . /etc/default/cman >> [ -z "$LOCK_FILE" ] && LOCK_FILE="/var/lock/cman" >> type chkconfig > /dev/null 2>&1 || alias chkconfig=local_chkconfig >> fi > > local_chkconfig mimics chkconfig behavior on Debian based systems that > don't have chkconfig. > > It was tested successfully on Ubuntu, but I don't see why it would work > on Debian. If the local_chkconfig is broken, then we need to fix that as > it is used also for xen_bridge workaround.
> > Thanks > Fabio > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From are at gmx.es Fri Apr 23 09:57:48 2010 From: are at gmx.es (Alex Re) Date: Fri, 23 Apr 2010 11:57:48 +0200 Subject: [Linux-cluster] VKM guest managed by cluster Message-ID: <4BD16F1C.7020004@gmx.es> Hi! I have been trying to get a KVM guest running as a clustered service (two node cluster with GFS2 shared images), in order to restart the guest on the alive cluster node, in case the other node crashes. The problem is that I can't get the VM service managed by the cluster daemons (manually start/stop/live migrate my VM guest works fine). This is how my "cluster.conf" file looks like: And these are the errors I'm getting at syslog: Apr 23 11:28:44 nodeB clurgmgrd[5490]: Resource Group Manager Starting Apr 23 11:28:44 nodeB clurgmgrd[5490]: Loading Service Data Apr 23 11:28:45 nodeB clurgmgrd[5490]: Initializing Services Apr 23 11:28:45 nodeB clurgmgrd: [5490]: xend/libvirtd is dead; cannot stop guest00 Apr 23 11:28:45 nodeB clurgmgrd[5490]: stop on vm "guest00" returned 1 (generic error) Apr 23 11:28:45 nodeB clurgmgrd[5490]: Services Initialized Apr 23 11:28:45 nodeB clurgmgrd[5490]: State change: Local UP Apr 23 11:28:51 nodeB clurgmgrd[5490]: Starting stopped service service:guest00_service Apr 23 11:28:51 nodeB clurgmgrd[5490]: start on vm "guest00" returned 127 (unspecified) Apr 23 11:28:51 nodeB clurgmgrd[5490]: #68: Failed to start service:guest00_service; return value: 1 Apr 23 11:28:51 nodeB clurgmgrd[5490]: Stopping service service:guest00_service Apr 23 11:28:51 nodeB clurgmgrd: [5490]: xend/libvirtd is dead; cannot stop guest00 Apr 23 11:28:51 nodeB clurgmgrd[5490]: stop on vm "guest00" returned 1 (generic error) Apr 23 11:28:51 nodeB clurgmgrd[5490]: #12: RG service:guest00_service failed to stop; intervention required Apr 23 11:28:51 nodeB clurgmgrd[5490]: Service service:guest00_service is failed Apr 23 11:28:51 nodeB clurgmgrd[5490]: #13: Service service:guest00_service failed to stop cleanly I have checked the status of the libvirtd daemon, and it's running fine: [root at nodeB ~]# service libvirtd status libvirtd (pid 5352) is running... And all VM guests management using "virsh" is also running fine. I'm using: "cman-2.0.115-1.el5_4.9", "rgmanager-2.0.52-1.el5.centos.2", "libvirt-0.6.3-20.1.el5_4" I'm missing something on the "cluster.conf"??? Or at the libvirtd daemon?? Thanks for your help! Alex. -------------- next part -------------- An HTML attachment was scrubbed... URL: From are at gmx.es Fri Apr 23 13:13:58 2010 From: are at gmx.es (Alex Re) Date: Fri, 23 Apr 2010 15:13:58 +0200 Subject: [Linux-cluster] VKM guest managed by cluster In-Reply-To: <4BD16F1C.7020004@gmx.es> References: <4BD16F1C.7020004@gmx.es> Message-ID: <4BD19D16.4040607@gmx.es> Hi, I have discovered what was I doing wrong... at the Hi! > I have been trying to get a KVM guest running as a clustered service > (two node cluster with GFS2 shared images), in order to restart the > guest on the alive cluster node, in case the other node crashes. The > problem is that I can't get the VM service managed by the cluster > daemons (manually start/stop/live migrate my VM guest works fine). 
> This is how my "cluster.conf" file looks like: > > > > > > > > > > > > > > > > > > > > > > > > > > > > name="nodeA_ilo" passwd="hpinvent"/> > name="nodeB_ilo" passwd="hpinvent"/> > > > > > > > > > name="guest00_service" recovery="relocate"> > hypervisor="qemu" name="guest00" hypervisor_uri="qemu+ssh:///system" > path="/etc/libvirt/qemu/guest00.xml"> > > > > > > > > > And these are the errors I'm getting at syslog: > Apr 23 11:28:44 nodeB clurgmgrd[5490]: Resource Group Manager > Starting > Apr 23 11:28:44 nodeB clurgmgrd[5490]: Loading Service Data > Apr 23 11:28:45 nodeB clurgmgrd[5490]: Initializing Services > Apr 23 11:28:45 nodeB clurgmgrd: [5490]: xend/libvirtd is dead; > cannot stop guest00 > Apr 23 11:28:45 nodeB clurgmgrd[5490]: stop on vm "guest00" > returned 1 (generic error) > Apr 23 11:28:45 nodeB clurgmgrd[5490]: Services Initialized > Apr 23 11:28:45 nodeB clurgmgrd[5490]: State change: Local UP > Apr 23 11:28:51 nodeB clurgmgrd[5490]: Starting stopped > service service:guest00_service > Apr 23 11:28:51 nodeB clurgmgrd[5490]: start on vm "guest00" > returned 127 (unspecified) > Apr 23 11:28:51 nodeB clurgmgrd[5490]: #68: Failed to start > service:guest00_service; return value: 1 > Apr 23 11:28:51 nodeB clurgmgrd[5490]: Stopping service > service:guest00_service > Apr 23 11:28:51 nodeB clurgmgrd: [5490]: xend/libvirtd is dead; > cannot stop guest00 > Apr 23 11:28:51 nodeB clurgmgrd[5490]: stop on vm "guest00" > returned 1 (generic error) > Apr 23 11:28:51 nodeB clurgmgrd[5490]: #12: RG > service:guest00_service failed to stop; intervention required > Apr 23 11:28:51 nodeB clurgmgrd[5490]: Service > service:guest00_service is failed > Apr 23 11:28:51 nodeB clurgmgrd[5490]: #13: Service > service:guest00_service failed to stop cleanly > > I have checked the status of the libvirtd daemon, and it's running fine: > [root at nodeB ~]# service libvirtd status > libvirtd (pid 5352) is running... > > And all VM guests management using "virsh" is also running fine. > I'm using: "cman-2.0.115-1.el5_4.9", > "rgmanager-2.0.52-1.el5.centos.2", "libvirt-0.6.3-20.1.el5_4" > > I'm missing something on the "cluster.conf"??? Or at the libvirtd daemon?? > Thanks for your help! > > Alex. > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: From cvermejo at softwarelibreandino.com Fri Apr 23 13:25:34 2010 From: cvermejo at softwarelibreandino.com (Carlos VERMEJO RUIZ) Date: Fri, 23 Apr 2010 08:25:34 -0500 (PET) Subject: [Linux-cluster] Problem with service migration with xen domU on diferent dom0 with redhat 5.4 In-Reply-To: <13343996.24374.1272028807120.JavaMail.root@zimbra.softwarelibreandino.com> Message-ID: <30657054.24377.1272029134879.JavaMail.root@zimbra.softwarelibreandino.com> Dear Sir / Madame: I am implementing a two node cluster on domU providing apache service "webby" we have them on different dom0. This apache service also are load balancing a JBoss virtual machines but them are not part of the cluster, also I have configured a virtual machine with iscsi target to provide a shared quorum disk so our quorum is 2 votes from 3. The first thing that I noticed is that when I finished configuring the cluster with luci the service webby does not start automatically. I have to enable the service and them it started. Initially I had a problem with the xvm_fence. 
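On the earlier point about the service only coming up after being enabled by hand: rgmanager leaves a service alone while it is in the disabled state (for example if the service element was created with autostart="0", or the service was disabled previously), so enabling it once is expected in that case. A short sketch of the usual commands, using the service name from this thread; the member name is a placeholder:

    clusvcadm -e webby              # enable the service
    clustat                         # see which node is running it
    clusvcadm -r webby -m <member>  # relocate it by hand, if wanted

If the service should start on boot without intervention, check that autostart is not set to 0 on the service definition.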
When I configured in dom0 an individual cluster and start cman on dom0 it used to start fence_xvmd but in one place I read that dom0 had to be in anothe cluster so I created anothe cluster with both dom0, but now they are not starting the fence_xvmd. That is why I am using fence_xvmd as a standalone with this config: fence_xvmd -LX -a 225.0.0.1 -I eth3 When I try from the domU to fence from command line it worked I use the command: fence_xvm -a 225.0.0.1 -I eth1 -H frederick -ddd -o null and produced: Waiting for connection from XVM host daemon. Issuing TCP challenge Responding to TCP challenge TCP Exchange + Authentication done... Waiting for return value from XVM host Remote: Operation failed In luci I configured the multicast address 225.0.0.1 and interface eth1 for cluster on domU and multicast address 225.0.0.1 and interface eth3 on dom0 by CLI Perhaps the problem I have is for the keys. I use one key that is shared between dom0 and domU on server1 and another key that is also shared between dom0 and domU on server2. Also on server1 I copied the key fence_xvm.key as fence_xvm-host1.key and distibuted to the other domU and both dom0. on server2 I copied the key fence_xvm.key as fence_xvm-host2.key and distibuted to the the other domU and both dom0 My cluster config is the following: Another strange thing is when I do a clustat on vmapache1 it recognizes the webby service as started on vmapache1and both nodes and quorumdisk online but on vmapache clustat only shows both nodes and the quorumdisk online, nothing abour any service. This is the log when I tried to make a migration: Apr 22 21:39:14 vmapache01 ccsd[2183]: Update of cluster.conf complete (version 51 -> 52). Apr 22 21:39:23 vmapache01 clurgmgrd[2331]: Reconfiguring Apr 22 21:39:23 vmapache01 clurgmgrd[2331]: Loading Service Data Apr 22 21:39:23 vmapache01 clurgmgrd[2331]: Applying new configuration #52 Apr 22 21:39:23 vmapache01 clurgmgrd[2331]: Stopping changed resources. Apr 22 21:39:23 vmapache01 clurgmgrd[2331]: Restarting changed resources. Apr 22 21:39:23 vmapache01 clurgmgrd[2331]: Starting changed resources. 
Apr 22 21:40:07 vmapache01 clurgmgrd[2331]: Stopping service service:webby Apr 22 21:40:07 vmapache01 clurgmgrd: [2331]: Stopping Service apache:httpd Apr 22 21:40:07 vmapache01 clurgmgrd: [2331]: Checking Existence Of File /var/run/cluster/apache/apache:httpd.pid [apache:httpd] > Failed - File Doesn't Exist Apr 22 21:40:07 vmapache01 clurgmgrd: [2331]: Stopping Service apache:httpd > Succeed Apr 22 21:40:07 vmapache01 clurgmgrd[2331]: Service service:webby is disabled Apr 22 21:40:07 vmapache01 clurgmgrd[2331]: Starting disabled service service:webby Apr 22 21:40:08 vmapache01 clurgmgrd: [2331]: Adding IPv4 address 172.19.52.120/24 to eth0 Apr 22 21:40:09 vmapache01 clurgmgrd: [2331]: Starting Service apache:httpd Apr 22 21:40:09 vmapache01 clurgmgrd[2331]: Service service:webby started Apr 22 21:43:29 vmapache01 qdiskd[5855]: Quorum Daemon Initializing Apr 22 21:43:30 vmapache01 qdiskd[5855]: Heuristic: 'ping -c1 -t1 172.19.52.119' UP Apr 22 21:43:49 vmapache01 qdiskd[5855]: Initial score 1/1 Apr 22 21:43:49 vmapache01 qdiskd[5855]: Initialization complete Apr 22 21:43:49 vmapache01 openais[2189]: [CMAN ] quorum device registered Apr 22 21:43:49 vmapache01 qdiskd[5855]: Score sufficient for master operation (1/1; required=1); upgrading Apr 22 21:44:13 vmapache01 qdiskd[5855]: Assuming master role Apr 22 21:47:31 vmapache01 clurgmgrd[2331]: Stopping service service:webby Apr 22 21:47:31 vmapache01 clurgmgrd: [2331]: Stopping Service apache:httpd Apr 22 21:47:33 vmapache01 clurgmgrd: [2331]: Stopping Service apache:httpd > Failed - Application Is Still Running Apr 22 21:47:33 vmapache01 clurgmgrd: [2331]: Stopping Service apache:httpd > Failed Apr 22 21:47:33 vmapache01 clurgmgrd[2331]: stop on apache "httpd" returned 1 (generic error) Apr 22 21:47:33 vmapache01 clurgmgrd: [2331]: Removing IPv4 address 172.19.52.120/24 from eth0 Apr 22 21:47:43 vmapache01 clurgmgrd[2331]: #12: RG service:webby failed to stop; intervention required Apr 22 21:47:43 vmapache01 clurgmgrd[2331]: Service service:webby is failed Apr 22 21:47:43 vmapache01 clurgmgrd[2331]: #70: Failed to relocate service:webby; restarting locally Apr 22 21:47:43 vmapache01 clurgmgrd[2331]: #43: Service service:webby has failed; can not start. Apr 22 21:47:43 vmapache01 clurgmgrd[2331]: #2: Service service:webby returned failure code. Last Owner: 172.19.52.121 Apr 22 21:47:43 vmapache01 clurgmgrd[2331]: #4: Administrator intervention required. 
Apr 22 21:50:31 vmapache01 clurgmgrd[2331]: Stopping service service:webby Apr 22 21:50:31 vmapache01 clurgmgrd: [2331]: Stopping Service apache:httpd Apr 22 21:50:31 vmapache01 clurgmgrd: [2331]: Checking Existence Of File /var/run/cluster/apache/apache:httpd.pid [apache:httpd] > Failed - File Doesn't Exist Apr 22 21:50:31 vmapache01 clurgmgrd: [2331]: Stopping Service apache:httpd > Succeed Apr 22 21:50:31 vmapache01 clurgmgrd[2331]: Service service:webby is disabled Apr 22 21:50:31 vmapache01 clurgmgrd[2331]: Starting disabled service service:webby Apr 22 21:50:31 vmapache01 clurgmgrd: [2331]: Adding IPv4 address 172.19.52.120/24 to eth0 Apr 22 21:50:32 vmapache01 clurgmgrd: [2331]: Starting Service apache:httpd Apr 22 21:50:33 vmapache01 clurgmgrd[2331]: Service service:webby started Apr 22 21:50:50 vmapache01 clurgmgrd[2331]: Stopping service service:webby Apr 22 21:50:51 vmapache01 clurgmgrd: [2331]: Stopping Service apache:httpd Apr 22 21:50:52 vmapache01 clurgmgrd: [2331]: Stopping Service apache:httpd > Failed - Application Is Still Running Apr 22 21:50:52 vmapache01 clurgmgrd: [2331]: Stopping Service apache:httpd > Failed Apr 22 21:50:52 vmapache01 clurgmgrd[2331]: stop on apache "httpd" returned 1 (generic error) Apr 22 21:50:52 vmapache01 clurgmgrd: [2331]: Removing IPv4 address 172.19.52.120/24 from eth0 Apr 22 21:51:02 vmapache01 clurgmgrd[2331]: #12: RG service:webby failed to stop; intervention required Apr 22 21:51:02 vmapache01 clurgmgrd[2331]: Service service:webby is failed Apr 22 21:51:02 vmapache01 clurgmgrd[2331]: #70: Failed to relocate service:webby; restarting locally Apr 22 21:51:02 vmapache01 clurgmgrd[2331]: #43: Service service:webby has failed; can not start. Apr 22 21:51:02 vmapache01 clurgmgrd[2331]: #2: Service service:webby returned failure code. Last Owner: 172.19.52.121 Apr 22 21:51:02 vmapache01 clurgmgrd[2331]: #4: Administrator intervention required. Apr 22 21:52:41 vmapache01 clurgmgrd[2331]: Stopping service service:webby Apr 22 21:52:41 vmapache01 clurgmgrd: [2331]: Stopping Service apache:httpd Apr 22 21:52:41 vmapache01 clurgmgrd: [2331]: Checking Existence Of File /var/run/cluster/apache/apache:httpd.pid [apache:httpd] > Failed - File Doesn't Exist If you see something wrong let me know , Any help or ideas will be appreciated. Best regards, ----------------------------------------- Carlos Vermejo Ruiz ------------------------------------------- -------------- next part -------------- An HTML attachment was scrubbed... URL: From Alain.Moulle at bull.net Fri Apr 23 13:28:19 2010 From: Alain.Moulle at bull.net (Alain.Moulle) Date: Fri, 23 Apr 2010 15:28:19 +0200 Subject: [Linux-cluster] Question about CS5 versus CS4 : cman heartbeat timer and DLM_LOCK_TIMEOUT Message-ID: <4BD1A073.2040708@bull.net> Hi, there was an old problem I faced with CS4 : if we needed to increment the heartbeat timer (which was 21 by default) , there was a linked timer to change in DLM : DLM_LOCK_TIMEOUT which must always be ~1.5 the cman heartbeat timer. I just wanted to know if there is a such association to take care but with CS5 ? 
Thanks Regards Alain From ccaulfie at redhat.com Fri Apr 23 13:35:34 2010 From: ccaulfie at redhat.com (Christine Caulfield) Date: Fri, 23 Apr 2010 14:35:34 +0100 Subject: [Linux-cluster] Question about CS5 versus CS4 : cman heartbeat timer and DLM_LOCK_TIMEOUT In-Reply-To: <4BD1A073.2040708@bull.net> References: <4BD1A073.2040708@bull.net> Message-ID: <4BD1A226.5060108@redhat.com> On 23/04/10 14:28, Alain.Moulle wrote: > Hi, > > there was an old problem I faced with CS4 : if we needed to increment the > heartbeat timer (which was 21 by default) , there was a linked timer to > change > in DLM : DLM_LOCK_TIMEOUT which must always be ~1.5 the cman heartbeat > timer. > > I just wanted to know if there is a such association to take care but > with CS5 ? No, there's no such thing as the DLM lock timer in RHEL 5. One less thing to worry about :-) Chrissie From jbayles at readytechs.com Fri Apr 23 14:31:26 2010 From: jbayles at readytechs.com (Jonathan Bayles) Date: Fri, 23 Apr 2010 10:31:26 -0400 Subject: [Linux-cluster] Blueprint needed WTS with LVS Message-ID: <386FCF83D8086E4A89655E41CD3B53D359F0CFBF6A@rtexch01> Greetings List, I have a windows terminal server cluster currently controlled by NLB that isn't balancing very well. I would like to see about using LVS to give me a more robust way to control the load. Does anyone have any experience with this, are there any blueprints for this setup? Thanks! -------------- next part -------------- An HTML attachment was scrubbed... URL: From jcasale at activenetwerx.com Fri Apr 23 16:46:50 2010 From: jcasale at activenetwerx.com (Joseph L. Casale) Date: Fri, 23 Apr 2010 16:46:50 +0000 Subject: [Linux-cluster] iptables rules Message-ID: I am getting around to enabling iptables on my two node cluster and wondered how common it was to employ outbound rules along with inbound rules. I presume that locking down the in/out directions on the heartbeat interface would mandate an explicit tag... So aside from the obvious requirements of the individual services running on the nodes, any other things to be aware of when doing this? Thanks! jlc From cthulhucalling at gmail.com Fri Apr 23 17:05:20 2010 From: cthulhucalling at gmail.com (Ian Hayes) Date: Fri, 23 Apr 2010 10:05:20 -0700 Subject: [Linux-cluster] iptables rules In-Reply-To: References: Message-ID: We do it all the time. Redhat has a few very good KB articles on how to set up iptables to support RHCS On Fri, Apr 23, 2010 at 9:46 AM, Joseph L. Casale wrote: > I am getting around to enabling iptables on my two node cluster > and wondered how common it was to employ outbound rules along with > inbound rules. > > I presume that locking down the in/out directions on the heartbeat > interface would mandate an explicit tag... > > So aside from the obvious requirements of the individual services running > on the nodes, any other things to be aware of when doing this? > > Thanks! > jlc > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jcasale at activenetwerx.com Fri Apr 23 18:43:34 2010 From: jcasale at activenetwerx.com (Joseph L. Casale) Date: Fri, 23 Apr 2010 18:43:34 +0000 Subject: [Linux-cluster] iptables rules In-Reply-To: References: Message-ID: >We do it all the time. 
Redhat has a few very good KB articles on how to set up iptables to support RHCS

Hi, I believe I found the article you reference, http://kbase.redhat.com/faq/docs/DOC-8782 and enabling it on a passive node causes no issues until migrating the service. It just silently fails w/o any specific indication as to why in the logs? Do your rules mirror what's documented in the above article? Thanks! jlc

From cthulhucalling at gmail.com Fri Apr 23 19:33:24 2010
From: cthulhucalling at gmail.com (Ian Hayes)
Date: Fri, 23 Apr 2010 12:33:24 -0700
Subject: [Linux-cluster] iptables rules
In-Reply-To: References: Message-ID:

Pretty much, I use a different KB article. What you might want to do is put in a log rule at the end of the chain to see what traffic isn't being allowed.

On Fri, Apr 23, 2010 at 11:43 AM, Joseph L. Casale < jcasale at activenetwerx.com> wrote: > >We do it all the time. Redhat has a few very good KB articles on how to > set up iptables to support RHCS > > Hi, > I believe I found the article you reference, > http://kbase.redhat.com/faq/docs/DOC-8782 > and enabling it on a passive node causes no issues until migrating the > service. > > It just silently fails w/o any specific indication as to why in the logs? > > Do your rules mirror what's documented in the above article? > > Thanks! > jlc > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL:

From cvermejo at softwarelibreandino.com Sat Apr 24 03:41:35 2010
From: cvermejo at softwarelibreandino.com (Carlos VERMEJO RUIZ)
Date: Fri, 23 Apr 2010 22:41:35 -0500 (PET)
Subject: [Linux-cluster] Problem with service migration with xen domU on diferent dom0 with redhat 5.4
In-Reply-To: <11075438.24746.1272080453689.JavaMail.root@zimbra.softwarelibreandino.com>
Message-ID: <33191117.24749.1272080495681.JavaMail.root@zimbra.softwarelibreandino.com>

There are two things that I would try. The first one is that the problem seems that multicast traffic is not being propagated well between nodes. One point that I did not mention is that all traffic is going through firewalls and switches; though I opened tcp and udp traffic, I am not so sure about multicast traffic. I made the test with fence_xvm -a 225.0.0.1 -I eth1 -H vmapache1 -ddd -o null but when I try through the luci interface it did not work. Also multicast interfaces are eth1 on domUs and eth3 on dom0s, so perhaps some point in my config files has something wrong or I have to configure the multicast traffic on the linux interface.
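(Since both of the last two threads come down to letting cluster traffic through a firewall, here is an illustrative fragment for the interconnect interface. The interface name and the 225.0.0.1 group are the ones used in this thread; 5404/5405 UDP are the standard openais/cman ports, and 1229 is the usual default for fence_xvm/fence_xvmd, but treat the exact list as an assumption and check the Red Hat KB article referenced above for the full set of ports used by ricci, ccsd, dlm and rgmanager:

    iptables -A INPUT -i eth1 -p igmp -j ACCEPT
    iptables -A INPUT -i eth1 -d 225.0.0.1 -j ACCEPT                     # cluster multicast group
    iptables -A INPUT -i eth1 -p udp -m udp --dport 5404:5405 -j ACCEPT  # openais/cman
    iptables -A INPUT -i eth1 -p tcp -m tcp --dport 1229 -j ACCEPT       # fence_xvm callback
    iptables -A INPUT -i eth1 -p udp -m udp --dport 1229 -j ACCEPT
    iptables -A INPUT -i eth1 -j LOG --log-prefix "cluster-drop: "       # the log rule suggested above

The trailing LOG rule is the quickest way to see what is still being dropped before the chain's policy or a final REJECT takes effect.)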
Best regards, Carlos Vermejo Ruiz Dear Sir / Madame: I am implementing a two node cluster on domU providing apache service "webby" we have them on different dom0. This apache service also are load balancing a JBoss virtual machines but them are not part of the cluster, also I have configured a virtual machine with iscsi target to provide a shared quorum disk so our quorum is 2 votes from 3. The first thing that I noticed is that when I finished configuring the cluster with luci the service webby does not start automatically. I have to enable the service and them it started. Initially I had a problem with the xvm_fence. When I configured in dom0 an individual cluster and start cman on dom0 it used to start fence_xvmd but in one place I read that dom0 had to be in anothe cluster so I created anothe cluster with both dom0, but now they are not starting the fence_xvmd. That is why I am using fence_xvmd as a standalone with this config: fence_xvmd -LX -a 225.0.0.1 -I eth3 When I try from the domU to fence from command line it worked I use the command: fence_xvm -a 225.0.0.1 -I eth1 -H frederick -ddd -o null and produced: Waiting for connection from XVM host daemon. Issuing TCP challenge Responding to TCP challenge TCP Exchange + Authentication done... Waiting for return value from XVM host Remote: Operation failed In luci I configured the multicast address 225.0.0.1 and interface eth1 for cluster on domU and multicast address 225.0.0.1 and interface eth3 on dom0 by CLI Perhaps the problem I have is for the keys. I use one key that is shared between dom0 and domU on server1 and another key that is also shared between dom0 and domU on server2. Also on server1 I copied the key fence_xvm.key as fence_xvm-host1.key and distibuted to the other domU and both dom0. on server2 I copied the key fence_xvm.key as fence_xvm-host2.key and distibuted to the the other domU and both dom0 My cluster config is the following: Another strange thing is when I do a clustat on vmapache1 it recognizes the webby service as started on vmapache1and both nodes and quorumdisk online but on vmapache clustat only shows both nodes and the quorumdisk online, nothing abour any service. This is the log when I tried to make a migration: Apr 22 21:39:14 vmapache01 ccsd[2183]: Update of cluster.conf complete (version 51 -> 52). Apr 22 21:39:23 vmapache01 clurgmgrd[2331]: Reconfiguring Apr 22 21:39:23 vmapache01 clurgmgrd[2331]: Loading Service Data Apr 22 21:39:23 vmapache01 clurgmgrd[2331]: Applying new configuration #52 Apr 22 21:39:23 vmapache01 clurgmgrd[2331]: Stopping changed resources. Apr 22 21:39:23 vmapache01 clurgmgrd[2331]: Restarting changed resources. Apr 22 21:39:23 vmapache01 clurgmgrd[2331]: Starting changed resources. 
Apr 22 21:40:07 vmapache01 clurgmgrd[2331]: Stopping service service:webby Apr 22 21:40:07 vmapache01 clurgmgrd: [2331]: Stopping Service apache:httpd Apr 22 21:40:07 vmapache01 clurgmgrd: [2331]: Checking Existence Of File /var/run/cluster/apache/apache:httpd.pid [apache:httpd] > Failed - File Doesn't Exist Apr 22 21:40:07 vmapache01 clurgmgrd: [2331]: Stopping Service apache:httpd > Succeed Apr 22 21:40:07 vmapache01 clurgmgrd[2331]: Service service:webby is disabled Apr 22 21:40:07 vmapache01 clurgmgrd[2331]: Starting disabled service service:webby Apr 22 21:40:08 vmapache01 clurgmgrd: [2331]: Adding IPv4 address 172.19.52.120/24 to eth0 Apr 22 21:40:09 vmapache01 clurgmgrd: [2331]: Starting Service apache:httpd Apr 22 21:40:09 vmapache01 clurgmgrd[2331]: Service service:webby started Apr 22 21:43:29 vmapache01 qdiskd[5855]: Quorum Daemon Initializing Apr 22 21:43:30 vmapache01 qdiskd[5855]: Heuristic: 'ping -c1 -t1 172.19.52.119' UP Apr 22 21:43:49 vmapache01 qdiskd[5855]: Initial score 1/1 Apr 22 21:43:49 vmapache01 qdiskd[5855]: Initialization complete Apr 22 21:43:49 vmapache01 openais[2189]: [CMAN ] quorum device registered Apr 22 21:43:49 vmapache01 qdiskd[5855]: Score sufficient for master operation (1/1; required=1); upgrading Apr 22 21:44:13 vmapache01 qdiskd[5855]: Assuming master role Apr 22 21:47:31 vmapache01 clurgmgrd[2331]: Stopping service service:webby Apr 22 21:47:31 vmapache01 clurgmgrd: [2331]: Stopping Service apache:httpd Apr 22 21:47:33 vmapache01 clurgmgrd: [2331]: Stopping Service apache:httpd > Failed - Application Is Still Running Apr 22 21:47:33 vmapache01 clurgmgrd: [2331]: Stopping Service apache:httpd > Failed Apr 22 21:47:33 vmapache01 clurgmgrd[2331]: stop on apache "httpd" returned 1 (generic error) Apr 22 21:47:33 vmapache01 clurgmgrd: [2331]: Removing IPv4 address 172.19.52.120/24 from eth0 Apr 22 21:47:43 vmapache01 clurgmgrd[2331]: #12: RG service:webby failed to stop; intervention required Apr 22 21:47:43 vmapache01 clurgmgrd[2331]: Service service:webby is failed Apr 22 21:47:43 vmapache01 clurgmgrd[2331]: #70: Failed to relocate service:webby; restarting locally Apr 22 21:47:43 vmapache01 clurgmgrd[2331]: #43: Service service:webby has failed; can not start. Apr 22 21:47:43 vmapache01 clurgmgrd[2331]: #2: Service service:webby returned failure code. Last Owner: 172.19.52.121 Apr 22 21:47:43 vmapache01 clurgmgrd[2331]: #4: Administrator intervention required. 
Apr 22 21:50:31 vmapache01 clurgmgrd[2331]: Stopping service service:webby Apr 22 21:50:31 vmapache01 clurgmgrd: [2331]: Stopping Service apache:httpd Apr 22 21:50:31 vmapache01 clurgmgrd: [2331]: Checking Existence Of File /var/run/cluster/apache/apache:httpd.pid [apache:httpd] > Failed - File Doesn't Exist Apr 22 21:50:31 vmapache01 clurgmgrd: [2331]: Stopping Service apache:httpd > Succeed Apr 22 21:50:31 vmapache01 clurgmgrd[2331]: Service service:webby is disabled Apr 22 21:50:31 vmapache01 clurgmgrd[2331]: Starting disabled service service:webby Apr 22 21:50:31 vmapache01 clurgmgrd: [2331]: Adding IPv4 address 172.19.52.120/24 to eth0 Apr 22 21:50:32 vmapache01 clurgmgrd: [2331]: Starting Service apache:httpd Apr 22 21:50:33 vmapache01 clurgmgrd[2331]: Service service:webby started Apr 22 21:50:50 vmapache01 clurgmgrd[2331]: Stopping service service:webby Apr 22 21:50:51 vmapache01 clurgmgrd: [2331]: Stopping Service apache:httpd Apr 22 21:50:52 vmapache01 clurgmgrd: [2331]: Stopping Service apache:httpd > Failed - Application Is Still Running Apr 22 21:50:52 vmapache01 clurgmgrd: [2331]: Stopping Service apache:httpd > Failed Apr 22 21:50:52 vmapache01 clurgmgrd[2331]: stop on apache "httpd" returned 1 (generic error) Apr 22 21:50:52 vmapache01 clurgmgrd: [2331]: Removing IPv4 address 172.19.52.120/24 from eth0 Apr 22 21:51:02 vmapache01 clurgmgrd[2331]: #12: RG service:webby failed to stop; intervention required Apr 22 21:51:02 vmapache01 clurgmgrd[2331]: Service service:webby is failed Apr 22 21:51:02 vmapache01 clurgmgrd[2331]: #70: Failed to relocate service:webby; restarting locally Apr 22 21:51:02 vmapache01 clurgmgrd[2331]: #43: Service service:webby has failed; can not start. Apr 22 21:51:02 vmapache01 clurgmgrd[2331]: #2: Service service:webby returned failure code. Last Owner: 172.19.52.121 Apr 22 21:51:02 vmapache01 clurgmgrd[2331]: #4: Administrator intervention required. Apr 22 21:52:41 vmapache01 clurgmgrd[2331]: Stopping service service:webby Apr 22 21:52:41 vmapache01 clurgmgrd: [2331]: Stopping Service apache:httpd Apr 22 21:52:41 vmapache01 clurgmgrd: [2331]: Checking Existence Of File /var/run/cluster/apache/apache:httpd.pid [apache:httpd] > Failed - File Doesn't Exist If you see something wrong let me know , Any help or ideas will be appreciated. Best regards, ----------------------------------------- Carlos Vermejo Ruiz ------------------------------------------- -------------- next part -------------- An HTML attachment was scrubbed... URL: From cvermejo at softwarelibreandino.com Sun Apr 25 15:41:02 2010 From: cvermejo at softwarelibreandino.com (Carlos VERMEJO RUIZ) Date: Sun, 25 Apr 2010 10:41:02 -0500 (PET) Subject: [Linux-cluster] Problem with service migration with xen domU on diferent dom0 with redhat 5.4 In-Reply-To: <11588742.24813.1272210008360.JavaMail.root@zimbra.softwarelibreandino.com> Message-ID: <25461286.24816.1272210062621.JavaMail.root@zimbra.softwarelibreandino.com> Almost solved: I double check my multicast traffic and found no multicast traffic could pass from server1 to server2, I corrected this changing my host table with the node names pointing to eth3 (the interface that is interconnecting with a crossover cable both machines) and distributing it between domUs and dom0s. I checked multicast communications with "nc -u -vvn -z 5405" . Now both nodes can see their status properly. node2 can see the services that are running on node1, before this they could not see. 
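One caveat on the "nc -u -vvn -z" check: netcat only exercises unicast UDP to that port, so it can succeed while multicast between the hosts is still blocked (IGMP snooping on the switch, firewall rules, or the wrong interface). A small sender/receiver pair is a more direct test of the multicast path itself; this is only an illustrative sketch, using the 225.0.0.1 group from this thread and an arbitrary port:

    # mcast_check.py -- run "receive" on one node, then "send" on the other
    import socket, struct, sys

    GROUP, PORT = "225.0.0.1", 12345

    if sys.argv[1] == "receive":
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        s.bind(("", PORT))
        mreq = struct.pack("4sl", socket.inet_aton(GROUP), socket.INADDR_ANY)
        s.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
        print(s.recvfrom(1024))   # blocks until a datagram arrives
    else:
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
        s.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 2)
        # to force a specific interface, also set IP_MULTICAST_IF to that
        # interface's local address with socket.inet_aton(...)
        s.sendto(b"hello", (GROUP, PORT))

If the receiver never prints anything, the multicast group is not reaching it over that path, whatever nc reports.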
Another thing I did was to change the keys, now I am using the same key for domUs and dom0s decause fencing was not working. In operations, when I turn off vmapache1(node1), vmapache2(node2) detects it is offline and starts the service on its machine. When node1 comes up the service do not fall back, in my case this is desirable but I tested with fallback enabled and it did not worked. Also migration does not worked and fencing when I send the action to reboot or off, it gaves me a successful answer but it did not turn off or reboots the virtual machine vmapache1. But on dom 0 I found this message: Domain-0 00000000-0000-0000-0000-000000000000 00001 00001 vmapache1 a41132aa-6dc3-137c-e4a7-57c31ba5208a 00001 00002 vmjboss1 11f0d6c1-8c51-a792-67a8-807ceaa7157b 00001 00002 Request to fence: vmapache1 vmapache1 is running locally Plain TCP request Failed to call back Could call back for fence request: Bad file descriptor Domain UUID Owner State ------ ---- ----- ----- Domain-0 00000000-0000-0000-0000-000000000000 00001 00001 vmapache1 a41132aa-6dc3-137c-e4a7-57c31ba5208a 00001 00002 vmjboss1 11f0d6c1-8c51-a792-67a8-807ceaa7157b 00001 00002 Domain UUID Owner State ------ ---- ----- ----- Domain-0 00000000-0000-0000-0000-000000000000 00001 00001 vmapache1 a41132aa-6dc3-137c-e4a7-57c31ba5208a 00001 00002 vmjboss1 11f0d6c1-8c51-a792-67a8-807ceaa7157b 00001 00002 Request to fence: vmapache1 vmapache1 is running locally Plain TCP request Failed to call back Could call back for fence request: Bad file descriptor Domain UUID Owner State ------ ---- ----- ----- Domain-0 00000000-0000-0000-0000-000000000000 00001 00001 vmapache1 a41132aa-6dc3-137c-e4a7-57c31ba5208a 00001 00002 vmjboss1 11f0d6c1-8c51-a792-67a8-807ceaa7157b 00001 00002 Q. This bad file descriptor coul be some error on node name of service, How can I check this name? 
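On the question of checking the name: the name the cluster asks fence_xvmd to act on has to match a domain name (or UUID) that the hosting dom0 actually knows about, so the quickest check is to compare the two ends by hand. A sketch of what to look at, using the names and addresses already used in this thread:

    # on each dom0: the domain names and UUIDs fence_xvmd can see
    xm list
    virsh list --all
    # on a domU: what name/UUID the fence device entries reference
    grep -i fence /etc/cluster/cluster.conf
    # manual test against one specific name, as done earlier in the thread
    fence_xvm -a 225.0.0.1 -H vmapache1 -o null -ddd

Since the cluster.conf pasted above was stripped by the list archiver, it is worth re-checking that the domain attribute on each fence entry matches the xm/virsh output exactly, including case.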
Also on the logs I found this: Apr 24 22:15:38 vmapache01 clurgmgrd: [1842]: Stopping Service apache:web1 Apr 24 22:15:38 vmapache01 clurgmgrd: [1842]: Checking Existence Of File /var/run/cluster/apache/apache:web1.p id [apache:web1] > Failed - File Doesn't Exist Apr 24 22:15:38 vmapache01 clurgmgrd: [1842]: Stopping Service apache:web1 > Succeed Apr 24 22:15:38 vmapache01 clurgmgrd[1842]: Services Initialized Apr 24 22:15:38 vmapache01 clurgmgrd[1842]: State change: Local UP Apr 24 22:15:38 vmapache01 clurgmgrd[1842]: State change: vmapache2.bogusdomain.com UP Apr 24 22:15:43 vmapache01 clurgmgrd[1842]: Starting stopped service service:web-scs Apr 24 22:15:43 vmapache01 clurgmgrd: [1842]: Adding IPv4 address 172.19.52.120/24 to eth0 Apr 24 22:15:45 vmapache01 clurgmgrd: [1842]: Starting Service apache:web1 Apr 24 22:15:45 vmapache01 clurgmgrd[1842]: Service service:web-scs started Apr 24 22:17:56 vmapache01 clurgmgrd[1842]: Stopping service service:web-scs Apr 24 22:17:56 vmapache01 clurgmgrd: [1842]: Stopping Service apache:web1 Apr 24 22:17:57 vmapache01 clurgmgrd: [1842]: Stopping Service apache:web1 > Failed - Application Is Still Run ning Apr 24 22:17:57 vmapache01 clurgmgrd: [1842]: Stopping Service apache:web1 > Failed Apr 24 22:17:57 vmapache01 clurgmgrd[1842]: stop on apache "web1" returned 1 (generic error) Apr 24 22:17:57 vmapache01 clurgmgrd: [1842]: Removing IPv4 address 172.19.52.120/24 from eth0 Apr 24 22:18:07 vmapache01 clurgmgrd[1842]: #12: RG service:web-scs failed to stop; intervention required Apr 24 22:18:07 vmapache01 clurgmgrd[1842]: Service service:web-scs is failed Apr 24 22:18:08 vmapache01 clurgmgrd[1842]: #70: Failed to relocate service:web-scs; restarting locally Apr 24 22:18:08 vmapache01 clurgmgrd[1842]: #43: Service service:web-scs has failed; can not start. Apr 24 22:18:08 vmapache01 clurgmgrd[1842]: #2: Service service:web-scs returned failure code. Last Owner: vmapache1.alignet.com Apr 24 22:18:08 vmapache01 clurgmgrd[1842]: #4: Administrator intervention required. I must add that in my httpd service I use port 443 with ssl configured with valid digital ssl certs pointing to floating IP and DNS domain registered, and also module modjk configured for load balancing two jboss virtual machines. The configuration file for ssl module has been hardened also. Q. Do I have to change cluster script for service apache. in order to shutdown service properly? any ideas? ----------------------------------------- Carlos Vermejo Ruiz ------------------------------------------- ----- Mensaje original ----- De: "Carlos VERMEJO RUIZ" Para: linux-cluster at redhat.com Enviados: Viernes, 23 de Abril 2010 22:41:35 Asunto: Re: Problem with service migration with xen domU on diferent dom0 with redhat 5.4 There are two things that I would try. The first one is that the problem seems that multicast traffic is not being propagated well between nodes. One point that I did not mention is that all trafic is going through firewalls and switches, though I open tcp and udp traffic I am not so sure for multicast traffic. I made the test with fence_xvm -a 225.0.0.1 -I eth1 -H vmapache1 -ddd -o null but when I try through luci interface It did not work. Also multicast interfaces are eth1 on domUs and eth3 on dom0s perhaps som point on my config files has something wrong or I have to configure the multicast traffic on the linux interface. 
The other point is in reference to the keys, I am using tho different keys, one for domU vmapache1 and dom= node1 in the host server1 and another key for the domU vmapache2 and dom= node2 in the host server2. Is it necesary to share a key between domUs? Could I use one key for domUs and dom0s. The third point is to check the network configuration, do I have to configure something on the switches, what about firewall and routers? my domUs have two phisical networks one connected to a switch that is attending to the public and the other one is connected through a crossover cable between domUs. Also on dom0s I have two active interfaces one conected to a switch to attend internal network and the other one eth3 that are using the same physical interface with the crossover cable and are on the same network number for domUs. Any comments will be appreciated. Best regards, Carlos Vermejo Ruiz Dear Sir / Madame: I am implementing a two node cluster on domU providing apache service "webby" we have them on different dom0. This apache service also are load balancing a JBoss virtual machines but them are not part of the cluster, also I have configured a virtual machine with iscsi target to provide a shared quorum disk so our quorum is 2 votes from 3. The first thing that I noticed is that when I finished configuring the cluster with luci the service webby does not start automatically. I have to enable the service and them it started. Initially I had a problem with the xvm_fence. When I configured in dom0 an individual cluster and start cman on dom0 it used to start fence_xvmd but in one place I read that dom0 had to be in anothe cluster so I created anothe cluster with both dom0, but now they are not starting the fence_xvmd. That is why I am using fence_xvmd as a standalone with this config: fence_xvmd -LX -a 225.0.0.1 -I eth3 When I try from the domU to fence from command line it worked I use the command: fence_xvm -a 225.0.0.1 -I eth1 -H frederick -ddd -o null and produced: Waiting for connection from XVM host daemon. Issuing TCP challenge Responding to TCP challenge TCP Exchange + Authentication done... Waiting for return value from XVM host Remote: Operation failed In luci I configured the multicast address 225.0.0.1 and interface eth1 for cluster on domU and multicast address 225.0.0.1 and interface eth3 on dom0 by CLI Perhaps the problem I have is for the keys. I use one key that is shared between dom0 and domU on server1 and another key that is also shared between dom0 and domU on server2. Also on server1 I copied the key fence_xvm.key as fence_xvm-host1.key and distibuted to the other domU and both dom0. on server2 I copied the key fence_xvm.key as fence_xvm-host2.key and distibuted to the the other domU and both dom0 My cluster config is the following: Another strange thing is when I do a clustat on vmapache1 it recognizes the webby service as started on vmapache1and both nodes and quorumdisk online but on vmapache clustat only shows both nodes and the quorumdisk online, nothing abour any service. This is the log when I tried to make a migration: Apr 22 21:39:14 vmapache01 ccsd[2183]: Update of cluster.conf complete (version 51 -> 52). Apr 22 21:39:23 vmapache01 clurgmgrd[2331]: Reconfiguring Apr 22 21:39:23 vmapache01 clurgmgrd[2331]: Loading Service Data Apr 22 21:39:23 vmapache01 clurgmgrd[2331]: Applying new configuration #52 Apr 22 21:39:23 vmapache01 clurgmgrd[2331]: Stopping changed resources. Apr 22 21:39:23 vmapache01 clurgmgrd[2331]: Restarting changed resources. 
Apr 22 21:39:23 vmapache01 clurgmgrd[2331]: Starting changed resources. Apr 22 21:40:07 vmapache01 clurgmgrd[2331]: Stopping service service:webby Apr 22 21:40:07 vmapache01 clurgmgrd: [2331]: Stopping Service apache:httpd Apr 22 21:40:07 vmapache01 clurgmgrd: [2331]: Checking Existence Of File /var/run/cluster/apache/apache:httpd.pid [apache:httpd] > Failed - File Doesn't Exist Apr 22 21:40:07 vmapache01 clurgmgrd: [2331]: Stopping Service apache:httpd > Succeed Apr 22 21:40:07 vmapache01 clurgmgrd[2331]: Service service:webby is disabled Apr 22 21:40:07 vmapache01 clurgmgrd[2331]: Starting disabled service service:webby Apr 22 21:40:08 vmapache01 clurgmgrd: [2331]: Adding IPv4 address 172.19.52.120/24 to eth0 Apr 22 21:40:09 vmapache01 clurgmgrd: [2331]: Starting Service apache:httpd Apr 22 21:40:09 vmapache01 clurgmgrd[2331]: Service service:webby started Apr 22 21:43:29 vmapache01 qdiskd[5855]: Quorum Daemon Initializing Apr 22 21:43:30 vmapache01 qdiskd[5855]: Heuristic: 'ping -c1 -t1 172.19.52.119' UP Apr 22 21:43:49 vmapache01 qdiskd[5855]: Initial score 1/1 Apr 22 21:43:49 vmapache01 qdiskd[5855]: Initialization complete Apr 22 21:43:49 vmapache01 openais[2189]: [CMAN ] quorum device registered Apr 22 21:43:49 vmapache01 qdiskd[5855]: Score sufficient for master operation (1/1; required=1); upgrading Apr 22 21:44:13 vmapache01 qdiskd[5855]: Assuming master role Apr 22 21:47:31 vmapache01 clurgmgrd[2331]: Stopping service service:webby Apr 22 21:47:31 vmapache01 clurgmgrd: [2331]: Stopping Service apache:httpd Apr 22 21:47:33 vmapache01 clurgmgrd: [2331]: Stopping Service apache:httpd > Failed - Application Is Still Running Apr 22 21:47:33 vmapache01 clurgmgrd: [2331]: Stopping Service apache:httpd > Failed Apr 22 21:47:33 vmapache01 clurgmgrd[2331]: stop on apache "httpd" returned 1 (generic error) Apr 22 21:47:33 vmapache01 clurgmgrd: [2331]: Removing IPv4 address 172.19.52.120/24 from eth0 Apr 22 21:47:43 vmapache01 clurgmgrd[2331]: #12: RG service:webby failed to stop; intervention required Apr 22 21:47:43 vmapache01 clurgmgrd[2331]: Service service:webby is failed Apr 22 21:47:43 vmapache01 clurgmgrd[2331]: #70: Failed to relocate service:webby; restarting locally Apr 22 21:47:43 vmapache01 clurgmgrd[2331]: #43: Service service:webby has failed; can not start. Apr 22 21:47:43 vmapache01 clurgmgrd[2331]: #2: Service service:webby returned failure code. Last Owner: 172.19.52.121 Apr 22 21:47:43 vmapache01 clurgmgrd[2331]: #4: Administrator intervention required. 
Apr 22 21:50:31 vmapache01 clurgmgrd[2331]: Stopping service service:webby Apr 22 21:50:31 vmapache01 clurgmgrd: [2331]: Stopping Service apache:httpd Apr 22 21:50:31 vmapache01 clurgmgrd: [2331]: Checking Existence Of File /var/run/cluster/apache/apache:httpd.pid [apache:httpd] > Failed - File Doesn't Exist Apr 22 21:50:31 vmapache01 clurgmgrd: [2331]: Stopping Service apache:httpd > Succeed Apr 22 21:50:31 vmapache01 clurgmgrd[2331]: Service service:webby is disabled Apr 22 21:50:31 vmapache01 clurgmgrd[2331]: Starting disabled service service:webby Apr 22 21:50:31 vmapache01 clurgmgrd: [2331]: Adding IPv4 address 172.19.52.120/24 to eth0 Apr 22 21:50:32 vmapache01 clurgmgrd: [2331]: Starting Service apache:httpd Apr 22 21:50:33 vmapache01 clurgmgrd[2331]: Service service:webby started Apr 22 21:50:50 vmapache01 clurgmgrd[2331]: Stopping service service:webby Apr 22 21:50:51 vmapache01 clurgmgrd: [2331]: Stopping Service apache:httpd Apr 22 21:50:52 vmapache01 clurgmgrd: [2331]: Stopping Service apache:httpd > Failed - Application Is Still Running Apr 22 21:50:52 vmapache01 clurgmgrd: [2331]: Stopping Service apache:httpd > Failed Apr 22 21:50:52 vmapache01 clurgmgrd[2331]: stop on apache "httpd" returned 1 (generic error) Apr 22 21:50:52 vmapache01 clurgmgrd: [2331]: Removing IPv4 address 172.19.52.120/24 from eth0 Apr 22 21:51:02 vmapache01 clurgmgrd[2331]: #12: RG service:webby failed to stop; intervention required Apr 22 21:51:02 vmapache01 clurgmgrd[2331]: Service service:webby is failed Apr 22 21:51:02 vmapache01 clurgmgrd[2331]: #70: Failed to relocate service:webby; restarting locally Apr 22 21:51:02 vmapache01 clurgmgrd[2331]: #43: Service service:webby has failed; can not start. Apr 22 21:51:02 vmapache01 clurgmgrd[2331]: #2: Service service:webby returned failure code. Last Owner: 172.19.52.121 Apr 22 21:51:02 vmapache01 clurgmgrd[2331]: #4: Administrator intervention required. Apr 22 21:52:41 vmapache01 clurgmgrd[2331]: Stopping service service:webby Apr 22 21:52:41 vmapache01 clurgmgrd: [2331]: Stopping Service apache:httpd Apr 22 21:52:41 vmapache01 clurgmgrd: [2331]: Checking Existence Of File /var/run/cluster/apache/apache:httpd.pid [apache:httpd] > Failed - File Doesn't Exist If you see something wrong let me know , Any help or ideas will be appreciated. Best regards, ----------------------------------------- Carlos Vermejo Ruiz ------------------------------------------- -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From cvermejo at softwarelibreandino.com Sun Apr 25 15:57:47 2010 From: cvermejo at softwarelibreandino.com (Carlos VERMEJO RUIZ) Date: Sun, 25 Apr 2010 10:57:47 -0500 (PET) Subject: [Linux-cluster] Problem with service migration with xen domU on diferent dom0 with redhat 5.4 In-Reply-To: <25461286.24816.1272210062621.JavaMail.root@zimbra.softwarelibreandino.com> Message-ID: <23389327.24819.1272211067096.JavaMail.root@zimbra.softwarelibreandino.com> Some corrections: I changed the domain names to bogusdomain.com ----- Original message ----- From: "Carlos VERMEJO RUIZ" To: linux-cluster at redhat.com Sent: Sunday, 25 April 2010 10:41:02 Subject: Re: Problem with service migration with xen domU on diferent dom0 with redhat 5.4 Almost solved: I double-checked my multicast traffic and found that no multicast traffic could pass from server1 to server2. I corrected this by changing my hosts table so that the node names point to eth3 (the interface that interconnects both machines with a crossover cable) and distributing it to the domUs and dom0s. I checked multicast communications with "nc -u -vvn -z 5405". Now both nodes can see their status properly; node2 can see the services that are running on node1, which it could not see before. Another thing I did was to change the keys: now I am using the same key for the domUs and dom0s, because fencing was not working. In operation, when I turn off vmapache1 (node1), vmapache2 (node2) detects it is offline and starts the service on its machine. When node1 comes back up the service does not fail back; in my case this is desirable, but I tested with failback enabled and it did not work either. Migration also does not work, and for fencing, when I send the reboot or off action it gives me a successful answer but it does not turn off or reboot the virtual machine vmapache1. But on dom0 I found this message: Domain-0 00000000-0000-0000-0000-000000000000 00001 00001 vmapache1 a41132aa-6dc3-137c-e4a7-57c31ba5208a 00001 00002 vmjboss1 11f0d6c1-8c51-a792-67a8-807ceaa7157b 00001 00002 Request to fence: vmapache1 vmapache1 is running locally Plain TCP request Failed to call back Could call back for fence request: Bad file descriptor Domain UUID Owner State ------ ---- ----- ----- Domain-0 00000000-0000-0000-0000-000000000000 00001 00001 vmapache1 a41132aa-6dc3-137c-e4a7-57c31ba5208a 00001 00002 vmjboss1 11f0d6c1-8c51-a792-67a8-807ceaa7157b 00001 00002 Domain UUID Owner State ------ ---- ----- ----- Domain-0 00000000-0000-0000-0000-000000000000 00001 00001 vmapache1 a41132aa-6dc3-137c-e4a7-57c31ba5208a 00001 00002 vmjboss1 11f0d6c1-8c51-a792-67a8-807ceaa7157b 00001 00002 Request to fence: vmapache1 vmapache1 is running locally Plain TCP request Failed to call back Could call back for fence request: Bad file descriptor Domain UUID Owner State ------ ---- ----- ----- Domain-0 00000000-0000-0000-0000-000000000000 00001 00001 vmapache1 a41132aa-6dc3-137c-e4a7-57c31ba5208a 00001 00002 vmjboss1 11f0d6c1-8c51-a792-67a8-807ceaa7157b 00001 00002 Q. Could this "Bad file descriptor" be caused by an error in the node name of the service? How can I check this name? 
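One way to approach the question just above: fence_xvm asks fence_xvmd to act on a Xen domain by name, so the name given with -H (and the domain name configured for the domU's fence device in cluster.conf) has to match the dom0's own view of the guest exactly. A rough sketch of the checks, reusing the host and guest names from this thread and the default key path, both of which are assumptions here:

# On each dom0: confirm the exact domain name and UUID that Xen reports
xm list
virsh list --all          # the libvirt view, if libvirtd is running
virsh domuuid vmapache1   # should match the UUID printed by fence_xvmd above

# If a single shared key is wanted, generate it once and copy the same
# file to every dom0 and domU (hostnames here are placeholders):
dd if=/dev/urandom of=/etc/cluster/fence_xvm.key bs=4096 count=1
for h in server1 server2 vmapache1 vmapache2; do
    scp /etc/cluster/fence_xvm.key root@$h:/etc/cluster/
done

# Restart the standalone daemon on each dom0 so it re-reads the key
killall fence_xvmd
fence_xvmd -LX -a 225.0.0.1 -I eth3

fence_xvm and fence_xvmd fall back to /etc/cluster/fence_xvm.key when no -k option is given, which is why the same path is used everywhere in this sketch.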
Also in the logs I found this: Apr 24 22:15:38 vmapache01 clurgmgrd: [1842]: Stopping Service apache:web1 Apr 24 22:15:38 vmapache01 clurgmgrd: [1842]: Checking Existence Of File /var/run/cluster/apache/apache:web1.pid [apache:web1] > Failed - File Doesn't Exist Apr 24 22:15:38 vmapache01 clurgmgrd: [1842]: Stopping Service apache:web1 > Succeed Apr 24 22:15:38 vmapache01 clurgmgrd[1842]: Services Initialized Apr 24 22:15:38 vmapache01 clurgmgrd[1842]: State change: Local UP Apr 24 22:15:38 vmapache01 clurgmgrd[1842]: State change: vmapache2.bogusdomain.com UP Apr 24 22:15:43 vmapache01 clurgmgrd[1842]: Starting stopped service service:web-scs Apr 24 22:15:43 vmapache01 clurgmgrd: [1842]: Adding IPv4 address 172.19.52.120/24 to eth0 Apr 24 22:15:45 vmapache01 clurgmgrd: [1842]: Starting Service apache:web1 Apr 24 22:15:45 vmapache01 clurgmgrd[1842]: Service service:web-scs started Apr 24 22:17:56 vmapache01 clurgmgrd[1842]: Stopping service service:web-scs Apr 24 22:17:56 vmapache01 clurgmgrd: [1842]: Stopping Service apache:web1 Apr 24 22:17:57 vmapache01 clurgmgrd: [1842]: Stopping Service apache:web1 > Failed - Application Is Still Running Apr 24 22:17:57 vmapache01 clurgmgrd: [1842]: Stopping Service apache:web1 > Failed Apr 24 22:17:57 vmapache01 clurgmgrd[1842]: stop on apache "web1" returned 1 (generic error) Apr 24 22:17:57 vmapache01 clurgmgrd: [1842]: Removing IPv4 address 172.19.52.120/24 from eth0 Apr 24 22:18:07 vmapache01 clurgmgrd[1842]: #12: RG service:web-scs failed to stop; intervention required Apr 24 22:18:07 vmapache01 clurgmgrd[1842]: Service service:web-scs is failed Apr 24 22:18:08 vmapache01 clurgmgrd[1842]: #70: Failed to relocate service:web-scs; restarting locally Apr 24 22:18:08 vmapache01 clurgmgrd[1842]: #43: Service service:web-scs has failed; can not start. Apr 24 22:18:08 vmapache01 clurgmgrd[1842]: #2: Service service:web-scs returned failure code. Last Owner: vmapache1.bogusdomain.com Apr 24 22:18:08 vmapache01 clurgmgrd[1842]: #4: Administrator intervention required. I must add that in my httpd service I use port 443 with SSL, configured with valid digital SSL certificates pointing to the floating IP and a registered DNS domain, and also the apache module mod_jk configured for load balancing two JBoss virtual machines. The configuration file for the SSL module has also been hardened. Q. Do I have to change the cluster script for the apache service in order to shut down the service properly? Any ideas? ----------------------------------------- Carlos Vermejo Ruiz ------------------------------------------- ----- Original message ----- From: "Carlos VERMEJO RUIZ" To: linux-cluster at redhat.com Sent: Friday, 23 April 2010 22:41:35 Subject: Re: Problem with service migration with xen domU on diferent dom0 with redhat 5.4 There are two things that I would try. The first one is that the problem seems to be that multicast traffic is not being propagated well between the nodes. One point that I did not mention is that all traffic goes through firewalls and switches; although I opened TCP and UDP traffic, I am not so sure about multicast traffic. I made the test with fence_xvm -a 225.0.0.1 -I eth1 -H vmapache1 -ddd -o null but when I try it through the luci interface it does not work. Also, the multicast interfaces are eth1 on the domUs and eth3 on the dom0s; perhaps something in my config files is wrong, or I have to configure the multicast traffic on the Linux interface. 
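Since the paragraph above suspects the firewalls between the nodes, here is a hedged sketch of the kind of firewall and multicast checks involved, assuming the interfaces and multicast group already quoted in this thread (eth3 on the dom0s, eth1 on the domUs, group 225.0.0.1) and the default openais ports:

# On every dom0 and domU: let cluster traffic and multicast through
iptables -I INPUT -p udp --dport 5404:5405 -j ACCEPT
iptables -I INPUT -p igmp -j ACCEPT
iptables -I INPUT -d 225.0.0.1 -j ACCEPT
service iptables save

# Make sure multicast is sent and received on the crossover interface
ip route add 224.0.0.0/4 dev eth3     # use eth1 on the domUs
ip maddr show dev eth3                # groups currently joined on eth3
tcpdump -i eth3 -n host 225.0.0.1     # watch for fence_xvm packets arriving

Any switch in the path also needs IGMP snooping either disabled or paired with an IGMP querier, otherwise the multicast group can be silently pruned.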
The other point is in reference to the keys: I am using two different keys, one for domU vmapache1 and dom0 node1 on host server1, and another key for domU vmapache2 and dom0 node2 on host server2. Is it necessary to share a key between the domUs? Could I use one key for both the domUs and the dom0s? The third point is to check the network configuration: do I have to configure something on the switches, and what about the firewalls and routers? My domUs have two physical networks, one connected to a switch that serves the public and the other connected through a crossover cable between the domUs. Also, on the dom0s I have two active interfaces: one connected to a switch to serve the internal network, and eth3, which uses the same physical interface as the crossover cable and is on the same network number as the domUs. Any comments will be appreciated. Best regards, Carlos Vermejo Ruiz Dear Sir / Madam: I am implementing a two-node cluster on domUs providing the apache service "webby"; we have them on different dom0s. This apache service also load balances JBoss virtual machines, but they are not part of the cluster. I have also configured a virtual machine with an iSCSI target to provide a shared quorum disk, so our quorum is 2 votes out of 3. The first thing that I noticed is that when I finished configuring the cluster with luci, the service webby did not start automatically; I had to enable the service and then it started. Initially I had a problem with fence_xvm. When I configured an individual cluster on dom0 and started cman on dom0 it used to start fence_xvmd, but in one place I read that the dom0s had to be in another cluster, so I created another cluster with both dom0s; now they are not starting fence_xvmd. That is why I am running fence_xvmd standalone with this config: fence_xvmd -LX -a 225.0.0.1 -I eth3 When I try to fence from the command line on the domU, I use the command: fence_xvm -a 225.0.0.1 -I eth1 -H frederick -ddd -o null and it produced: Waiting for connection from XVM host daemon. Issuing TCP challenge Responding to TCP challenge TCP Exchange + Authentication done... Waiting for return value from XVM host Remote: Operation failed In luci I configured the multicast address 225.0.0.1 and interface eth1 for the cluster on the domUs, and multicast address 225.0.0.1 and interface eth3 on the dom0s by CLI. Perhaps the problem I have is with the keys. I use one key that is shared between the dom0 and domU on server1 and another key that is shared between the dom0 and domU on server2. Also, on server1 I copied the key fence_xvm.key as fence_xvm-host1.key and distributed it to the other domU and both dom0s; on server2 I copied the key fence_xvm.key as fence_xvm-host2.key and distributed it to the other domU and both dom0s. My cluster config is the following: Another strange thing is that when I do a clustat on vmapache1 it shows the webby service as started on vmapache1 and both nodes and the quorum disk online, but on the other node clustat only shows both nodes and the quorum disk online, nothing about any service. This is the log from when I tried to make a migration: Apr 22 21:39:14 vmapache01 ccsd[2183]: Update of cluster.conf complete (version 51 -> 52). Apr 22 21:39:23 vmapache01 clurgmgrd[2331]: Reconfiguring Apr 22 21:39:23 vmapache01 clurgmgrd[2331]: Loading Service Data Apr 22 21:39:23 vmapache01 clurgmgrd[2331]: Applying new configuration #52 Apr 22 21:39:23 vmapache01 clurgmgrd[2331]: Stopping changed resources. Apr 22 21:39:23 vmapache01 clurgmgrd[2331]: Restarting changed resources. 
Apr 22 21:39:23 vmapache01 clurgmgrd[2331]: Starting changed resources. Apr 22 21:40:07 vmapache01 clurgmgrd[2331]: Stopping service service:webby Apr 22 21:40:07 vmapache01 clurgmgrd: [2331]: Stopping Service apache:httpd Apr 22 21:40:07 vmapache01 clurgmgrd: [2331]: Checking Existence Of File /var/run/cluster/apache/apache:httpd.pid [apache:httpd] > Failed - File Doesn't Exist Apr 22 21:40:07 vmapache01 clurgmgrd: [2331]: Stopping Service apache:httpd > Succeed Apr 22 21:40:07 vmapache01 clurgmgrd[2331]: Service service:webby is disabled Apr 22 21:40:07 vmapache01 clurgmgrd[2331]: Starting disabled service service:webby Apr 22 21:40:08 vmapache01 clurgmgrd: [2331]: Adding IPv4 address 172.19.52.120/24 to eth0 Apr 22 21:40:09 vmapache01 clurgmgrd: [2331]: Starting Service apache:httpd Apr 22 21:40:09 vmapache01 clurgmgrd[2331]: Service service:webby started Apr 22 21:43:29 vmapache01 qdiskd[5855]: Quorum Daemon Initializing Apr 22 21:43:30 vmapache01 qdiskd[5855]: Heuristic: 'ping -c1 -t1 172.19.52.119' UP Apr 22 21:43:49 vmapache01 qdiskd[5855]: Initial score 1/1 Apr 22 21:43:49 vmapache01 qdiskd[5855]: Initialization complete Apr 22 21:43:49 vmapache01 openais[2189]: [CMAN ] quorum device registered Apr 22 21:43:49 vmapache01 qdiskd[5855]: Score sufficient for master operation (1/1; required=1); upgrading Apr 22 21:44:13 vmapache01 qdiskd[5855]: Assuming master role Apr 22 21:47:31 vmapache01 clurgmgrd[2331]: Stopping service service:webby Apr 22 21:47:31 vmapache01 clurgmgrd: [2331]: Stopping Service apache:httpd Apr 22 21:47:33 vmapache01 clurgmgrd: [2331]: Stopping Service apache:httpd > Failed - Application Is Still Running Apr 22 21:47:33 vmapache01 clurgmgrd: [2331]: Stopping Service apache:httpd > Failed Apr 22 21:47:33 vmapache01 clurgmgrd[2331]: stop on apache "httpd" returned 1 (generic error) Apr 22 21:47:33 vmapache01 clurgmgrd: [2331]: Removing IPv4 address 172.19.52.120/24 from eth0 Apr 22 21:47:43 vmapache01 clurgmgrd[2331]: #12: RG service:webby failed to stop; intervention required Apr 22 21:47:43 vmapache01 clurgmgrd[2331]: Service service:webby is failed Apr 22 21:47:43 vmapache01 clurgmgrd[2331]: #70: Failed to relocate service:webby; restarting locally Apr 22 21:47:43 vmapache01 clurgmgrd[2331]: #43: Service service:webby has failed; can not start. Apr 22 21:47:43 vmapache01 clurgmgrd[2331]: #2: Service service:webby returned failure code. Last Owner: 172.19.52.121 Apr 22 21:47:43 vmapache01 clurgmgrd[2331]: #4: Administrator intervention required. 
Apr 22 21:50:31 vmapache01 clurgmgrd[2331]: Stopping service service:webby Apr 22 21:50:31 vmapache01 clurgmgrd: [2331]: Stopping Service apache:httpd Apr 22 21:50:31 vmapache01 clurgmgrd: [2331]: Checking Existence Of File /var/run/cluster/apache/apache:httpd.pid [apache:httpd] > Failed - File Doesn't Exist Apr 22 21:50:31 vmapache01 clurgmgrd: [2331]: Stopping Service apache:httpd > Succeed Apr 22 21:50:31 vmapache01 clurgmgrd[2331]: Service service:webby is disabled Apr 22 21:50:31 vmapache01 clurgmgrd[2331]: Starting disabled service service:webby Apr 22 21:50:31 vmapache01 clurgmgrd: [2331]: Adding IPv4 address 172.19.52.120/24 to eth0 Apr 22 21:50:32 vmapache01 clurgmgrd: [2331]: Starting Service apache:httpd Apr 22 21:50:33 vmapache01 clurgmgrd[2331]: Service service:webby started Apr 22 21:50:50 vmapache01 clurgmgrd[2331]: Stopping service service:webby Apr 22 21:50:51 vmapache01 clurgmgrd: [2331]: Stopping Service apache:httpd Apr 22 21:50:52 vmapache01 clurgmgrd: [2331]: Stopping Service apache:httpd > Failed - Application Is Still Running Apr 22 21:50:52 vmapache01 clurgmgrd: [2331]: Stopping Service apache:httpd > Failed Apr 22 21:50:52 vmapache01 clurgmgrd[2331]: stop on apache "httpd" returned 1 (generic error) Apr 22 21:50:52 vmapache01 clurgmgrd: [2331]: Removing IPv4 address 172.19.52.120/24 from eth0 Apr 22 21:51:02 vmapache01 clurgmgrd[2331]: #12: RG service:webby failed to stop; intervention required Apr 22 21:51:02 vmapache01 clurgmgrd[2331]: Service service:webby is failed Apr 22 21:51:02 vmapache01 clurgmgrd[2331]: #70: Failed to relocate service:webby; restarting locally Apr 22 21:51:02 vmapache01 clurgmgrd[2331]: #43: Service service:webby has failed; can not start. Apr 22 21:51:02 vmapache01 clurgmgrd[2331]: #2: Service service:webby returned failure code. Last Owner: 172.19.52.121 Apr 22 21:51:02 vmapache01 clurgmgrd[2331]: #4: Administrator intervention required. Apr 22 21:52:41 vmapache01 clurgmgrd[2331]: Stopping service service:webby Apr 22 21:52:41 vmapache01 clurgmgrd: [2331]: Stopping Service apache:httpd Apr 22 21:52:41 vmapache01 clurgmgrd: [2331]: Checking Existence Of File /var/run/cluster/apache/apache:httpd.pid [apache:httpd] > Failed - File Doesn't Exist If you see something wrong let me know , Any help or ideas will be appreciated. Best regards, ----------------------------------------- Carlos Vermejo Ruiz ------------------------------------------- -------------- next part -------------- An HTML attachment was scrubbed... URL: From jakov.sosic at srce.hr Sun Apr 25 20:05:00 2010 From: jakov.sosic at srce.hr (Jakov Sosic) Date: Sun, 25 Apr 2010 22:05:00 +0200 Subject: [Linux-cluster] Cluster v2 online adding nodes? Message-ID: <4BD4A06C.4090402@srce.hr> Hi. Can I add or remove a node from cluster by just adding/removing it from cluster.conf? Is that kind of cluster reconfiguration supported without reboots? Also, what about quorum disk... I had a problem with adding a quorum disk without rebooting... Thank you. -- | Jakov Sosic | ICQ: 28410271 | PGP: 0x965CAE2D | ================================================================= | start fighting cancer -> http://www.worldcommunitygrid.org/ | From jeff.sturm at eprize.com Sun Apr 25 23:33:45 2010 From: jeff.sturm at eprize.com (Jeff Sturm) Date: Sun, 25 Apr 2010 19:33:45 -0400 Subject: [Linux-cluster] Cluster v2 online adding nodes? 
In-Reply-To: <4BD4A06C.4090402@srce.hr> References: <4BD4A06C.4090402@srce.hr> Message-ID: <64D0546C5EBBD147B75DE133D798665F055D90A8@hugo.eprize.local> > -----Original Message----- > Can I add or remove a node from cluster by just adding/removing it from > cluster.conf? Is that kind of cluster reconfiguration supported without > reboots? Provided your cluster has more than 2 nodes, this should work fine. (I believe two node clusters are special.) Make sure you bump the version number of cluster.conf after any change. We also run "ccs_tool update" afterwards to propagate the new config to other nodes before starting cman on the new node. If removing a node, we stop it before removing from the config file. Jeff From ccaulfie at redhat.com Mon Apr 26 07:36:26 2010 From: ccaulfie at redhat.com (Christine Caulfield) Date: Mon, 26 Apr 2010 08:36:26 +0100 Subject: [Linux-cluster] Cluster v2 online adding nodes? In-Reply-To: <64D0546C5EBBD147B75DE133D798665F055D90A8@hugo.eprize.local> References: <4BD4A06C.4090402@srce.hr> <64D0546C5EBBD147B75DE133D798665F055D90A8@hugo.eprize.local> Message-ID: <4BD5427A.5010006@redhat.com> On 26/04/10 00:33, Jeff Sturm wrote: >> -----Original Message----- >> Can I add or remove a node from cluster by just adding/removing it > from >> cluster.conf? Is that kind of cluster reconfiguration supported > without >> reboots? > > Provided your cluster has more than 2 nodes, this should work fine. (I > believe two node clusters are special.) > > Make sure you bump the version number of cluster.conf after any change. > We also run "ccs_tool update" afterwards to propagate the new config to > other nodes before starting cman on the new node. If removing a node, > we stop it before removing from the config file. > Yes, you must stop a node before removing it from cluster.conf. Also be aware that removing a node from cluster.conf will not remove it from the "cman_tool" display unless you are running cluster3. Chrissie From mylinuxhalist at gmail.com Mon Apr 26 17:58:16 2010 From: mylinuxhalist at gmail.com (My LinuxHAList) Date: Mon, 26 Apr 2010 13:58:16 -0400 Subject: [Linux-cluster] NFS Failover Message-ID: Hi, NFS setup, 2 servers, stock redhat 5.4. The following is on the SAN: 1) /var/lib/nfs (so that I could preserve locks between the 2 servers) 2) /export/home (home area I export to) 3) /export/shared Setup: 1) HA-LVM (so that only 1 NFS server can see the volume at one time) 2) /export/home 192.168.251.0/255.255.255.0(rw,async,no_root_squash,fsid=4000) 3) Shared IP 4) All NFS dynamic ports are locked down to static ones 5) rpc.statd is started with "-n " 6) RPCNFSDCOUNT=64 The service setup (with the parent-child relationship): - Floating IP |- LVM, FileSystem Mounts (to mount /var/lib/nfs, /export/home) |--- nfslock |----- nfs It seems to be working, with me failing it over several hundred times. The only issue is that after fail-over some clients can stop writing. Clients mount with defaults,async,noatime,proto=udp. The default is hard-mounting and NFSv3. I test with 4 NFS clients and 8 processes per NFS client writing to files while I perform the failover. Sometimes, there are clients that will stop writing -- this is inconsistent with the fact that it's hard-mounted. I've tried clients with redhat 5.4.x and 5.5 kernels with the same results. timeo and retrans changes do not help either. I tried the TCP option and the clients panicked (bugzilla.redhat.com #585269) during fail-over, hence the udp options. I wonder if anyone is seeing the same thing. 
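A quick way to see exactly which NFS clients stall during a failover like the one described above is to have every writer timestamp its progress; a minimal sketch, where the mount point and file names are assumptions:

# Run one copy per client; it appends one timestamp per second.
OUT=/mnt/home/writer-$(hostname)-$$.log
while true; do
    date +%s >> "$OUT" || echo "write failed at $(date)" >&2
    sleep 1
done

# Afterwards, print any gap longer than five seconds:
awk 'NR>1 && $1-prev>5 {print "stall of", $1-prev, "s ending at", $1} {prev=$1}' "$OUT"

A client that really is hard-mounted should show one gap spanning the failover and then continue; a writer that dies without finishing, as reported below, points at client-side behaviour rather than the server relocation itself.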
The annoying thing is that the clients stopping writing only happens sometimes, not every time. The failover completes every time. After the fail-over, the clients can still see the mounted space. I noticed that when a client has issues, rpciod/6 shoots up to 100% for several seconds. My processes that are writing files also shoot to 100%, then die without finishing writing the files. It feels like a bug in the NFS clients, but I'm not certain. I would like to request community help for a second opinion. Thanks. -------------- next part -------------- An HTML attachment was scrubbed... URL: From mylinuxhalist at gmail.com Mon Apr 26 23:59:43 2010 From: mylinuxhalist at gmail.com (My LinuxHAList) Date: Mon, 26 Apr 2010 19:59:43 -0400 Subject: [Linux-cluster] forcefully taking over a service from another node, kdump In-Reply-To: <1271689448.11980.15.camel@localhost.localdomain> References: <1271689448.11980.15.camel@localhost.localdomain> Message-ID: Thanks Lon for the pointer. On Mon, Apr 19, 2010 at 11:04 AM, Lon Hohberger wrote: > On Wed, 2010-04-14 at 14:34 -0400, My LinuxHAList wrote: > > > > > === Working on a new solution === > > > > > > I'm working on a solution for this by a kdump_pre script. > > When node1 panic'ed, before kdumping, it would contact node2 so that > > node2 will attempt to take over the service. > > > > > > At node2, I found running at node1 and issue: > > clusvcadm -r > > > > > > Because of node1's state (it is kdumping), the command just hangs and > > it did not manage to cut down the service downtime. > > > > > > What can I do at node2 to forcefully take over the service from node1 > > after node2 is contacted by node1 at kdump_pre stage ? > > There's a bugzilla open about this -- you should check out > > https://bugzilla.redhat.com/show_bug.cgi?id=461948 > > There's even a design; just no code at this point. > > -- Lon > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From eschneid at uccs.edu Tue Apr 27 01:31:55 2010 From: eschneid at uccs.edu (Eric Schneider) Date: Mon, 26 Apr 2010 19:31:55 -0600 Subject: [Linux-cluster] CentOS 4.8, nfs, and fence every 4 days Message-ID: <015401cae5a9$6c35a850$44a0f8f0$@edu> 2 node CentOS 4.8 cluster on ESX 4 cluster (cluster across boxes) [root at host ~]# uname -a Linux hostname 2.6.9-89.0.19.ELlargesmp 2 GB RAM 2 vCPU 1 200 GB RDM - GFS1 VMware fencing Member Status: Quorate Member Name Status ------ ---- ------ Host1 Online, Local, rgmanager Host2 Online, rgmanager Service Name Owner (Last) State ------- ---- ----- ------ ----- www-http host1 started www-nfs host2 started vhostip-http host2 started vhost-http host2 started [root at host ~]# rpm -qa | grep cman cman-kernel-2.6.9-56.7.el4_8.10 cman-kernel-smp-2.6.9-56.7.el4_8.10 cman-devel-1.0.24-1 cman-kernel-largesmp-2.6.9-56.7.el4_8.10 cman-1.0.24-1 cman-kernheaders-2.6.9-56.7.el4_8.10 /var/log/messages Apr 26 18:45:32 tesla kernel: oom-killer: gfp_mask=0xd0 Apr 26 18:45:32 tesla kernel: Mem-info: Apr 26 18:45:32 tesla kernel: Node 0 DMA per-cpu: Apr 26 18:45:32 tesla kernel: cpu 0 hot: low 2, high 6, batch 1 Apr 26 18:45:32 tesla kernel: cpu 0 cold: low 0, high 2, batch 1 Apr 26 18:45:32 tesla kernel: cpu 1 hot: low 2, high 6, batch 1 Apr 26 18:45:32 tesla kernel: cpu 1 cold: low 0, high 2, batch 1 Apr 26 18:45:32 tesla kernel: Node 0 Normal per-cpu: Apr 26 18:45:32 tesla kernel: cpu 0 hot: low 32, high 96, batch 16 Apr 26 18:45:32 tesla kernel: cpu 0 cold: low 0, high 32, batch 16 Apr 26 18:45:32 tesla kernel: cpu 1 hot: low 32, high 96, batch 16 Apr 26 18:45:32 tesla kernel: cpu 1 cold: low 0, high 32, batch 16 Apr 26 18:45:32 tesla kernel: Node 0 HighMem per-cpu: empty Apr 26 18:45:32 tesla kernel: Apr 26 18:45:32 tesla kernel: Free pages: 6352kB (0kB HighMem) Apr 26 18:45:32 tesla kernel: Active:3245 inactive:3129 dirty:0 writeback:0 unstable:0 free:1588 slab:499421 mapped:4514 pagetables:914 Apr 26 18:45:32 tesla kernel: Node 0 DMA free:752kB min:44kB low:88kB high:132kB active:0kB inactive:0kB present:15996kB pages_scanned:0 all_unreclaimable? yes Apr 26 18:45:32 tesla kernel: protections[]: 0 286000 286000 Apr 26 18:45:32 tesla kernel: Node 0 Normal free:5600kB min:5720kB low:11440kB high:17160kB active:12980kB inactive:12516kB present:2080704kB pages_scanned:20031 all_unreclaimable? yes Apr 26 18:45:32 tesla kernel: protections[]: 0 0 0 Apr 26 18:45:32 tesla kernel: Node 0 HighMem free:0kB min:128kB low:256kB high:384kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no Apr 26 18:45:32 tesla kernel: protections[]: 0 0 0 Apr 26 18:45:32 tesla kernel: Node 0 DMA: 4*4kB 4*8kB 2*16kB 3*32kB 3*64kB 1*128kB 1*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 752kB Apr 26 18:45:32 tesla kernel: Node 0 Normal: 0*4kB 0*8kB 0*16kB 1*32kB 1*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 0*2048kB 1*4096kB = 5600kB Apr 26 18:45:32 tesla kernel: Node 0 HighMem: empty Apr 26 18:45:32 tesla kernel: 6192 pagecache pages Every 4 days the host2 system (running NFS service) starts running oom-killer, goes brain dead, and gets fenced. The http processes are restarted every morning at 4:00 AM for log rotates so I don't think they are the problem. 
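Before the next oom event on host2, it may help to record what is actually consuming the Normal zone, since the oom output above shows roughly 2 GB sitting in slab; a hedged monitoring sketch (the log path is an assumption):

# Once a minute, log free memory and the largest slab caches; steadily
# growing GFS/DLM slab caches tend to show up here well before the oom.
while true; do
    date
    grep -E 'MemFree|LowFree|Slab' /proc/meminfo
    slabtop -o -s c | head -15     # -o prints once, -s c sorts by cache size
    echo
    sleep 60
done >> /var/log/lowmem-watch.log

Caches such as gfs_glock or gfs_inode growing without bound would point at the GFS mount rather than the httpd or NFS daemons.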
Attempts to fix: http://kbase.redhat.com/faq/docs/DOC-3993 http://kbase.redhat.com/faq/docs/DOC-7317 http://kb.vmware.com/selfservice/microsites/search.do?language=en_US &cmd=displayKC&externalId=1002704 Release Found: Red Hat Enterprise Linux 4 Update 4 Symptom: The command top shows a lot of memory is being cached and swap is hardly being used. Solution: On Red Hat Enterprise Release 4 Update 4, a workaround to the oom killer kills random processess while there is still memory available, is to issue the following commend: This will cause page reclamation to happen sooner, thus providing more 'protection' for the zones. Changes to Tesla : [root at host ~]# echo 100 > /proc/sys/vm/lower_zone_protection Anybody have any ideas? Thanks, Eric -------------- next part -------------- An HTML attachment was scrubbed... URL: From emilio.ah at gmail.com Tue Apr 27 11:58:39 2010 From: emilio.ah at gmail.com (Emilio Arjona) Date: Tue, 27 Apr 2010 13:58:39 +0200 Subject: [Linux-cluster] GFS2 and D state HTTPD processes In-Reply-To: References: <1267520814.3405.2.camel@localhost> Message-ID: Thanks Ricardo, We don't want to update the server because it's in production. We will plan a system update in summer when system's load is low. In the last incidents there is a new process involved: [delete_workqueu]. Now, it is usually the initiator of the D-state processes lockout. I have been looking for information about this process but couldn't find out anything. Any idea? Regards :) 2010/4/9 Ricardo Arg?ello > Looks like this bug: > > GFS2 - probably lost glock call back > https://bugzilla.redhat.com/show_bug.cgi?id=498976 > > This is fixed in the kernel included in RHEL 5.5. > Do a "yum update" to fix it. > > Ricardo Arguello > > On Tue, Mar 2, 2010 at 6:10 AM, Emilio Arjona wrote: > > Thanks for your response, Steve. > > > > 2010/3/2 Steven Whitehouse : > >> Hi, > >> > >> On Fri, 2010-02-26 at 16:52 +0100, Emilio Arjona wrote: > >>> Hi, > >>> > >>> we are experiencing some problems commented in an old thread: > >>> > >>> http://www.mail-archive.com/linux-cluster at redhat.com/msg07091.html > >>> > >>> We have 3 clustered servers under Red Hat 5.4 accessing a GFS2 > resource. > >>> > >>> fstab options: > >>> /dev/vg_cluster/lv_cluster /opt/datacluster gfs2 > >>> defaults,noatime,nodiratime,noquota 0 0 > >>> > >>> GFS options: > >>> plock_rate_limit="0" > >>> plock_ownership=1 > >>> > >>> httpd processes run into D status sometimes and the only solution is > >>> hard reset the affected server. > >>> > >>> Can anyone give me some hints to diagnose the problem? > >>> > >>> Thanks :) > >>> > >> Can you give me a rough idea of what the actual workload is and how it > >> is distributed amoung the director(y/ies) ? > > > > We had problems with php sessions in the past but we fixed it by > > configuring php to store the sessions in the database instead of in > > the GFS filesystem. Now, we're having problems with files and > > directories in the "data" folder of Moodle LMS. > > > > "lsof -p" returned a i/o operation over the same folder in 2/3 nodes, > > we did a hard reset of these nodes but some hours after the CPU load > > grew up again, specially in the node that wasn't rebooted. We decided > > to reboot (v?a ssh) this node, then the CPU load went down to normal > > values in all nodes. > > > > I don't think the system's load is high enough to produce concurrent > > access problems. 
It's more likely to be some misconfiguration, in > > fact, we changed some GFS2 options to non default values to increase > > performance ( > http://www.linuxdynasty.org/howto-increase-gfs2-performance-in-a-cluster.html > ). > > > >> > >> This is often down to contention on glocks (one per inode) and maybe > >> because there is a process of processes writing a file or directory > >> which is in use (either read-only or writable) by other processes. > >> > >> If you are using php, then you might have to strace it to find out what > >> it is really doing, > > > > Ok, we will try to strace the D processes and post the results. Hope > > we find something!! > > > >> > >> Steve. > >> > >>> -- > >>> > >>> Emilio Arjona. > >>> > >>> -- > >>> Linux-cluster mailing list > >>> Linux-cluster at redhat.com > >>> https://www.redhat.com/mailman/listinfo/linux-cluster > >> > >> > >> -- > >> Linux-cluster mailing list > >> Linux-cluster at redhat.com > >> https://www.redhat.com/mailman/listinfo/linux-cluster > >> > > > > > > > > -- > > Emilio Arjona. > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- ******************************************* Emilio Arjona Heredia Centro de Ense?anzas Virtuales de la Universidad de Granada C/ Real de Cartuja 36-38 http://cevug.ugr.es Tlfno.: 958-241000 ext. 20206 ******************************************* -------------- next part -------------- An HTML attachment was scrubbed... URL: From giorgio.luchi at welcomeitalia.it Tue Apr 27 14:15:43 2010 From: giorgio.luchi at welcomeitalia.it (Giorgio Luchi) Date: Tue, 27 Apr 2010 16:15:43 +0200 Subject: [Linux-cluster] GFS2 - force release locks Message-ID: <01aa01cae614$1f488060$5dd98120$@luchi@welcomeitalia.it> Hi to all, We're currently working on setting up a three nodes cluster for managing e-mail. Each node has a local disk for operating system (CentOS 5.4) and three disks shared via GFS2. We plan to split domain across the three nodes to avoid concurrency locking (as much as possible): each node read/write only on one of the shared disks; we also plan to achieve fault tolerance using a Cisco CSS that has, for each domain, one node as primary server and a second node as sorry server in case of fault (or in case of maintenance): in that case one node will take care of the domains of the faulty node and it will read/write on two shared disk. We have this question. Suppose we shut down a node for maintenance. The Cisco CSS recognizes the primary server is down and so it switches the traffic to the sorry server; the cluster does its work and so no problems are noticed by customer. After few days we restart the node; the Cisco CSS restore the traffic to the primary server. At this point all is working again in the "default" scenario, but the domains served by the "maintained node" will have problem in performance due the lock owned by the sorry server. Is there a way to force a node to release all the lock related to a directory (or to a mount point)? If possible, we'd like to do this without umount the shared disk, because to do so it's necessary also to restart all the services that use the three shared disks. 
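On the question just above about releasing locks without unmounting: GFS2 keeps a glock for as long as the corresponding inode stays cached, so once traffic has switched back to the primary node, pushing the sorry server's caches out makes it drop the glocks it no longer needs. A hedged sketch, not a documented interface for targeting a single directory:

# On the node that covered the domains during maintenance, after the
# CSS has switched traffic back to the primary:
sync
echo 3 > /proc/sys/vm/drop_caches    # evict clean pagecache, dentries and inodes;
                                     # glocks for evicted inodes are then demoted

# Glock counts per filesystem can be watched through debugfs:
mount -t debugfs none /sys/kernel/debug 2>/dev/null
wc -l /sys/kernel/debug/gfs2/*/glocks

This releases locks filesystem-wide rather than per directory, which matches the limitation implied in the question.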
Thanks in advance, regards Giorgio Luchi From swap_project at yahoo.com Tue Apr 27 18:43:38 2010 From: swap_project at yahoo.com (Srija) Date: Tue, 27 Apr 2010 11:43:38 -0700 (PDT) Subject: [Linux-cluster] GFS in cluster In-Reply-To: <1271753320.2451.22.camel@localhost> Message-ID: <434320.79056.qm@web112804.mail.gq1.yahoo.com> Hi , Sorry for replying in late, became busy with other work. I have gone through the document you mentioned. I have already mounted the GFS on each node. I am building the guests on this GFS file system and all the nodes are zen hosts. My question is, the file system will be available from each node but the guests will be only available from the node from where the xm create is being executed. - Suppose the guests are in node1. Now if node1 gone down then all the guests will go down too. But if the guests are enabled from all the nodes then if node1 gone down, then still in the other nodes the guests are running. - That's why my first email is for that should I need to create any service so that if any script is being set under the service and in that script ,I can mention xm create create etc etc, to bring up the guests in other node and so on. So pl. advice me now what will be the best procedure to follow in the above scenario which I mentioned. Thanks again --- On Tue, 4/20/10, Steven Whitehouse wrote: > From: Steven Whitehouse > Subject: Re: [Linux-cluster] GFS in cluster > To: "linux clustering" > Date: Tuesday, April 20, 2010, 4:48 AM > Hi, > > On Mon, 2010-04-19 at 13:31 -0700, Srija wrote: > > Hi, > > > >? I have created a GFS filey system? as > shared between three nodes clusters. > >? The file system is being mounted in > >? the three nodes and I set the mount points in > the /etc/fstab, > > > >? Want to know how the cluster will keep the track > of the GFS file system. > >? How the fence/lock_dlm will work? > > > >? Do i need to set the GFS in a service? If yes , > what will be the > >? resources under the service? > > > >? Will be really appreciated if I get some > document to proceed further. > > > >? Thanks. > > > > > You don't need to set GFS up as a service. Its > automatically available > on each node its mounted on. Have you seen the docs here?: > > http://www.redhat.com/docs/manuals/enterprise/ > > That should be enough to get you started, > > Steve. > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > From pradhanparas at gmail.com Tue Apr 27 19:40:16 2010 From: pradhanparas at gmail.com (Paras pradhan) Date: Tue, 27 Apr 2010 14:40:16 -0500 Subject: [Linux-cluster] GFS in cluster In-Reply-To: <434320.79056.qm@web112804.mail.gq1.yahoo.com> References: <1271753320.2451.22.camel@localhost> <434320.79056.qm@web112804.mail.gq1.yahoo.com> Message-ID: On Tue, Apr 27, 2010 at 1:43 PM, Srija wrote: > Hi , > > Sorry for replying in late, became busy with other work. > > I have gone through the document you mentioned. I have already mounted the > GFS on each node. I am building the guests on this GFS file system and all > the nodes are zen hosts. > > My question is, the file system will be available from each node but the > guests will be only available from the node from where the xm create is > being executed. > > - Suppose the guests are in node1. Now if node1 gone down then all the > guests will go down too. But if the guests are enabled from all the > nodes then if node1 gone down, then still in the other nodes the guests > are running. 
> > - That's why my first email is for that should I need to create any > service so that if any script is being set under the service and in > that script ,I can mention xm create create etc etc, to bring up the > guests in other node and so on. > > So pl. advice me now what will be the best procedure to follow in the > above scenario which I mentioned. > If you have GFS mounted on all the nodes, then you need to create your xen hosts on that shared GFS partition and then create virtual machine services to control(recovery,restart) the xen hosts. You can do that from Luci. This might help you http://magazine.redhat.com/2007/08/23/automated-failover-and-recovery-of-virtualized-guests-in-advanced-platform/ Paras. > > Thanks again > > > > > --- On Tue, 4/20/10, Steven Whitehouse wrote: > > > From: Steven Whitehouse > > Subject: Re: [Linux-cluster] GFS in cluster > > To: "linux clustering" > > Date: Tuesday, April 20, 2010, 4:48 AM > > Hi, > > > > On Mon, 2010-04-19 at 13:31 -0700, Srija wrote: > > > Hi, > > > > > > I have created a GFS filey system as > > shared between three nodes clusters. > > > The file system is being mounted in > > > the three nodes and I set the mount points in > > the /etc/fstab, > > > > > > Want to know how the cluster will keep the track > > of the GFS file system. > > > How the fence/lock_dlm will work? > > > > > > Do i need to set the GFS in a service? If yes , > > what will be the > > > resources under the service? > > > > > > Will be really appreciated if I get some > > document to proceed further. > > > > > > Thanks. > > > > > > > > You don't need to set GFS up as a service. Its > > automatically available > > on each node its mounted on. Have you seen the docs here?: > > > > http://www.redhat.com/docs/manuals/enterprise/ > > > > That should be enough to get you started, > > > > Steve. > > > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From sdake at redhat.com Tue Apr 27 23:36:18 2010 From: sdake at redhat.com (Steven Dake) Date: Tue, 27 Apr 2010 16:36:18 -0700 Subject: [Linux-cluster] announcement of the vinzvault project Message-ID: <1272411378.2601.72.camel@localhost.localdomain> CCing openais and linux-cluster ml since there may be some cluster developers interested in participating in this project there. I am pleased to announce we are starting a new project called vinzvault to help resolve some of the difficulties in deploying virtual machines in data-centers. There are other projects that use similar technology or have similar goals as ours. The Ceph filesystem provides a cloud file-system for large scale machines to use as storage. Hail provides a S3 API for accessing information. Cassandra provides a distributed database using techniques similar to what we are planning to provide eventually consistent replicated bigtable style databases. Our project is focused around one goal: providing a small footprint (10kloc) highly available block storage area for virtual machines optimized for Linux data-centers. Our plans don't depend on SAN hardware, software, hardware fencing devices, or any other hardware then is commonly available on commodity hardware. We intend to trade these lower-scale high cost technologies for higher-scale lower cost techniques. 
Some of our requirements: * Easy to use, deploy, and manage. * 100,000 host count scalability. * Only depend on commodity hardware systems. * Migration works seamlessly within a datacenter without SAN hardware. * VM block images can be replicated to N where N is configurable per VM image. * VM block images can be replicated to various data centers. * Low latency block storage access for all VMs. * Tuneable block sizes per VM. * Use standard network mechanisms to transmit blocks to the various replicas. * Avoid multicast. * Ensure only authorized host machines may connect to the vinzvault storage areas. * No central metadata server - everything is 100% distributed. We plan to execute this project using an overlay DHT hash table called D1HT(1). The 1 in D1HT indicates there is, in a majority of cases, only 1 network request/response required per block of storage. Like all solutions that trade performance for scale/cost, our project may not meet your deployment needs, but we aim to focus on correctness first and performance second. We hope readers will participate in the development of this LGPL/GPL open source project. Our mailing list is vinzvault at fedorahosted.org. One final note - no code is in our repo yet - that is for developers interested in this technology to make happen (this is a from scratch implementation). Lets get cracking! Regards -steve (1) http://www.cos.ufrj.br/~monnerat/D1HT_paper.html From john_simpson at reyrey.com Wed Apr 28 13:07:52 2010 From: john_simpson at reyrey.com (Simpson, John R) Date: Wed, 28 Apr 2010 09:07:52 -0400 Subject: [Linux-cluster] CentOS 4.8, nfs, and fence every 4 days In-Reply-To: <015401cae5a9$6c35a850$44a0f8f0$@edu> References: <015401cae5a9$6c35a850$44a0f8f0$@edu> Message-ID: <67C1678059C61F408194E53907AFB5CC0A30186989@IS-EXMB01-RP.ad.reyrey.com> Have you tried setting the sysctl parameter vm/overcommit_memory to 2 to turn off the overcommit, as described here: http://lwn.net/Articles/104179/ John Simpson Senior Software Engineer, I. T. Engineering and Operations From swap_project at yahoo.com Thu Apr 29 20:18:40 2010 From: swap_project at yahoo.com (Srija) Date: Thu, 29 Apr 2010 13:18:40 -0700 (PDT) Subject: [Linux-cluster] GFS in cluster In-Reply-To: Message-ID: <602996.5266.qm@web112810.mail.gq1.yahoo.com> Thanks Paras, ? Can it be possible to send similar configuration using system-config-cluster.? Or if you can give me some pointers to use the similar configuration using system-config-cluster that will be really appreciated. ? Thanks again ? --- On Tue, 4/27/10, Paras pradhan wrote: From: Paras pradhan Subject: Re: [Linux-cluster] GFS in cluster To: "linux clustering" Date: Tuesday, April 27, 2010, 3:40 PM On Tue, Apr 27, 2010 at 1:43 PM, Srija wrote: Hi , Sorry for replying in late, became busy with other work. I have gone through the document you mentioned. I have already mounted ?the GFS on each node. I am building the guests on this GFS file system and all the nodes are ?zen hosts. My question is, ?the file system will be available from each node but the guests will be only available from the node from where the xm create is being executed. ?- Suppose the guests are in node1. Now if node1 gone down then all the ? ?guests will go down too. But if the guests are enabled from all the ? ?nodes then if node1 gone down, then still in the other nodes the guests ? ?are running. ?- That's why ?my first email is for that should I need to create any ? ?service so ?that if any script is being set under the service and in ? 
?that script ,I can mention xm create create etc etc, to bring up the ? ?guests in other node and so on. So pl. advice me now what will be the best procedure ?to follow in the above scenario which I mentioned. If you have GFS mounted on all the nodes, then you need to create your xen hosts on that shared GFS partition and then create virtual machine services to control(recovery,restart) ?the xen hosts. You can do that from Luci. This might help you http://magazine.redhat.com/2007/08/23/automated-failover-and-recovery-of-virtualized-guests-in-advanced-platform/ Paras. ? Thanks again --- On Tue, 4/20/10, Steven Whitehouse wrote: > From: Steven Whitehouse > Subject: Re: [Linux-cluster] GFS in cluster > To: "linux clustering" > Date: Tuesday, April 20, 2010, 4:48 AM > Hi, > > On Mon, 2010-04-19 at 13:31 -0700, Srija wrote: > > Hi, > > > >? I have created a GFS filey system? as > shared between three nodes clusters. > >? The file system is being mounted in > >? the three nodes and I set the mount points in > the /etc/fstab, > > > >? Want to know how the cluster will keep the track > of the GFS file system. > >? How the fence/lock_dlm will work? > > > >? Do i need to set the GFS in a service? If yes , > what will be the > >? resources under the service? > > > >? Will be really appreciated if I get some > document to proceed further. > > > >? Thanks. > > > > > You don't need to set GFS up as a service. Its > automatically available > on each node its mounted on. Have you seen the docs here?: > > http://www.redhat.com/docs/manuals/enterprise/ > > That should be enough to get you started, > > Steve. > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -----Inline Attachment Follows----- -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: From pradhanparas at gmail.com Thu Apr 29 20:51:54 2010 From: pradhanparas at gmail.com (Paras pradhan) Date: Thu, 29 Apr 2010 15:51:54 -0500 Subject: [Linux-cluster] GFS in cluster In-Reply-To: <602996.5266.qm@web112810.mail.gq1.yahoo.com> References: <602996.5266.qm@web112810.mail.gq1.yahoo.com> Message-ID: I found luci to be easier.. so I used that. But stopped using it also since Luci is a crap if you have multi path devices and storage clvm cluster. Now I am using command line tools to create/modify GFS shares and vi to edit /etc/cluster.conf file. I guess the redhat docs explains how to use system-config-cluster. Paras. On Thu, Apr 29, 2010 at 3:18 PM, Srija wrote: > Thanks Paras, > > Can it be possible to send similar configuration using > system-config-cluster. Or if you can > give me some pointers to use the similar configuration using > system-config-cluster that will be really appreciated. > > Thanks again > > > --- On *Tue, 4/27/10, Paras pradhan * wrote: > > > From: Paras pradhan > > Subject: Re: [Linux-cluster] GFS in cluster > To: "linux clustering" > Date: Tuesday, April 27, 2010, 3:40 PM > > > > > On Tue, Apr 27, 2010 at 1:43 PM, Srija > > wrote: > >> Hi , >> >> Sorry for replying in late, became busy with other work. >> >> I have gone through the document you mentioned. I have already mounted >> the GFS on each node. I am building the guests on this GFS file system and >> all the nodes are zen hosts. 
>> >> My question is, the file system will be available from each node but the >> guests will be only available from the node from where the xm create is >> being executed. >> >> - Suppose the guests are in node1. Now if node1 gone down then all the >> guests will go down too. But if the guests are enabled from all the >> nodes then if node1 gone down, then still in the other nodes the guests >> are running. >> >> - That's why my first email is for that should I need to create any >> service so that if any script is being set under the service and in >> that script ,I can mention xm create create etc etc, to bring up the >> guests in other node and so on. >> >> So pl. advice me now what will be the best procedure to follow in the >> above scenario which I mentioned. >> > > If you have GFS mounted on all the nodes, then you need to create your xen > hosts on that shared GFS partition and then create virtual machine services > to control(recovery,restart) the xen hosts. You can do that from Luci. > > This might help you > > http://magazine.redhat.com/2007/08/23/automated-failover-and-recovery-of-virtualized-guests-in-advanced-platform/ > > Paras. > > > >> >> Thanks again >> >> >> >> >> --- On Tue, 4/20/10, Steven Whitehouse > >> wrote: >> >> > From: Steven Whitehouse >> > >> > Subject: Re: [Linux-cluster] GFS in cluster >> > To: "linux clustering" >> > >> > Date: Tuesday, April 20, 2010, 4:48 AM >> > Hi, >> > >> > On Mon, 2010-04-19 at 13:31 -0700, Srija wrote: >> > > Hi, >> > > >> > > I have created a GFS filey system as >> > shared between three nodes clusters. >> > > The file system is being mounted in >> > > the three nodes and I set the mount points in >> > the /etc/fstab, >> > > >> > > Want to know how the cluster will keep the track >> > of the GFS file system. >> > > How the fence/lock_dlm will work? >> > > >> > > Do i need to set the GFS in a service? If yes , >> > what will be the >> > > resources under the service? >> > > >> > > Will be really appreciated if I get some >> > document to proceed further. >> > > >> > > Thanks. >> > > >> > > >> > You don't need to set GFS up as a service. Its >> > automatically available >> > on each node its mounted on. Have you seen the docs here?: >> > >> > http://www.redhat.com/docs/manuals/enterprise/ >> > >> > That should be enough to get you started, >> > >> > Steve. >> > >> > >> > -- >> > Linux-cluster mailing list >> > Linux-cluster at redhat.com >> > https://www.redhat.com/mailman/listinfo/linux-cluster >> > >> >> >> >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> > > > -----Inline Attachment Follows----- > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From swap_project at yahoo.com Thu Apr 29 23:47:28 2010 From: swap_project at yahoo.com (Srija) Date: Thu, 29 Apr 2010 16:47:28 -0700 (PDT) Subject: [Linux-cluster] GFS in cluster In-Reply-To: Message-ID: <546815.12140.qm@web112814.mail.gq1.yahoo.com> Thanks? for your reply. I am aware of? system-config-cluster also I do configurations? from command line. I am just not clear of the document which is explained in conga. What type of service do i need to create and what are the resources I need to create in that service.? 
From swap_project at yahoo.com Thu Apr 29 23:47:28 2010
From: swap_project at yahoo.com (Srija)
Date: Thu, 29 Apr 2010 16:47:28 -0700 (PDT)
Subject: [Linux-cluster] GFS in cluster
In-Reply-To:
Message-ID: <546815.12140.qm@web112814.mail.gq1.yahoo.com>

Thanks for your reply.

I am aware of system-config-cluster; I also do configurations from the
command line. I am just not clear about the document which is explained in
Conga.

What type of service do I need to create, and what are the resources I
need to create in that service? If you explain that, it will be really
appreciated.

If you check my first email, that was my question only.

Thanks again

--- On Thu, 4/29/10, Paras pradhan wrote:

> From: Paras pradhan
> Subject: Re: [Linux-cluster] GFS in cluster
> To: "linux clustering"
> Date: Thursday, April 29, 2010, 4:51 PM
>
> I found luci to be easier, so I used that. But I stopped using it as
> well, since Luci is problematic if you have multipath devices and a
> clustered LVM storage setup. Now I am using command line tools to
> create/modify GFS shares and vi to edit the /etc/cluster/cluster.conf
> file.
>
> I guess the Red Hat docs explain how to use system-config-cluster.
>
> Paras.

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
From pradhanparas at gmail.com Fri Apr 30 15:11:27 2010
From: pradhanparas at gmail.com (Paras pradhan)
Date: Fri, 30 Apr 2010 10:11:27 -0500
Subject: [Linux-cluster] GFS in cluster
In-Reply-To: <546815.12140.qm@web112814.mail.gq1.yahoo.com>
References: <546815.12140.qm@web112814.mail.gq1.yahoo.com>
Message-ID:

On Thu, Apr 29, 2010 at 6:47 PM, Srija wrote:

> Thanks for your reply.
>
> I am aware of system-config-cluster; I also do configurations from the
> command line. I am just not clear about the document which is explained
> in Conga.
>
> What type of service do I need to create, and what are the resources I
> need to create in that service? If you explain that, it will be really
> appreciated.

You only need a virtual machine service and a failover domain (or
domains). See "Add a virtual machine service" in Conga.

Paras.

> If you check my first email, that was my question only.
>
> Thanks again

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
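For readers who would rather see what that looks like in cluster.conf than
in Conga, a trimmed, hypothetical fragment is sketched below. The node
names, domain name, and path are placeholders, and attribute details vary
between releases, so treat it as an illustration rather than a reference:

<rm>
  <failoverdomains>
    <!-- unordered, unrestricted domain spanning the xen hosts -->
    <failoverdomain name="xen_fd" ordered="0" restricted="0">
      <failoverdomainnode name="node1" priority="1"/>
      <failoverdomainnode name="node2" priority="1"/>
      <failoverdomainnode name="node3" priority="1"/>
    </failoverdomain>
  </failoverdomains>
  <!-- one vm resource per guest; path points at the directory of xen
       config files kept on the shared GFS mount -->
  <vm name="vm01" domain="xen_fd" path="/gfs/xen"
      autostart="1" recovery="restart"/>
</rm>

Once defined, the service can also be driven by hand, for example
clusvcadm -e vm:vm01 -m node2 to start the guest on a particular member.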
From yvette at dbtgroup.com Fri Apr 30 15:59:30 2010
From: yvette at dbtgroup.com (yvette hirth)
Date: Fri, 30 Apr 2010 15:59:30 +0000
Subject: [Linux-cluster] gfs2 security issue
Message-ID: <4BDAFE62.9020901@dbtgroup.com>

I just saw this on a SANS security vulnerability alert. Is everyone aware
of this?

10.18.18 CVE: Not Available
Platform: Linux
Title: Linux Kernel "gfs2_quota" Structure Write Local Privilege Escalation
Description: The Linux kernel is exposed to a local privilege escalation
issue affecting the "gfs2" file system. Specifically, when a "gfs2_quota"
structure straddles a page boundary, updates to the structure are not
correctly written to disk. This can result in a buffer overflow condition
which may lead to memory corruption.
Ref: http://www.securityfocus.com/bid/39715

fyi
yvette hirth

From eschneid at uccs.edu Fri Apr 30 16:08:35 2010
From: eschneid at uccs.edu (Eric Schneider)
Date: Fri, 30 Apr 2010 10:08:35 -0600
Subject: [Linux-cluster] CentOS 4.8, nfs, and fence every 4 days
In-Reply-To: <67C1678059C61F408194E53907AFB5CC0A30186989@IS-EXMB01-RP.ad.reyrey.com>
References: <015401cae5a9$6c35a850$44a0f8f0$@edu>
	<67C1678059C61F408194E53907AFB5CC0A30186989@IS-EXMB01-RP.ad.reyrey.com>
Message-ID: <004c01cae87f$632ffb60$298ff220$@edu>

John,

I did see the same recommendation from another source, but I was a little
hesitant after reading some posts recommending against it. I guess I am
not convinced either way.

Thanks,

Eric

-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Simpson, John R
Sent: Wednesday, April 28, 2010 7:08 AM
To: linux clustering
Subject: Re: [Linux-cluster] CentOS 4.8, nfs, and fence every 4 days

Have you tried setting the sysctl parameter vm/overcommit_memory to 2 to
turn off the overcommit, as described here:
http://lwn.net/Articles/104179/

John Simpson
Senior Software Engineer, I. T. Engineering and Operations

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster
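On the overcommit suggestion quoted above: if you do decide to try it, the
setting can be tested on the running system first and then made persistent.
A small sketch; the ratio value is only an example, and the usual caveats
about strict overcommit on memory-hungry hosts apply:

# try it at runtime first
sysctl -w vm.overcommit_memory=2   # 2 = commit limit is swap + ratio% of RAM
sysctl -w vm.overcommit_ratio=80   # percentage of RAM counted toward the limit
# make it persistent by adding these lines to /etc/sysctl.conf:
#   vm.overcommit_memory = 2
#   vm.overcommit_ratio = 80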
From swhiteho at redhat.com Fri Apr 30 16:17:26 2010
From: swhiteho at redhat.com (Steven Whitehouse)
Date: Fri, 30 Apr 2010 17:17:26 +0100
Subject: [Linux-cluster] gfs2 security issue
In-Reply-To: <4BDAFE62.9020901@dbtgroup.com>
References: <4BDAFE62.9020901@dbtgroup.com>
Message-ID: <1272644246.2437.44.camel@localhost>

Hi,

Yes, we know and the fix is pretty much ready to go. It isn't a priv
escalation anyway; it's memory corruption, most likely leading to an oops.

Steve.

On Fri, 2010-04-30 at 15:59 +0000, yvette hirth wrote:
> I just saw this on a SANS security vulnerability alert. Is everyone
> aware of this?
>
> 10.18.18 CVE: Not Available
> Platform: Linux
> Title: Linux Kernel "gfs2_quota" Structure Write Local Privilege
> Escalation
> Description: The Linux kernel is exposed to a local privilege
> escalation issue affecting the "gfs2" file system. Specifically, when a
> "gfs2_quota" structure straddles a page boundary, updates to the
> structure are not correctly written to disk. This can result in a
> buffer overflow condition which may lead to memory corruption.
> Ref: http://www.securityfocus.com/bid/39715
>
> fyi
> yvette hirth
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

From dhoffutt at gmail.com Fri Apr 30 19:37:15 2010
From: dhoffutt at gmail.com (Dusty)
Date: Fri, 30 Apr 2010 14:37:15 -0500
Subject: [Linux-cluster] Maximum number of nodes
Message-ID:

Hello,

Regarding the component versions of "Redhat Cluster Suite" as released on
the 5.4 and 5.5 ISOs: what is the maximum number of nodes that will work
within a single cluster?

From where do the limitations come? GFS2? Qdisk? What if not using qdisk?
What if not using GFS2?

Thank you!
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
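Whatever the formal support limit turns out to be for a given release, two
places where the node count shows up in practice are the GFS2 journal count
chosen at mkfs time and the vote/quorum arithmetic. A quick sketch with
placeholder cluster, volume, and mount names (mkfs destroys any existing
data on the device):

# one GFS2 journal is needed per node that will mount the file system
mkfs.gfs2 -p lock_dlm -t mycluster:gfs01 -j 8 /dev/myvg/mylv
gfs2_jadd -j 2 /mnt/gfs01          # add journals later if nodes are added
# quorum and vote accounting for the current membership
cman_tool status | grep -i votes
cman_tool nodes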