From neale at sinenomine.net Tue Sep 2 14:56:52 2014 From: neale at sinenomine.net (Neale Ferguson) Date: Tue, 2 Sep 2014 14:56:52 +0000 Subject: [Linux-cluster] F_SETLK fails after recovery Message-ID: Hi, In our two node system if one node fails, the other node takes over the application and uses the shared gfs2 target successfully. However, after the failed node comes back any attempt to lock files on the gfs2 resource results in -ENOSYS. The following test program exhibits the problem - in normal operation the lock succeeds but in the fail/recover scenario we get -ENOSYS: #include <stdio.h> #include <fcntl.h> #include <unistd.h> int main(int argc, char **argv) { int fd; struct flock fl; fd = open("/mnt/test.file",O_RDONLY); if (fd != -1) { if (fcntl(fd, F_SETFL, O_RDONLY|O_DSYNC) != -1) { fl.l_type = F_RDLCK; fl.l_whence = SEEK_SET; fl.l_start = 0; fl.l_len = 0; if (fcntl(fd, F_SETLK, &fl) != -1) printf("File locked successfully\n"); else perror("fcntl(F_SETLK)"); } else perror("fcntl(F_SETFL)"); close (fd); } else perror("open"); } I've tracked things down to these messages: 1409631951 lockspace lvclusdidiz0360 plock disabled our sig 816fba01 nodeid 2 sig 2f6b : 1409634840 lockspace lvclusdidiz0360 plock disabled our sig 0 nodeid 2 sig 2f6b Which indicates the lockspace attribute disable_plock has been set by way of the other node calling send_plocks_stored (). Looking at the cpg.c: static void prepare_plocks(struct lockspace *ls) { struct change *cg = list_first_entry(&ls->changes, struct change, list); struct member *memb; uint32_t sig; : : : if (nodes_added(ls)) store_plocks(ls, &sig); send_plocks_stored(ls, sig); } If nodes_added(ls) returns false then an uninitialized "sig" value will be passed to send_plocks_stored(). Do the "our sig" and "sig" values in the above log messages make sense? If this is not the case, what is supposed to happen in order to re-enable plocks on the recovered node? Neale From rpeterso at redhat.com Tue Sep 2 15:04:58 2014 From: rpeterso at redhat.com (Bob Peterson) Date: Tue, 2 Sep 2014 11:04:58 -0400 (EDT) Subject: [Linux-cluster] F_SETLK fails after recovery In-Reply-To: References: Message-ID: <974028307.15165439.1409670298540.JavaMail.zimbra@redhat.com> ----- Original Message ----- > Hi, > In our two node system if one node fails, the other node takes over the > application and uses the shared gfs2 target successfully. However, after > the failed node comes back any attempt to lock files on the gfs2 resource > results in -ENOSYS. The following test program exhibits the problem - in > normal operation the lock succeeds but in the fail/recover scenario we get > -ENOSYS: > > #include <stdio.h> > #include <fcntl.h> > #include <unistd.h> > > int > main(int argc, char **argv) > { > int fd; > struct flock fl; > > fd = open("/mnt/test.file",O_RDONLY); > if (fd != -1) { > if (fcntl(fd, F_SETFL, O_RDONLY|O_DSYNC) != -1) { > fl.l_type = F_RDLCK; > fl.l_whence = SEEK_SET; > fl.l_start = 0; > fl.l_len = 0; > if (fcntl(fd, F_SETLK, &fl) != -1) > printf("File locked successfully\n"); > else > perror("fcntl(F_SETLK)"); > } else > perror("fcntl(F_SETFL)"); > close (fd); > } else > perror("open"); > } > > I've tracked things down to these messages: > > 1409631951 lockspace lvclusdidiz0360 plock disabled our sig 816fba01 nodeid 2 > sig 2f6b > : > 1409634840 lockspace lvclusdidiz0360 plock disabled our sig 0 nodeid 2 sig > 2f6b > > Which indicates the lockspace attribute disable_plock has been set by way of > the other node calling send_plocks_stored > ().
> > Looking at the cpg.c: > > static void prepare_plocks(struct lockspace *ls) > { > > struct change *cg = list_first_entry(&ls->changes, struct change, list); > > struct member *memb; > uint32_t sig; > > : > : > : > if (nodes_added(ls)) > store_plocks(ls, &sig); > send_plocks_stored(ls, sig); > } > > If nodes_added(ls) returns false then an uninitialized "sig" value will be > passed to send_plocks_stored(). Do the "our sig" and "sig" values in the > above log messages make sense? > > If this is not the case, what is supposed to happen in order re-enable plocks > on the recovered node? > > Neale Hi Neale, For what it's worth: GFS2 just passes plock requests down to the cluster infrastructure. (Unlike flocks which are handled internally by gfs2). It will be important for the cluster folks to know what release this is. At this point I'm not sure if it's openais, corosync or what not. Regards, Bob Peterson Red Hat File Systems From neale at sinenomine.net Tue Sep 2 15:16:49 2014 From: neale at sinenomine.net (Neale Ferguson) Date: Tue, 2 Sep 2014 15:16:49 +0000 Subject: [Linux-cluster] F_SETLK fails after recovery In-Reply-To: <974028307.15165439.1409670298540.JavaMail.zimbra@redhat.com> References: <974028307.15165439.1409670298540.JavaMail.zimbra@redhat.com> Message-ID: <88636BE6-E27B-41AC-A7A4-58C32749302A@sinenomine.net> Thanks Bob, It's corosync - corosync-1.4.1-17, cman-3.0.12.1-60, fence-agents-3.1.5-26. Neale On Sep 2, 2014, at 11:04 AM, Bob Peterson wrote: > ----- Original Message ----- > > Hi Neale, > > For what it's worth: GFS2 just passes plock requests down to the cluster > infrastructure. (Unlike flocks which are handled internally by gfs2). It will be > important for the cluster folks to know what release this is. At this point > I'm not sure if it's openais, corosync or what not. From neale at sinenomine.net Tue Sep 2 15:24:19 2014 From: neale at sinenomine.net (Neale Ferguson) Date: Tue, 2 Sep 2014 15:24:19 +0000 Subject: [Linux-cluster] F_SETLK fails after recovery In-Reply-To: <974028307.15165439.1409670298540.JavaMail.zimbra@redhat.com> References: <974028307.15165439.1409670298540.JavaMail.zimbra@redhat.com> Message-ID: Forget the snippet of code in my original posting as the code in 3.0.12-60 actually looks like this: if (nodes_added(ls)) { store_plocks(ls, &sig); ls->last_plock_sig = sig; } else { sig = ls->last_plock_sig; } send_plocks_stored(ls, sig); So sig is never uninitialized. However, the question still remains - node 2 disables plocks for node 1. How are the supposed to be re-enabled? Neale On Sep 2, 2014, at 11:04 AM, Bob Peterson wrote: > ----- Original Message ----- >> >> Looking at the cpg.c: >> >> static void prepare_plocks(struct lockspace *ls) >> { >> >> struct change *cg = list_first_entry(&ls->changes, struct change, list); >> >> struct member *memb; >> uint32_t sig; >> >> : >> : >> : >> if (nodes_added(ls)) >> store_plocks(ls, &sig); >> send_plocks_stored(ls, sig); >> } >> >> If nodes_added(ls) returns false then an uninitialized "sig" value will be >> passed to send_plocks_stored(). Do the "our sig" and "sig" values in the >> above log messages make sense? >> >> If this is not the case, what is supposed to happen in order re-enable plocks >> on the recovered node? 
From teigland at redhat.com Tue Sep 2 15:37:10 2014 From: teigland at redhat.com (David Teigland) Date: Tue, 2 Sep 2014 10:37:10 -0500 Subject: [Linux-cluster] F_SETLK fails after recovery In-Reply-To: References: Message-ID: <20140902153710.GB374@redhat.com> On Tue, Sep 02, 2014 at 02:56:52PM +0000, Neale Ferguson wrote: > 1409631951 lockspace lvclusdidiz0360 > plock disabled our sig 816fba01 nodeid 2 sig 2f6b There is a difference in plock data signatures between the node that wrote the data and the node that read it (this one). This indicates that the plock data was not synced correctly by the openais/corosync checkpoints, or that the signatures were not synced correctly (e.g bug 623816). From neale at sinenomine.net Tue Sep 2 16:02:39 2014 From: neale at sinenomine.net (Neale Ferguson) Date: Tue, 2 Sep 2014 16:02:39 +0000 Subject: [Linux-cluster] F_SETLK fails after recovery In-Reply-To: <20140902153710.GB374@redhat.com> References: <20140902153710.GB374@redhat.com> Message-ID: Thanks David, That makes sense as there's this message that precedes the disable message in the log: retrieve_plocks ckpt open error 12 lvclusdidiz0360 Neale On Sep 2, 2014, at 11:37 AM, David Teigland wrote: > On Tue, Sep 02, 2014 at 02:56:52PM +0000, Neale Ferguson wrote: > >> 1409631951 lockspace lvclusdidiz0360 >> plock disabled our sig 816fba01 nodeid 2 sig 2f6b > > There is a difference in plock data signatures between the node that wrote > the data and the node that read it (this one). This indicates that the > plock data was not synced correctly by the openais/corosync checkpoints, > or that the signatures were not synced correctly (e.g bug 623816). > From neale at sinenomine.net Tue Sep 2 16:24:07 2014 From: neale at sinenomine.net (Neale Ferguson) Date: Tue, 2 Sep 2014 16:24:07 +0000 Subject: [Linux-cluster] F_SETLK fails after recovery In-Reply-To: References: <20140902153710.GB374@redhat.com> Message-ID: In retrieve_plocks_stored() there is the code: retrieve_plocks(ls, &sig); if ((hd->flags & DLM_MFLG_PLOCK_SIG) && (sig != hd->msgdata2)) { log_error("lockspace %s plock disabled our sig %x " "nodeid %d sig %x", ls->name, sig, hd->nodeid, hd->msgdata2); ls->disable_plock = 1; ls->need_plocks = 1; /* don't set HAVEPLOCK */ ls->save_plocks = 0; return; } Node 1 is getting rc=12 from saCkptCheckpointOpen (SA_AIS_ERR_NOT_EXIST). However, this error is ignored and we process the sig value as if is valid rather than an uninitialized value that was never set by the retrieve_plocks() function. So I guess the question is why can't it find the checkpoint file and/or what is the correct action when the sig value cannot be retrieved? Neale On Sep 2, 2014, at 12:02 PM, Neale Ferguson wrote: > Thanks David, > That makes sense as there's this message that precedes the disable message in the log: > > retrieve_plocks ckpt open error 12 lvclusdidiz0360 > > Neale > > On Sep 2, 2014, at 11:37 AM, David Teigland wrote: > >> On Tue, Sep 02, 2014 at 02:56:52PM +0000, Neale Ferguson wrote: >> >>> 1409631951 lockspace lvclusdidiz0360 >>> plock disabled our sig 816fba01 nodeid 2 sig 2f6b >> >> There is a difference in plock data signatures between the node that wrote >> the data and the node that read it (this one). This indicates that the >> plock data was not synced correctly by the openais/corosync checkpoints, >> or that the signatures were not synced correctly (e.g bug 623816). 
>> > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From teigland at redhat.com Tue Sep 2 16:42:36 2014 From: teigland at redhat.com (David Teigland) Date: Tue, 2 Sep 2014 11:42:36 -0500 Subject: [Linux-cluster] F_SETLK fails after recovery In-Reply-To: References: <20140902153710.GB374@redhat.com> Message-ID: <20140902164236.GD374@redhat.com> On Tue, Sep 02, 2014 at 04:24:07PM +0000, Neale Ferguson wrote: > In retrieve_plocks_stored() there is the code: > > retrieve_plocks(ls, &sig); > > if ((hd->flags & DLM_MFLG_PLOCK_SIG) && (sig != hd->msgdata2)) { > log_error("lockspace %s plock disabled our sig %x " > "nodeid %d sig %x", ls->name, sig, hd->nodeid, > hd->msgdata2); > ls->disable_plock = 1; > ls->need_plocks = 1; /* don't set HAVEPLOCK */ > ls->save_plocks = 0; > return; > } We need to sort out which nodes are sending/receiving plock data to/from each other. The way it's supposed to work, is an existing node is supposed to write its plock data into a checkpoint, then do send_plocks_stored() to notify the new node that the data is ready. The new node is then supposed to receive_plocks_stored(), and read the plock data from the checkpoint. I could get a better picture if you save and send the output of dlm_tool dump > dlm_dump.txt dlm_tool log_plock > dlm_plock.txt after the problem occurs. > Node 1 is getting rc=12 from saCkptCheckpointOpen > (SA_AIS_ERR_NOT_EXIST). However, this error is ignored and we process > the sig value as if is valid rather than an uninitialized value that was > never set by the retrieve_plocks() function. So I guess the question is > why can't it find the checkpoint file and/or what is the correct action > when the sig value cannot be retrieved? > > Neale > > On Sep 2, 2014, at 12:02 PM, Neale Ferguson wrote: > > > Thanks David, > > That makes sense as there's this message that precedes the disable message in the log: > > > > retrieve_plocks ckpt open error 12 lvclusdidiz0360 > > > > Neale > > > > On Sep 2, 2014, at 11:37 AM, David Teigland wrote: > > > >> On Tue, Sep 02, 2014 at 02:56:52PM +0000, Neale Ferguson wrote: > >> > >>> 1409631951 lockspace lvclusdidiz0360 > >>> plock disabled our sig 816fba01 nodeid 2 sig 2f6b > >> > >> There is a difference in plock data signatures between the node that wrote > >> the data and the node that read it (this one). This indicates that the > >> plock data was not synced correctly by the openais/corosync checkpoints, > >> or that the signatures were not synced correctly (e.g bug 623816). From neale at sinenomine.net Tue Sep 2 16:48:22 2014 From: neale at sinenomine.net (Neale Ferguson) Date: Tue, 2 Sep 2014 16:48:22 +0000 Subject: [Linux-cluster] F_SETLK fails after recovery In-Reply-To: <20140902164236.GD374@redhat.com> References: <20140902153710.GB374@redhat.com> <20140902164236.GD374@redhat.com> Message-ID: The logs from the recovering node are attached. If you need the same from the other node I will get them tonight. On Sep 2, 2014, at 12:42 PM, David Teigland wrote: > We need to sort out which nodes are sending/receiving plock data to/from > each other. The way it's supposed to work, is an existing node is > supposed to write its plock data into a checkpoint, then do > send_plocks_stored() to notify the new node that the data is ready. The > new node is then supposed to receive_plocks_stored(), and read the plock > data from the checkpoint. 
> > I could get a better picture if you save and send the output of > dlm_tool dump > dlm_dump.txt > dlm_tool log_plock > dlm_plock.txt > > after the problem occurs. -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: dlm_log_plock.txt URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: dlm_dump.txt URL: From urgrue at bulbous.org Wed Sep 3 09:07:30 2014 From: urgrue at bulbous.org (urgrue) Date: Wed, 03 Sep 2014 11:07:30 +0200 Subject: [Linux-cluster] Possible to apply changes without restart? Message-ID: <1409735250.4045060.163056409.18075E72@webmail.messagingengine.com> Hi, Using cman/rgmanager in RHEL6 - is it possible to add a resource to my service and have it be picked up and started without having to restart cman/rgmanager? I thought ccs --activate did this, and the rgmanager.log does output: Sep 03 10:50:07 rgmanager Stopping changed resources. Sep 03 10:50:07 rgmanager Restarting changed resources. Sep 03 10:50:07 rgmanager Starting changed resources. But the resource I added is not running nor is there any mention of it in the logs. Is it normal or did I do something wrong? From white.heron at yahoo.com Wed Sep 3 16:32:44 2014 From: white.heron at yahoo.com (Tan Sri Dato' Eur Ing Adli) Date: Thu, 4 Sep 2014 00:32:44 +0800 Subject: [Linux-cluster] F_SETLK fails after recovery In-Reply-To: References: <20140902153710.GB374@redhat.com> <20140902164236.GD374@redhat.com> Message-ID: <34E6004E-692B-4B3B-A71A-7F5D30615836@yahoo.com> Well, gimme the other node log too. Sent from my iPhone > On Sep 3, 2014, at 12:48 AM, Neale Ferguson wrote: > > The logs from the recovering node are attached. If you need the same from the other node I will get them tonight. > >> On Sep 2, 2014, at 12:42 PM, David Teigland wrote: >> >> We need to sort out which nodes are sending/receiving plock data to/from >> each other. The way it's supposed to work, is an existing node is >> supposed to write its plock data into a checkpoint, then do >> send_plocks_stored() to notify the new node that the data is ready. The >> new node is then supposed to receive_plocks_stored(), and read the plock >> data from the checkpoint. >> >> I could get a better picture if you save and send the output of >> dlm_tool dump > dlm_dump.txt >> dlm_tool log_plock > dlm_plock.txt >> >> after the problem occurs. > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From neale at sinenomine.net Wed Sep 3 16:50:20 2014 From: neale at sinenomine.net (Neale Ferguson) Date: Wed, 3 Sep 2014 16:50:20 +0000 Subject: [Linux-cluster] F_SETLK fails after recovery In-Reply-To: <34E6004E-692B-4B3B-A71A-7F5D30615836@yahoo.com> References: <20140902153710.GB374@redhat.com> <20140902164236.GD374@redhat.com> <34E6004E-692B-4B3B-A71A-7F5D30615836@yahoo.com> Message-ID: <18097A19-70D6-4169-AB71-3C188C90EB58@sinenomine.net> Will do. Having trouble accessing that system at the moment. I hope to get it later today. Neale On Sep 3, 2014, at 12:32 PM, Tan Sri Dato' Eur Ing Adli wrote: > > Well, gimme the other node log too. > > > Sent from my iPhone > >> On Sep 3, 2014, at 12:48 AM, Neale Ferguson wrote: >> >> The logs from the recovering node are attached. If you need the same from the other node I will get them tonight. >> >>> On Sep 2, 2014, at 12:42 PM, David Teigland wrote: >>> >>> We need to sort out which nodes are sending/receiving plock data to/from >>> each other. 
The way it's supposed to work, is an existing node is >>> supposed to write its plock data into a checkpoint, then do >>> send_plocks_stored() to notify the new node that the data is ready. The >>> new node is then supposed to receive_plocks_stored(), and read the plock >>> data from the checkpoint. >>> >>> I could get a better picture if you save and send the output of >>> dlm_tool dump > dlm_dump.txt >>> dlm_tool log_plock > dlm_plock.txt >>> >>> after the problem occurs. >> >> >> From white.heron at yahoo.com Thu Sep 4 16:41:43 2014 From: white.heron at yahoo.com (Tan Sri Dato' Eur Ing Adli) Date: Fri, 5 Sep 2014 00:41:43 +0800 Subject: [Linux-cluster] F_SETLK fails after recovery In-Reply-To: <18097A19-70D6-4169-AB71-3C188C90EB58@sinenomine.net> References: <20140902153710.GB374@redhat.com> <20140902164236.GD374@redhat.com> <34E6004E-692B-4B3B-A71A-7F5D30615836@yahoo.com> <18097A19-70D6-4169-AB71-3C188C90EB58@sinenomine.net> Message-ID: Ur cluster system got database? Is it staging server or production server? Don't hesitate to call me at 60173623661 Sent from my iPhone > On Sep 4, 2014, at 12:50 AM, Neale Ferguson wrote: > > Will do. Having trouble accessing that system at the moment. I hope to get it later today. > > Neale > >> On Sep 3, 2014, at 12:32 PM, Tan Sri Dato' Eur Ing Adli wrote: >> >> >> Well, gimme the other node log too. >> >> >> Sent from my iPhone >> >>> On Sep 3, 2014, at 12:48 AM, Neale Ferguson wrote: >>> >>> The logs from the recovering node are attached. If you need the same from the other node I will get them tonight. >>> >>>> On Sep 2, 2014, at 12:42 PM, David Teigland wrote: >>>> >>>> We need to sort out which nodes are sending/receiving plock data to/from >>>> each other. The way it's supposed to work, is an existing node is >>>> supposed to write its plock data into a checkpoint, then do >>>> send_plocks_stored() to notify the new node that the data is ready. The >>>> new node is then supposed to receive_plocks_stored(), and read the plock >>>> data from the checkpoint. >>>> >>>> I could get a better picture if you save and send the output of >>>> dlm_tool dump > dlm_dump.txt >>>> dlm_tool log_plock > dlm_plock.txt >>>> >>>> after the problem occurs. >>> >>> >>> > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From anprice at redhat.com Fri Sep 5 18:34:13 2014 From: anprice at redhat.com (Andrew Price) Date: Fri, 05 Sep 2014 19:34:13 +0100 Subject: [Linux-cluster] gfs2-utils 3.1.7 released Message-ID: <540A0225.8060200@redhat.com> Hi, I'm happy to announce that gfs2-utils 3.1.7 has been released. Notable changes include: * Journal layout and performance improvements in mkfs.gfs2: mkfs.gfs2 now allocates and writes resource groups and journal blocks in-order, improving the time it takes to create a gfs2 file system. Journals are laid out as single extents; when the journal size is larger than the resource group size, the initial resource groups are sized specifically to accommodate contiguous journals. * fsck.gfs2 performance and other improvements fsck.gfs2 now makes use of read-ahead to speed up inode processing which can improve the performance of fsck.gfs2 runs for certain workloads. It also includes timing and logging instrumentation in order to help debug and monitor performance changes in fsck.gfs2. Handling of journal indirect block corruption has also been improved. 
* Test suite enhancements gfs2-utils now uses GNU Autotest which is a testing framework integrated with Autotools-based build systems. More new test cases have been added and we will continue to improve coverage over time. * Bug fixes Many bug fixes and improvements across the board based on testing, user feedback and static analysis results. See below for a full list of changes. The source tarball is available from: https://fedorahosted.org/released/gfs2-utils/gfs2-utils-3.1.7.tar.gz Please test, and do make sure to report bugs, whether they're crashers or typos. Please file them against the gfs2-utils component of Fedora rawhide: https://bugzilla.redhat.com/enter_bug.cgi?product=Fedora&component=gfs2-utils&version=rawhide Regards, Andy * Changes since 3.1.6: Abhi Das (3): libgfs2: patch to update gfs1 superblock correctly gfs2-utils: check and fix bad dinode pointers in gfs1 sb fsck.gfs2: fix corner case sb_seg_size correction for single journal fs Andrew Price (79): gfs2_grow: Don't try to open an empty string gfs2-utils tests: Switch to autotest fsck.gfs2: Allocate enough space for the block map libgfs2: Move the BLOCKMAP_* macros into fsck.gfs2 libgfs2: Add sd_heightsize bounds checking in read_sb fsck.gfs2: Fix block size validation gfs2l: Build with -D_FILE_OFFSET_BITS=64 libgfs2: Add lgfs2_open_mnt* functions Switch is_pathname_mounted callers to lgfs2_open_mnt* libgfs2: Remove is_pathname_mounted libgfs2: Remove sdp argument from compute_heightsize libgfs2: Remove sdp and j arguments from write_journal libgfs2: Rework find_metapath libgfs2: Improve and simplify blk_alloc_in_rg mkfs.gfs2 tests: Enable debug output libgfs2: Refactor block allocation functions gfs2-utils: Clean up unused functions libgfs2: Remove exit call from build_rgrps gfs2_edit: Convert fssize to bytes before reporting fs size gfs2-utils: Remove duplicate definitions of gfs1 structs libgfs2: Superblock building and writing improvements gfs2-utils: Ensure sb_uuid uses are guarded libgfs2: Add support for new leaf hint fields mkfs.gfs2: Remove a dead structure libgfs2: Remove another exit() call gfs2-utils: Fix up some errors reported by clang gfs2_edit: Use the metadata description in get_block_type gfs2_edit: More static analysis fixes gfs2-utils: Fail to configure if flex is not found mkfs.gfs2: Make dev a member of mkfs_opts libgfs2: Add lgfs2_space_for_data() libgfs2: Don't try to read more than IOV_MAX iovecs mkfs.gfs2: Fix the resource group layout strategy, again libgfs2: Don't call gfs2_blk2rgrpd in gfs2_set_bitmap libgfs2: Add abstractions for rgrp tree traversal libgfs2: Split out the rindex calculation from lgfs2_rgrp_append libgfs2: Consolidate rgrp_tree and bitstruct allocations libgfs2: Add a lgfs2_rindex_read_fd() function libgfs2: Const-ify the 'ri' argument to gfs2_rindex_out libgfs2: Fix off-by-one in lgfs2_rgrps_plan libgfs2: Stick to the (rgrp) plan in lgfs2_rindex_entry_new gfs2_grow: Migrate to the new resource group API gfs2_grow: Add stripe alignment fsck.gfs2: Log to syslog on start and exit gfs2-utils: Expressly expunge 'expert mode' libgfs2: Remove UI fields from struct gfs2_sbd libgfs2: Remove debug field from gfs2_sbd libgfs2: Remove logging API gfs2_edit: Add a savemeta file metadata header gfs2_edit: savemeta and restoremeta improvements gfs2_edit: Fix parsing the savemeta -z option gfs2-utils: Fix two logic errors gfs2_edit: Ensure all leaf blocks in per_node are saved gfs2-utils tests: Add small-block savemeta tests libgfs2: Zero de_inum.no_addr when deleting dirents libgfs2: Keep 
a pointer to the sbd in lgfs2_rgrps_t libgfs2: Move bitmap buffers inside struct gfs2_bitmap libgfs2: Fix an impossible loop condition in gfs2_rgrp_read libgfs2: Introduce struct lgfs2_rbm libgfs2: Move struct _lgfs2_rgrps into rgrp.h libgfs2: Add functions for finding free extents tests: Add unit tests for the new extent search functions libgfs2: Ignore an empty rgrp plan if a length is specified libgfs2: Add back-pointer to rgrps in lgfs2_rgrp_t libgfs2: Const-ify the parameters of print functions libgfs2: Allow init_dinode to accept a preallocated bh libgfs2: Add extent allocation functions libgfs2: Add support for allocating entire rgrp headers libgfs2: Write file metadata sequentially libgfs2: Fix alignment in lgfs2_rgsize_for_data libgfs2: Handle non-zero bitmaps in lgfs2_rgrp_write libgfs2: Add a speedier journal data block writing function libgfs2: Create jindex directory separately from journals mkfs.gfs2: Improve journal creation performance mkfs.gfs2: Don't search the bitmaps to allocate journals gfs2l: Fix uninitialised string warning gfs2-utils: Remove target.mk files gfs2-utils: Update translation template libgfs2: Move unused gfs2_bmap into fsck.gfs2 Bob Peterson (14): fsck.gfs2: Add ability to detect journal inode indirect block corruption gfs2_tool: catch interrupts while the metafs is mounted gfs2_edit: Add "journals" option to print journal info fsck.gfs2: Check and repair per_node contents such as quota_changeX gfs2_edit: Separate out the journal-related functions to journal.c gfs2_edit: Add more intelligence to journal dumps gfs2_edit: Report referencing block address in the new journal code gfs2_edit: Improve log descriptor reference code gfs2_edit: mark log headers with the unmounted flag gfs2_edit: print LB (log descriptor continuation blocks) for GFS2 gfs2_edit: Print block types with log descriptors fsck.gfs2: time each of the passes fsck.gfs2: Issue read-ahead for dinodes in each bitmap fsck.gfs2: File read-ahead Shane Bradley (1): gfs2_lockcapture: Fixed header to use absolute path and modified some options From fdinitto at redhat.com Mon Sep 8 10:30:23 2014 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Mon, 08 Sep 2014 12:30:23 +0200 Subject: [Linux-cluster] [RFC] Organizing HA Summit 2015 Message-ID: <540D853F.3090109@redhat.com> All, it's been almost 6 years since we had a face to face meeting for all developers and vendors involved in Linux HA. I'd like to try and organize a new event and piggy-back with DevConf in Brno [1]. DevConf will start Friday the 6th of Feb 2015 in Red Hat Brno offices. My suggestion would be to have a 2 days dedicated HA summit the 4th and the 5th of February. The goal for this meeting is to, beside to get to know each other and all social aspect of those events, tune the directions of the various HA projects and explore common areas of improvements. I am also very open to the idea of extending to 3 days, 1 one dedicated to customers/users and 2 dedicated to developers, by starting the 3rd. Thoughts? Fabio PS Please hit reply all or include me in CC just to make sure I'll see an answer :) [1] http://devconf.cz/ From lists at alteeve.ca Mon Sep 8 13:30:52 2014 From: lists at alteeve.ca (Digimer) Date: Mon, 08 Sep 2014 09:30:52 -0400 Subject: [Linux-cluster] [Pacemaker] [RFC] Organizing HA Summit 2015 In-Reply-To: <540D853F.3090109@redhat.com> References: <540D853F.3090109@redhat.com> Message-ID: <540DAF8C.8060002@alteeve.ca> On 08/09/14 06:30 AM, Fabio M. 
Di Nitto wrote: > All, > > it's been almost 6 years since we had a face to face meeting for all > developers and vendors involved in Linux HA. > > I'd like to try and organize a new event and piggy-back with DevConf in > Brno [1]. > > DevConf will start Friday the 6th of Feb 2015 in Red Hat Brno offices. > > My suggestion would be to have a 2 days dedicated HA summit the 4th and > the 5th of February. > > The goal for this meeting is to, beside to get to know each other and > all social aspect of those events, tune the directions of the various HA > projects and explore common areas of improvements. > > I am also very open to the idea of extending to 3 days, 1 one dedicated > to customers/users and 2 dedicated to developers, by starting the 3rd. > > Thoughts? > > Fabio > > PS Please hit reply all or include me in CC just to make sure I'll see > an answer :) > > [1] http://devconf.cz/ I think this is a good idea. 3 days may be a good idea, as well. I think I would be more useful trying to bring the user's perspective more so than a developer's. So on that, I would like to propose a discussion on merging some of the disparate lists, channels, sites, etc. to help simplify life for new users looking for help from or to wanting to join the HA community. I also understand that Fabio will buy the first round of drinks. >:) -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? From neale at sinenomine.net Mon Sep 8 14:44:49 2014 From: neale at sinenomine.net (Neale Ferguson) Date: Mon, 8 Sep 2014 14:44:49 +0000 Subject: [Linux-cluster] F_SETLK fails after recovery In-Reply-To: <45B4E9CB-51EC-4D76-A266-A6CA8FBBA5A6@sinenomine.net> References: <20140902153710.GB374@redhat.com> <20140902164236.GD374@redhat.com> <34E6004E-692B-4B3B-A71A-7F5D30615836@yahoo.com> <18097A19-70D6-4169-AB71-3C188C90EB58@sinenomine.net> <45B4E9CB-51EC-4D76-A266-A6CA8FBBA5A6@sinenomine.net> Message-ID: <62A6C03B-BC53-453A-81DE-821FB58F0BA9@sinenomine.net> Further to the problem described last week. 
What I'm seeing is that the node (NODE2) that keeps going when NODE1 fails has many entries in dlm_tool log_plocks output: 1410147734 lvclusdidiz0360 receive plock 10303 LK WR 0-7fffffffffffffff 1/8112/13d5000 w 0 1410147734 lvclusdidiz0360 receive plock 1305c LK WR 0-7fffffffffffffff 1/8147/1390400 w 0 1410147734 lvclusdidiz0360 receive plock 50081 LK WR 0-7fffffffffffffff 1/8182/7ce04400 w 0 1410147736 lvclusdidiz0360 receive plock 10303 LK WR 0-7fffffffffffffff 1/8112/13d5000 w 0 1410147736 lvclusdidiz0360 receive plock 1305c LK WR 0-7fffffffffffffff 1/8147/1390400 w 0 1410147736 lvclusdidiz0360 receive plock 50081 LK WR 0-7fffffffffffffff 1/8182/7ce04400 w 0 1410147738 lvclusdidiz0360 receive plock 10303 LK WR 0-7fffffffffffffff 1/8112/13d5000 w 0 1410147738 lvclusdidiz0360 receive plock 1305c LK WR 0-7fffffffffffffff 1/8147/1390400 w 0 1410147738 lvclusdidiz0360 receive plock 50081 LK WR 0-7fffffffffffffff 1/8182/7ce04400 w 0 1410147740 lvclusdidiz0360 receive plock 10303 LK WR 0-7fffffffffffffff 1/8112/13d5000 w 0 1410147740 lvclusdidiz0360 receive plock 1305c LK WR 0-7fffffffffffffff 1/8147/1390400 w 0 1410147740 lvclusdidiz0360 receive plock 50081 LK WR 0-7fffffffffffffff 1/8182/7ce04400 w 0 1410147742 lvclusdidiz0360 receive plock 10303 LK WR 0-7fffffffffffffff 1/8112/13d5000 w 0 1410147742 lvclusdidiz0360 receive plock 1305c LK WR 0-7fffffffffffffff 1/8147/1390400 w 0 1410147742 lvclusdidiz0360 receive plock 50081 LK WR 0-7fffffffffffffff 1/8182/7ce04400 w 0 i.e. with no corresponding unlock entry. NODE1 is brought down by init 6 and when it restarts it gets as far as "Starting cman" before NODE2 fences it (I assume we need a higher post_join_delay). When the node is fenced I see: 1410147774 clvmd purged 0 plocks for 1 1410147774 lvclusdidiz0360 purged 3 plocks for 1 So it looks like it tried to some clean up but then when NODE1 attempts to join NODE2 examines the lockspace and reports the following: 1410147820 lvclusdidiz0360 wr sect ro 0 rf 0 len 40 "r78067.0" 1410147820 lvclusdidiz0360 wr sect ro 0 rf 0 len 40 "r78068.0" 1410147820 lvclusdidiz0360 wr sect ro 0 rf 0 len 40 "r78059.0" 1410147820 lvclusdidiz0360 wr sect ro 0 rf 0 len 40 "r88464.0" 1410147820 lvclusdidiz0360 wr sect ro 0 rf 0 len 40 "r88478.0" 1410147820 lvclusdidiz0360 store_plocks first 66307 last 88478 r_count 45 p_count 63 sig 5ab0 1410147820 lvclusdidiz0360 receive_plocks_stored 2:8 flags a sig 5ab0 need_plocks 0 So it believes NODE1 will have 45 plocks to process when it comes back. NODE1 receives that plock information: 1410147820 lvclusdidiz0360 set_plock_ckpt_node from 0 to 2 1410147820 lvclusdidiz0360 receive_plocks_stored 2:8 flags a sig 5ab0 need_plocks 1 However, when NODE1 attempts to retrieve plocks it reports: 1410147820 lvclusdidiz0360 retrieve_plocks 1410147820 lvclusdidiz0360 retrieve_plocks first 0 last 0 r_count 0 p_count 0 sig 0 Because of the mismatch between sig 0 and sig 5ab0 plocks get disabled and the F_SETLK operation on the gfs2 target will fail on NODE1. I'm am try to understand the checkpointing process and from where this information is actually being retrieved. 
Neale From teigland at redhat.com Mon Sep 8 15:17:57 2014 From: teigland at redhat.com (David Teigland) Date: Mon, 8 Sep 2014 10:17:57 -0500 Subject: [Linux-cluster] F_SETLK fails after recovery In-Reply-To: <62A6C03B-BC53-453A-81DE-821FB58F0BA9@sinenomine.net> References: <20140902153710.GB374@redhat.com> <20140902164236.GD374@redhat.com> <34E6004E-692B-4B3B-A71A-7F5D30615836@yahoo.com> <18097A19-70D6-4169-AB71-3C188C90EB58@sinenomine.net> <45B4E9CB-51EC-4D76-A266-A6CA8FBBA5A6@sinenomine.net> <62A6C03B-BC53-453A-81DE-821FB58F0BA9@sinenomine.net> Message-ID: <20140908151757.GC21311@redhat.com> On Mon, Sep 08, 2014 at 02:44:49PM +0000, Neale Ferguson wrote: > 1410147820 lvclusdidiz0360 store_plocks first 66307 last 88478 r_count 45 p_count 63 sig 5ab0 > 1410147820 lvclusdidiz0360 receive_plocks_stored 2:8 flags a sig 5ab0 need_plocks 0 > 1410147820 lvclusdidiz0360 set_plock_ckpt_node from 0 to 2 > 1410147820 lvclusdidiz0360 receive_plocks_stored 2:8 flags a sig 5ab0 need_plocks 1 > > However, when NODE1 attempts to retrieve plocks it reports: > > 1410147820 lvclusdidiz0360 retrieve_plocks > 1410147820 lvclusdidiz0360 retrieve_plocks first 0 last 0 r_count 0 p_count 0 sig 0 You mentioned previously that it reported an error attempting to open the checkpoint (SA_AIS_ERR_NOT_EXIST) in retrieve_plocks. That's a slightly different error than successfully opening the checkpoint and finding it empty, although both have the same effect. Did the other node report an errors when it attempted to create this checkpoint? > Because of the mismatch between sig 0 and sig 5ab0 plocks get disabled and the F_SETLK operation on the gfs2 target will fail on NODE1. > > I'm am try to understand the checkpointing process and from where this information is actually being retrieved. The checkpoints have always been a source of problems, both from the user side in dlm_controld, and from the implementation in corosync/openais. I added the signatures to detect these problems more directly (and quit using checkpoints altogether in the RHEL7 version.) In this case it's not yet clear which side is responsible for the problem. If it's on the dlm_controld side, then it's probably related to unlinking or not unlinking a previous checkpoint, which causes subsequent failures when creating new checkpoints. 
From neale at sinenomine.net Mon Sep 8 15:35:05 2014 From: neale at sinenomine.net (Neale Ferguson) Date: Mon, 8 Sep 2014 15:35:05 +0000 Subject: [Linux-cluster] F_SETLK fails after recovery In-Reply-To: <20140908151757.GC21311@redhat.com> References: <20140902153710.GB374@redhat.com> <20140902164236.GD374@redhat.com> <34E6004E-692B-4B3B-A71A-7F5D30615836@yahoo.com> <18097A19-70D6-4169-AB71-3C188C90EB58@sinenomine.net> <45B4E9CB-51EC-4D76-A266-A6CA8FBBA5A6@sinenomine.net> <62A6C03B-BC53-453A-81DE-821FB58F0BA9@sinenomine.net> <20140908151757.GC21311@redhat.com> Message-ID: On Sep 8, 2014, at 11:17 AM, David Teigland wrote: > On Mon, Sep 08, 2014 at 02:44:49PM +0000, Neale Ferguson wrote: >> 1410147820 lvclusdidiz0360 store_plocks first 66307 last 88478 r_count 45 p_count 63 sig 5ab0 >> 1410147820 lvclusdidiz0360 receive_plocks_stored 2:8 flags a sig 5ab0 need_plocks 0 > >> 1410147820 lvclusdidiz0360 set_plock_ckpt_node from 0 to 2 >> 1410147820 lvclusdidiz0360 receive_plocks_stored 2:8 flags a sig 5ab0 need_plocks 1 >> >> However, when NODE1 attempts to retrieve plocks it reports: >> >> 1410147820 lvclusdidiz0360 retrieve_plocks >> 1410147820 lvclusdidiz0360 retrieve_plocks first 0 last 0 r_count 0 p_count 0 sig 0 > > You mentioned previously that it reported an error attempting to open the > checkpoint (SA_AIS_ERR_NOT_EXIST) in retrieve_plocks. That's a slightly > different error than successfully opening the checkpoint and finding it > empty, although both have the same effect. Did the other node report an > errors when it attempted to create this checkpoint? That problem still exists but it appears to be related to the clvmd lockspace. I'm still looking at this but I'm also looking at the lockspace that corresponds to the gfs2 target. Here's the what Node1 says about the clvmd checkpoint area: 1410147820 clvmd set_plock_ckpt_node from 0 to 2 1410147820 clvmd receive_plocks_stored 2:9 flags a sig 0 need_plocks 1 1410147820 clvmd match_change 2:9 matches cg 1 1410147820 clvmd retrieve_plocks 1410147820 retrieve_plocks ckpt open error 12 clvmd 1410147820 lockspace clvmd plock disabled our sig bbfa1301 nodeid 2 sig 0 Node 2 has this to say about clvmd: 1410147820 clvmd set_plock_ckpt_node from 2 to 2 1410147820 clvmd store_plocks saved ckpt uptodate 1410147820 clvmd store_plocks first 0 last 0 r_count 0 p_count 0 sig 0 1410147820 clvmd send_plocks_stored cg 9 flags a data2 0 counts 8 2 1 0 0 1410147820 clvmd receive_plocks_stored 2:9 flags a sig 0 need_plocks 0 As for the gfs2 lockspace, Node 2 reports this when dealing with the checkpoint area: 1410147820 lvclusdidiz0360 set_plock_ckpt_node from 2 to 2 1410147820 lvclusdidiz0360 unlink ckpt 520eedd100000002 1410147820 lvclusdidiz0360 unlink ckpt error 12 lvclusdidiz0360 1410147820 lvclusdidiz0360 unlink ckpt status error 12 lvclusdidiz0360 1410147820 unlink ckpt 520eedd100000002 close err 12 lvclusdidiz0360 1410147820 lvclusdidiz0360 store_plocks r_count 45 p_count 63 total_size 2520 max_section_size 280 1410147820 lvclusdidiz0360 store_plocks open ckpt handle 6157409500000003 1410147820 lvclusdidiz0360 store_plocks first 66307 last 88478 r_count 45 p_count 63 sig 5ab0 > >> Because of the mismatch between sig 0 and sig 5ab0 plocks get disabled and the F_SETLK operation on the gfs2 target will fail on NODE1. >> >> I'm am try to understand the checkpointing process and from where this information is actually being retrieved. 
> > The checkpoints have always been a source of problems, both from the user > side in dlm_controld, and from the implementation in corosync/openais. I > added the signatures to detect these problems more directly (and quit > using checkpoints altogether in the RHEL7 version.) In this case it's not > yet clear which side is responsible for the problem. If it's on the > dlm_controld side, then it's probably related to unlinking or not > unlinking a previous checkpoint, which causes subsequent failures when > creating new checkpoints. I'm still not groking the checkpoint process. Where is this checkpoint information kept? Also, when I try an imitate the situation by holding a R/W lock and then causing that node to restart without shutting down (and releasing the lock), the other node purges the lock when it detects the failing node has disappeared. I don't understand why the locks reported in the previous mail aren't purged as well. Thanks for your comments, every bit helps me understand. Neale From teigland at redhat.com Mon Sep 8 16:15:17 2014 From: teigland at redhat.com (David Teigland) Date: Mon, 8 Sep 2014 11:15:17 -0500 Subject: [Linux-cluster] F_SETLK fails after recovery In-Reply-To: References: <20140902164236.GD374@redhat.com> <34E6004E-692B-4B3B-A71A-7F5D30615836@yahoo.com> <18097A19-70D6-4169-AB71-3C188C90EB58@sinenomine.net> <45B4E9CB-51EC-4D76-A266-A6CA8FBBA5A6@sinenomine.net> <62A6C03B-BC53-453A-81DE-821FB58F0BA9@sinenomine.net> <20140908151757.GC21311@redhat.com> Message-ID: <20140908161517.GD21311@redhat.com> On Mon, Sep 08, 2014 at 03:35:05PM +0000, Neale Ferguson wrote: > As for the gfs2 lockspace, Node 2 reports this when dealing with the checkpoint area: > > 1410147820 lvclusdidiz0360 set_plock_ckpt_node from 2 to 2 > 1410147820 lvclusdidiz0360 unlink ckpt 520eedd100000002 > 1410147820 lvclusdidiz0360 unlink ckpt error 12 lvclusdidiz0360 > 1410147820 lvclusdidiz0360 unlink ckpt status error 12 lvclusdidiz0360 > 1410147820 unlink ckpt 520eedd100000002 close err 12 lvclusdidiz0360 > 1410147820 lvclusdidiz0360 store_plocks r_count 45 p_count 63 total_size 2520 max_section_size 280 > 1410147820 lvclusdidiz0360 store_plocks open ckpt handle 6157409500000003 > 1410147820 lvclusdidiz0360 store_plocks first 66307 last 88478 r_count 45 p_count 63 sig 5ab0 Creating and writing the new checkpoint appears to have worked, despite the errors about unlinking the previous one. The problem may come after this when corosync attempts to transfer the checkpoint to the other node. > I'm still not groking the checkpoint process. Where is this checkpoint > information kept? The checkpoint data is sent to corosync/openais, which is responsible for syncing that data to the other nodes, which should then be able to open and read it. You'll also want to look for corosync/openais errors related to checkpoints. > Also, when I try an imitate the situation by holding a R/W lock and then > causing that node to restart without shutting down (and releasing the > lock), the other node purges the lock when it detects the failing node > has disappeared. I don't understand why the locks reported in the > previous mail aren't purged as well. The problem is almost certainly with the operation of the checkpoints, not with the locking. 
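The checkpoint calls involved here are the SAF AIS Checkpoint service that openais provides (the saCkpt* API). As a rough illustration of the retrieve side only - open the named checkpoint, iterate its sections (one per inode, e.g. "r78067.0" in the dump above), and read each section's plock records - here is a minimal sketch. It is loosely modeled on the retrieve_plocks() flow rather than taken from the dlm_controld source, and the header paths, the "dlmplock.<lockspace>" checkpoint name and the buffer size are assumptions made for the example:

/* Minimal sketch: read a plock checkpoint through the openais SAF AIS
 * Checkpoint API. Illustration only - not the dlm_controld source; the
 * checkpoint name format, include paths and buffer size are assumed. */
#include <stdio.h>
#include <string.h>
#include <openais/saAis.h>
#include <openais/saCkpt.h>

int main(void)
{
	SaCkptHandleT h;
	SaCkptCheckpointHandleT ch;
	SaCkptSectionIterationHandleT it;
	SaCkptSectionDescriptorT desc;
	SaCkptIOVectorElementT iov;
	SaVersionT ver = { 'B', 1, 1 };
	SaNameT name;
	SaUint32T bad;
	SaAisErrorT rv;
	char buf[280];   /* matches max_section_size seen in the store_plocks log */

	/* assumed name format: one checkpoint per lockspace */
	name.length = snprintf((char *)name.value, SA_MAX_NAME_LENGTH,
			       "dlmplock.lvclusdidiz0360");

	if (saCkptInitialize(&h, NULL, &ver) != SA_AIS_OK)
		return 1;

	/* a return of 12 (SA_AIS_ERR_NOT_EXIST) here corresponds to the
	   "retrieve_plocks ckpt open error 12" messages in the logs */
	rv = saCkptCheckpointOpen(h, &name, NULL, SA_CKPT_CHECKPOINT_READ, 0, &ch);
	if (rv != SA_AIS_OK) {
		printf("ckpt open error %d\n", rv);
		saCkptFinalize(h);
		return 1;
	}

	/* each section holds the plock state for one inode */
	if (saCkptSectionIterationInitialize(ch, SA_CKPT_SECTIONS_ANY, 0, &it) == SA_AIS_OK) {
		while (saCkptSectionIterationNext(it, &desc) == SA_AIS_OK) {
			memset(&iov, 0, sizeof(iov));
			iov.sectionId = desc.sectionId;
			iov.dataBuffer = buf;
			iov.dataSize = sizeof(buf);
			iov.dataOffset = 0;
			if (saCkptCheckpointRead(ch, &iov, 1, &bad) != SA_AIS_OK)
				break;
			printf("section %.*s: %llu bytes of plock data\n",
			       (int)desc.sectionId.idLen, desc.sectionId.id,
			       (unsigned long long)iov.readSize);
		}
		saCkptSectionIterationFinalize(it);
	}
	saCkptCheckpointClose(ch);
	saCkptFinalize(h);
	return 0;
}

The store side is the mirror image: unlink any stale checkpoint, create a new one sized from the resource and plock counts, write one section per inode, and then send_plocks_stored() carries the signature that the reading node compares against what it actually managed to read back.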
From neale at sinenomine.net Mon Sep 8 17:02:28 2014 From: neale at sinenomine.net (Neale Ferguson) Date: Mon, 8 Sep 2014 17:02:28 +0000 Subject: [Linux-cluster] F_SETLK fails after recovery In-Reply-To: <20140908161517.GD21311@redhat.com> References: <20140902164236.GD374@redhat.com> <34E6004E-692B-4B3B-A71A-7F5D30615836@yahoo.com> <18097A19-70D6-4169-AB71-3C188C90EB58@sinenomine.net> <45B4E9CB-51EC-4D76-A266-A6CA8FBBA5A6@sinenomine.net> <62A6C03B-BC53-453A-81DE-821FB58F0BA9@sinenomine.net> <20140908151757.GC21311@redhat.com> <20140908161517.GD21311@redhat.com> Message-ID: <3049AAB1-E9DA-47EF-A16F-FF161B0DA9D8@sinenomine.net> Will do. I'm struggling to understand the mechanics of checkpointing. When I call saCktCheckpointOpen etc. what are the entities they are dealing with? Is this information centralized on the master and disseminated to the other members? Does the information only reside in memory or is it written anywhere? I suppose what I'm asking for is there doc on openais internals that will explain this to me rather than asking naive and repetitive questions? Also, would setting: In cluster.conf capture what I need to help track this down or are there some additional entries in the section required? Thanks so much for taking the time to respond... Neale On Sep 8, 2014, at 12:15 PM, David Teigland wrote: > On Mon, Sep 08, 2014 at 03:35:05PM +0000, Neale Ferguson wrote: > > The checkpoint data is sent to corosync/openais, which is responsible for > syncing that data to the other nodes, which should then be able to open > and read it. You'll also want to look for corosync/openais errors related > to checkpoints. > >> Also, when I try an imitate the situation by holding a R/W lock and then >> causing that node to restart without shutting down (and releasing the >> lock), the other node purges the lock when it detects the failing node >> has disappeared. I don't understand why the locks reported in the >> previous mail aren't purged as well. > > The problem is almost certainly with the operation of the checkpoints, not > with the locking. > From amjadcsu at gmail.com Tue Sep 9 07:14:59 2014 From: amjadcsu at gmail.com (Amjad Syed) Date: Tue, 9 Sep 2014 10:14:59 +0300 Subject: [Linux-cluster] Physical shutdown of one node causes both node to crash in active/passive configuration of 2 node RHEL cluster Message-ID: Hi, I have setup a 2 node cluster using RHEL 6.5 . The cluster .conf looks like this The network is as follows: 1)Heartbeat (Bonding) between node 1 and node 2 using ethernet cables The ip addresses are 192.168.10.11 and 192.168.10.10 for node 1 and node 2. 2) IPMI. This is used for fencing and addresses are 10.10.63.93 and 10.10.63.92 3) External ethernet connected to 10.10.5.x network. If i do fence_node , then fencing works, However if i physically shutdown active node, the passive node also shutdowns. Even if i do ifdown bond0 (on active node), both node shutdown and have to be physically rebooted. Any thing i am doing wrong ? -------------- next part -------------- An HTML attachment was scrubbed... URL: From prasathslt at gmail.com Tue Sep 9 07:37:18 2014 From: prasathslt at gmail.com (Sivaji Prasath) Date: Tue, 9 Sep 2014 13:07:18 +0530 Subject: [Linux-cluster] Physical shutdown of one node causes both node to crash in active/passive configuration of 2 node RHEL cluster In-Reply-To: References: Message-ID: Hi, Is this two connected crossover cable with bonding ? Do you have switch in the middle ? 
Note: Red Hat does not support use of a crossover cable for cluster communication. On 9 September 2014 12:44, Amjad Syed wrote: > Hi, > > I have setup a 2 node cluster using RHEL 6.5 . > > The cluster .conf looks like this > > > > > > > > login="ADMIN" name="inspuripmi" passwd="XXXXX/> > login="test" name="hpipmi" passwd="XXXXX"/> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > The network is as follows: > > 1)Heartbeat (Bonding) between node 1 and node 2 using ethernet cables > > The ip addresses are 192.168.10.11 and 192.168.10.10 for node 1 and node 2. > > 2) IPMI. This is used for fencing and addresses are 10.10.63.93 and > 10.10.63.92 > > 3) External ethernet connected to 10.10.5.x network. > > If i do fence_node , then fencing works, > However if i physically shutdown active node, the passive node also > shutdowns. Even if i do ifdown bond0 (on active node), both node shutdown > and have to be physically rebooted. > > Any thing i am doing wrong ? > > > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From emi2fast at gmail.com Tue Sep 9 07:42:29 2014 From: emi2fast at gmail.com (emmanuel segura) Date: Tue, 9 Sep 2014 09:42:29 +0200 Subject: [Linux-cluster] Physical shutdown of one node causes both node to crash in active/passive configuration of 2 node RHEL cluster In-Reply-To: References: Message-ID: you can use fence delay, for one of two node 2014-09-09 9:37 GMT+02:00 Sivaji Prasath : > Hi, > > Is this two connected crossover cable with bonding ? Do you have switch in > the middle ? > > Note: Red Hat does not support use of a crossover cable for cluster > communication. > > On 9 September 2014 12:44, Amjad Syed wrote: >> >> Hi, >> >> I have setup a 2 node cluster using RHEL 6.5 . >> >> The cluster .conf looks like this >> >> >> >> >> >> >> >> > login="ADMIN" name="inspuripmi" passwd="XXXXX/> >> > login="test" name="hpipmi" passwd="XXXXX"/> >> >> >> >> >> >> >> > ="reboot"/> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> > recovery="relocate"> >> > sleeptime="10"/> >> >> >> >> >> >> >> The network is as follows: >> >> 1)Heartbeat (Bonding) between node 1 and node 2 using ethernet cables >> >> The ip addresses are 192.168.10.11 and 192.168.10.10 for node 1 and node >> 2. >> >> 2) IPMI. This is used for fencing and addresses are 10.10.63.93 and >> 10.10.63.92 >> >> 3) External ethernet connected to 10.10.5.x network. >> >> If i do fence_node , then fencing works, >> However if i physically shutdown active node, the passive node also >> shutdowns. Even if i do ifdown bond0 (on active node), both node shutdown >> and have to be physically rebooted. >> >> Any thing i am doing wrong ? >> >> >> >> >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -- esta es mi vida e me la vivo hasta que dios quiera From amjadcsu at gmail.com Tue Sep 9 07:49:35 2014 From: amjadcsu at gmail.com (Amjad Syed) Date: Tue, 9 Sep 2014 10:49:35 +0300 Subject: [Linux-cluster] Physical shutdown of one node causes both node to crash in active/passive configuration of 2 node RHEL cluster In-Reply-To: References: Message-ID: Yes, the two are connected crossover cable with bonding. There is not switch in middle. 
So you mean to say i need to put a switch in middle with crossover cable for cluster communication ? On Tue, Sep 9, 2014 at 10:37 AM, Sivaji Prasath wrote: > Hi, > > Is this two connected crossover cable with bonding ? Do you have switch in > the middle ? > > Note: Red Hat does not support use of a crossover cable for cluster > communication. > > On 9 September 2014 12:44, Amjad Syed wrote: > >> Hi, >> >> I have setup a 2 node cluster using RHEL 6.5 . >> >> The cluster .conf looks like this >> >> >> >> >> >> >> >> > login="ADMIN" name="inspuripmi" passwd="XXXXX/> >> > login="test" name="hpipmi" passwd="XXXXX"/> >> >> >> >> >> >> >> > ="reboot"/> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> > recovery="relocate"> >> > sleeptime="10"/> >> >> >> >> >> >> >> The network is as follows: >> >> 1)Heartbeat (Bonding) between node 1 and node 2 using ethernet cables >> >> The ip addresses are 192.168.10.11 and 192.168.10.10 for node 1 and node >> 2. >> >> 2) IPMI. This is used for fencing and addresses are 10.10.63.93 and >> 10.10.63.92 >> >> 3) External ethernet connected to 10.10.5.x network. >> >> If i do fence_node , then fencing works, >> However if i physically shutdown active node, the passive node also >> shutdowns. Even if i do ifdown bond0 (on active node), both node shutdown >> and have to be physically rebooted. >> >> Any thing i am doing wrong ? >> >> >> >> >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From prasathslt at gmail.com Tue Sep 9 07:58:17 2014 From: prasathslt at gmail.com (Sivaji Prasath) Date: Tue, 9 Sep 2014 13:28:17 +0530 Subject: [Linux-cluster] Physical shutdown of one node causes both node to crash in active/passive configuration of 2 node RHEL cluster In-Reply-To: References: Message-ID: Hi, Of course. As per the Redhat recommendation you have to put the switch in the middle between two nodes. You can read the solution and recommendation https://access.redhat.com/solutions/151203 Due to this reason only, your second server is rebooting. Best Regards, S.Prasath On 9 September 2014 13:19, Amjad Syed wrote: > Yes, the two are connected crossover cable with bonding. There is not > switch in middle. > > So you mean to say i need to put a switch in middle with crossover cable > for cluster communication ? > > On Tue, Sep 9, 2014 at 10:37 AM, Sivaji Prasath > wrote: > >> Hi, >> >> Is this two connected crossover cable with bonding ? Do you have switch >> in the middle ? >> >> Note: Red Hat does not support use of a crossover cable for cluster >> communication. >> >> On 9 September 2014 12:44, Amjad Syed wrote: >> >>> Hi, >>> >>> I have setup a 2 node cluster using RHEL 6.5 . >>> >>> The cluster .conf looks like this >>> >>> >>> >>> >>> >>> >>> >>> >> login="ADMIN" name="inspuripmi" passwd="XXXXX/> >>> >> login="test" name="hpipmi" passwd="XXXXX"/> >>> >>> >>> >>> >>> >>> >>> >> ="reboot"/> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >> recovery="relocate"> >>> >> sleeptime="10"/> >>> >>> >>> >>> >>> >>> >>> The network is as follows: >>> >>> 1)Heartbeat (Bonding) between node 1 and node 2 using ethernet cables >>> >>> The ip addresses are 192.168.10.11 and 192.168.10.10 for node 1 and node >>> 2. >>> >>> 2) IPMI. 
This is used for fencing and addresses are 10.10.63.93 and >>> 10.10.63.92 >>> >>> 3) External ethernet connected to 10.10.5.x network. >>> >>> If i do fence_node , then fencing works, >>> However if i physically shutdown active node, the passive node also >>> shutdowns. Even if i do ifdown bond0 (on active node), both node shutdown >>> and have to be physically rebooted. >>> >>> Any thing i am doing wrong ? >>> >>> >>> >>> >>> >>> -- >>> Linux-cluster mailing list >>> Linux-cluster at redhat.com >>> https://www.redhat.com/mailman/listinfo/linux-cluster >>> >> >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From lists at alteeve.ca Tue Sep 9 08:53:15 2014 From: lists at alteeve.ca (Digimer) Date: Tue, 09 Sep 2014 04:53:15 -0400 Subject: [Linux-cluster] Physical shutdown of one node causes both node to crash in active/passive configuration of 2 node RHEL cluster In-Reply-To: References: Message-ID: <540EBFFB.4040100@alteeve.ca> On 09/09/14 03:14 AM, Amjad Syed wrote: > Something is breaking the network during the shutdown, a fence is being called and both nodes are killing the other, causing a dual fence. So you have a set of problems, I think. First, disable acpid on both nodes. Second, change the quoted line (only) to: If I am right, this will mean that 192.168.10.10 will stay up (fence) .11 Third, what bonding mode are you using? I would only use mode=1. Forth, please set the node names to match 'uname -n' on both nodes. Be sure the names translate to the IPs you want (via /etc/hosts, ideally). Fifth, as Sivaji suggested, please put switch(es) between the nodes. If it still tries to fence when a node shuts down (watch /var/log/messages and look for 'fencing node ...'), please paste your logs from both nodes. -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? From alanr at unix.sh Tue Sep 9 13:11:50 2014 From: alanr at unix.sh (Alan Robertson) Date: Tue, 09 Sep 2014 07:11:50 -0600 Subject: [Linux-cluster] [Linux-HA] [RFC] Organizing HA Summit 2015 In-Reply-To: <540D853F.3090109@redhat.com> References: <540D853F.3090109@redhat.com> Message-ID: <540EFC96.7010606@unix.sh> Hi Fabio, Do you know much about the Brno DevConf? I was wondering if the Assimilation Project might be interesting to the audience there. http://assimilationsystems.com/ http://assimproj.org/ It's related to High Availability in that we monitor systems and services with zero configuration - we even use OCF RAs ;-). Because of that, we could eventually intervene in systems - restarting services, or even migrating them. That's not in current plans, but it is technically very possible. But it's so much more than that - and HUGELY scalable - 10K servers without breathing hard, and 100K servers without proxies, etc. It also discovers systems, services, dependencies, switch connections, and lots of other things. Basically everything is done with near-zero configuration. We wind up with a graph database describing everything in great detail - and it's continually up to date. I don't know if you know me, but I founded the Linux-HA project and led it for about 10 years. -- Alan Robertson alanr at unix.sh On 09/08/2014 04:30 AM, Fabio M. 
Di Nitto wrote: > All, > > it's been almost 6 years since we had a face to face meeting for all > developers and vendors involved in Linux HA. > > I'd like to try and organize a new event and piggy-back with DevConf in > Brno [1]. > > DevConf will start Friday the 6th of Feb 2015 in Red Hat Brno offices. > > My suggestion would be to have a 2 days dedicated HA summit the 4th and > the 5th of February. > > The goal for this meeting is to, beside to get to know each other and > all social aspect of those events, tune the directions of the various HA > projects and explore common areas of improvements. > > I am also very open to the idea of extending to 3 days, 1 one dedicated > to customers/users and 2 dedicated to developers, by starting the 3rd. > > Thoughts? > > Fabio > > PS Please hit reply all or include me in CC just to make sure I'll see > an answer :) > > [1] http://devconf.cz/ > _______________________________________________ > Linux-HA mailing list > Linux-HA at lists.linux-ha.org > http://lists.linux-ha.org/mailman/listinfo/linux-ha > See also: http://linux-ha.org/ReportingProblems From fdinitto at redhat.com Tue Sep 9 14:09:16 2014 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Tue, 09 Sep 2014 16:09:16 +0200 Subject: [Linux-cluster] [Linux-HA] [RFC] Organizing HA Summit 2015 In-Reply-To: <540EFC96.7010606@unix.sh> References: <540D853F.3090109@redhat.com> <540EFC96.7010606@unix.sh> Message-ID: <540F0A0C.9080005@redhat.com> Hi Alan, On 09/09/2014 03:11 PM, Alan Robertson wrote: > Hi Fabio, > > Do you know much about the Brno DevConf? It would be my first visit to DevConf so not much really :) > > I was wondering if the Assimilation Project might be interesting to the > audience there. > http://assimilationsystems.com/ > http://assimproj.org/ > > It's related to High Availability in that we monitor systems and > services with zero configuration - we even use OCF RAs ;-). Because of > that, we could eventually intervene in systems - restarting services, or > even migrating them. That's not in current plans, but it is technically > very possible. I don't see why not. HA Summit != pacemaker ;) Having a pool of presentations from other HA related project would be cool. > > But it's so much more than that - and HUGELY scalable - 10K servers > without breathing hard, and 100K servers without proxies, etc. It also > discovers systems, services, dependencies, switch connections, and lots > of other things. Basically everything is done with near-zero > configuration. We wind up with a graph database describing everything > in great detail - and it's continually up to date. sounds interesting. Would you be willing to join us for a presentation/demo? > > I don't know if you know me, but I founded the Linux-HA project and led > it for about 10 years. Yeps, your name is very well known :) Cheers Fabio > > -- Alan Robertson > alanr at unix.sh > > > On 09/08/2014 04:30 AM, Fabio M. Di Nitto wrote: >> All, >> >> it's been almost 6 years since we had a face to face meeting for all >> developers and vendors involved in Linux HA. >> >> I'd like to try and organize a new event and piggy-back with DevConf in >> Brno [1]. >> >> DevConf will start Friday the 6th of Feb 2015 in Red Hat Brno offices. >> >> My suggestion would be to have a 2 days dedicated HA summit the 4th and >> the 5th of February. 
>> >> The goal for this meeting is to, beside to get to know each other and >> all social aspect of those events, tune the directions of the various HA >> projects and explore common areas of improvements. >> >> I am also very open to the idea of extending to 3 days, 1 one dedicated >> to customers/users and 2 dedicated to developers, by starting the 3rd. >> >> Thoughts? >> >> Fabio >> >> PS Please hit reply all or include me in CC just to make sure I'll see >> an answer :) >> >> [1] http://devconf.cz/ >> _______________________________________________ >> Linux-HA mailing list >> Linux-HA at lists.linux-ha.org >> http://lists.linux-ha.org/mailman/listinfo/linux-ha >> See also: http://linux-ha.org/ReportingProblems From alanr at unix.sh Tue Sep 9 16:31:41 2014 From: alanr at unix.sh (Alan Robertson) Date: Tue, 09 Sep 2014 10:31:41 -0600 Subject: [Linux-cluster] [Linux-HA] [RFC] Organizing HA Summit 2015 In-Reply-To: <540F0A0C.9080005@redhat.com> References: <540D853F.3090109@redhat.com> <540EFC96.7010606@unix.sh> <540F0A0C.9080005@redhat.com> Message-ID: <540F2B6D.2000606@unix.sh> My apologizes for spamming everyone. I thought I deleted all the other email addresses. I failed. Apologies :-( -- Alan Robertson alanr at unix.sh From fdinitto at redhat.com Tue Sep 9 16:36:51 2014 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Tue, 09 Sep 2014 18:36:51 +0200 Subject: [Linux-cluster] [Linux-HA] [RFC] Organizing HA Summit 2015 In-Reply-To: <540F2B6D.2000606@unix.sh> References: <540D853F.3090109@redhat.com> <540EFC96.7010606@unix.sh> <540F0A0C.9080005@redhat.com> <540F2B6D.2000606@unix.sh> Message-ID: <540F2CA3.6050407@redhat.com> On 09/09/2014 06:31 PM, Alan Robertson wrote: > My apologizes for spamming everyone. > > I thought I deleted all the other email addresses. > > I failed. > > Apologies :-( I think it's good that we have an open discussion with all parties involved. I hardly fail to see that as an issue. Apologies not accepted ;) Fabio From lists at alteeve.ca Tue Sep 9 17:00:07 2014 From: lists at alteeve.ca (Digimer) Date: Tue, 09 Sep 2014 13:00:07 -0400 Subject: [Linux-cluster] [Pacemaker] [Linux-HA] [RFC] Organizing HA Summit 2015 In-Reply-To: <540F2CA3.6050407@redhat.com> References: <540D853F.3090109@redhat.com> <540EFC96.7010606@unix.sh> <540F0A0C.9080005@redhat.com> <540F2B6D.2000606@unix.sh> <540F2CA3.6050407@redhat.com> Message-ID: <540F3217.2000502@alteeve.ca> On 09/09/14 12:36 PM, Fabio M. Di Nitto wrote: > On 09/09/2014 06:31 PM, Alan Robertson wrote: >> My apologizes for spamming everyone. >> >> I thought I deleted all the other email addresses. >> >> I failed. >> >> Apologies :-( > > > I think it's good that we have an open discussion with all parties > involved. I hardly fail to see that as an issue. > > Apologies not accepted ;) > > Fabio +1 to err'ing on the side of too much talk. :) -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? 
From amjadcsu at gmail.com Wed Sep 10 08:28:00 2014 From: amjadcsu at gmail.com (Amjad Syed) Date: Wed, 10 Sep 2014 11:28:00 +0300 Subject: [Linux-cluster] Physical shutdown of one node causes both node to crash in active/passive configuration of 2 node RHEL cluster In-Reply-To: <540EBFFB.4040100@alteeve.ca> References: <540EBFFB.4040100@alteeve.ca> Message-ID: On Tue, Sep 9, 2014 at 11:53 AM, Digimer wrote: > On 09/09/14 03:14 AM, Amjad Syed wrote: > >> >> > > Something is breaking the network during the shutdown, a fence is being > called and both nodes are killing the other, causing a dual fence. So you > have a set of problems, I think. > > First, disable acpid on both nodes. > > Second, change the quoted line (only) to: > > > > If I am right, this will mean that 192.168.10.10 will stay up (fence) .11 > > Third, what bonding mode are you using? I would only use mode=1. > > Forth, please set the node names to match 'uname -n' on both nodes. Be > sure the names translate to the IPs you want (via /etc/hosts, ideally). > > Fifth, as Sivaji suggested, please put switch(es) between the nodes. > > If it still tries to fence when a node shuts down (watch /var/log/messages > and look for 'fencing node ...'), please paste your logs from both nodes. > > -- > Digimer > Papers and Projects: https://alteeve.ca/w/ > What if the cure for cancer is trapped in the mind of a person without > access to education? > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... URL: From amjadcsu at gmail.com Wed Sep 10 09:00:27 2014 From: amjadcsu at gmail.com (Amjad Syed) Date: Wed, 10 Sep 2014 12:00:27 +0300 Subject: [Linux-cluster] Physical shutdown of one node causes both node to crash in active/passive configuration of 2 node RHEL cluster In-Reply-To: References: <540EBFFB.4040100@alteeve.ca> Message-ID: Digimer, I have applied the changes but looks like it goes into fence loop. That means when node 1 is running cman and when reboot node2, it fences node1 and they get into a loop 1) On both nodes acpid is off krplporcl001 ~]# service acpid status acpid is stopped krplporcl002 ~]# service acpid status acpid is stopped 2) Changes in cluster .conf < 3) Bonding uses mode = 1 only on krplporcl001 : *DEVICE=bond0* *IPADDR=192.168.10.10* *NETMASK=255.255.255.0* *NETWORK=192.168.10.0* *BROADCAST=192.168.10.255* *BOOTPROTO=none* *Type=Ethernet* *ONBOOT=yes* *BONDING_OPTS='miimon=100 mode=1'* on krplporcl002 *DEVICE=bond0* *IPADDR=192.168.10.11* *NETMASK=255.255.255.0* *NETWORK=192.168.10.0* *BROADCAST=192.168.10.255* *BOOTPROTO=none* *Type=Ethernet* *ONBOOT=yes* *BONDING_OPTS='miimon=100 mode=1'* ~ 4) I have put one switch as sivaji suggested As soon as The logs on klrplporcl001 are as follows Sep 10 11:47:53 krplporcl001 fenced[5977]: fencing node krplporcl002 The logs on krplporcl002 are as follows : Sep 10 11:46:48 krplporcl002 fenced[2950]: fencing node krplporcl001 I am not sure why the network is breaking and why both nodes can not communicate with each other? Any places to look for logs etc? On Wed, Sep 10, 2014 at 11:28 AM, Amjad Syed wrote: > > > On Tue, Sep 9, 2014 at 11:53 AM, Digimer wrote: > >> On 09/09/14 03:14 AM, Amjad Syed wrote: >> >>> >>> >> >> Something is breaking the network during the shutdown, a fence is being >> called and both nodes are killing the other, causing a dual fence. So you >> have a set of problems, I think. 
>> >> First, disable acpid on both nodes. >> >> Second, change the quoted line (only) to: >> >> >> >> If I am right, this will mean that 192.168.10.10 will stay up (fence) .11 >> >> Third, what bonding mode are you using? I would only use mode=1. >> >> Forth, please set the node names to match 'uname -n' on both nodes. Be >> sure the names translate to the IPs you want (via /etc/hosts, ideally). >> >> Fifth, as Sivaji suggested, please put switch(es) between the nodes. >> >> If it still tries to fence when a node shuts down (watch >> /var/log/messages and look for 'fencing node ...'), please paste your logs >> from both nodes. >> >> -- >> Digimer >> Papers and Projects: https://alteeve.ca/w/ >> What if the cure for cancer is trapped in the mind of a person without >> access to education? >> >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From Mark.Vallevand at UNISYS.com Tue Sep 16 21:20:25 2014 From: Mark.Vallevand at UNISYS.com (Vallevand, Mark K) Date: Tue, 16 Sep 2014 16:20:25 -0500 Subject: [Linux-cluster] Cman (and corosync) starting before network interface is ready Message-ID: <99C8B2929B39C24493377AC7A121E21FFEA7CE969D@USEA-EXCH8.na.uis.unisys.com> It looks like there is some odd delay in getting a network interface up and ready. So, when cman starts corosync, it can't get to the cluster. So, for a time, the node is a member of a cluster-of-one. The cluster-of-one begins starting resources. A few seconds later, when the interface finally is up and ready, it takes about 30 more seconds for the cluster-of-one to finally rejoin the larger cluster. The doubly-started resources are sorted out and all ends up OK. Now, this is not a good thing to have these particular resources running twice. I'd really like the clustering software to behave better. But, I'm not sure what 'behave better' would be. Is it possible to introduce a delay into cman or corosync startup? Is that even wise? Is there a parameter to get the clustering software to poll more often when it can't rejoin the cluster? Any suggestions would be welcome. Running Ubuntu 12.04 LTS. Pacemaker 1.1.6. Cman 3.1.7. Corosync 1.4.2. Regards. Mark K Vallevand "If there are no dogs in Heaven, then when I die I want to go where they went." -Will Rogers THIS COMMUNICATION MAY CONTAIN CONFIDENTIAL AND/OR OTHERWISE PROPRIETARY MATERIAL and is thus for use only by the intended recipient. If you received this in error, please contact the sender and delete the e-mail and its attachments from all computers. -------------- next part -------------- An HTML attachment was scrubbed... URL: From andrew at beekhof.net Wed Sep 17 01:51:16 2014 From: andrew at beekhof.net (Andrew Beekhof) Date: Wed, 17 Sep 2014 11:51:16 +1000 Subject: [Linux-cluster] Cman (and corosync) starting before network interface is ready In-Reply-To: <99C8B2929B39C24493377AC7A121E21FFEA7CE969D@USEA-EXCH8.na.uis.unisys.com> References: <99C8B2929B39C24493377AC7A121E21FFEA7CE969D@USEA-EXCH8.na.uis.unisys.com> Message-ID: <52D657CE-40C3-45C6-A50D-73D83CA5930F@beekhof.net> On 17 Sep 2014, at 7:20 am, Vallevand, Mark K wrote: > It looks like there is some odd delay in getting a network interface up and ready. So, when cman starts corosync, it can?t get to the cluster. So, for a time, the node is a member of a cluster-of-one. The cluster-of-one begins starting resources. 1. 
enable two-node mode in cluster.conf (man page should indicate where/how) then disable no-quorum-policy=ignore 2. configure fencing 3. find a newer version of pacemaker, we're up to .12 now > A few seconds later, when the interface finally is up and ready, it takes about 30 more seconds for the cluster-of-one to finally rejoin the larger cluster. The doubly-started resources are sorted out and all ends up OK. > > Now, this is not a good thing to have these particular resources running twice. I?d really like the clustering software to behave better. But, I?m not sure what ?behave better? would be. > > Is it possible to introduce a delay into cman or corosync startup? Is that even wise? > Is there a parameter to get the clustering software to poll more often when it can?t rejoin the cluster? > > Any suggestions would be welcome. > > Running Ubuntu 12.04 LTS. Pacemaker 1.1.6. Cman 3.1.7. Corosync 1.4.2. > > Regards. > Mark K Vallevand > "If there are no dogs in Heaven, then when I die I want to go where they went." > -Will Rogers > > THIS COMMUNICATION MAY CONTAIN CONFIDENTIAL AND/OR OTHERWISE PROPRIETARY MATERIAL and is thus for use only by the intended recipient. If you received this in error, please contact the sender and delete the e-mail and its attachments from all computers. > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 841 bytes Desc: Message signed with OpenPGP using GPGMail URL: From wferi at niif.hu Wed Sep 17 11:36:44 2014 From: wferi at niif.hu (Ferenc Wagner) Date: Wed, 17 Sep 2014 13:36:44 +0200 Subject: [Linux-cluster] transition graph elements Message-ID: <87sijq8omr.fsf@lant.ki.iif.hu> Hi, Some cluster configuration helpers here do some simple transition graph analysis (no action planned or single resource start/restart). The information source is crm_simulate --save-graph. It works pretty well, but recently, after switching on utilization based resource placement, load_stopped_* pseudo events appeared in the graph even when it was beforehand an empty . The workaround was obvious, but I guess it's high time to seek out some definitive documentation about the transition graph XML. Is there anything of that sort available somewhere? If not, which part of the source shall I start looking at? -- Thanks, Feri. From Mark.Vallevand at UNISYS.com Wed Sep 17 14:34:50 2014 From: Mark.Vallevand at UNISYS.com (Vallevand, Mark K) Date: Wed, 17 Sep 2014 09:34:50 -0500 Subject: [Linux-cluster] Cman (and corosync) starting before network interface is ready In-Reply-To: <52D657CE-40C3-45C6-A50D-73D83CA5930F@beekhof.net> References: <99C8B2929B39C24493377AC7A121E21FFEA7CE969D@USEA-EXCH8.na.uis.unisys.com> <52D657CE-40C3-45C6-A50D-73D83CA5930F@beekhof.net> Message-ID: <99C8B2929B39C24493377AC7A121E21FFEA7CE9CF9@USEA-EXCH8.na.uis.unisys.com> Thanks. 1. I didn't know about two-node mode. Thanks. We are testing with two nodes and "crm configure property no-quorum-policy=ignore". When one node goes down, the other node continues clustering. This is the desired behavior. What will in cluster.conf do? 2. Yes, fencing is part of our plan, but not at this time. In the configurations we are testing, fencing is a RFPITA. 3. We could move up. We like Ubuntu 12.04 LTS because it is Long Term Support. But, we've upgraded packages as necessary. 
So, if we move to the latest stable Pacemaker, Cman and Corosync (and others?), how could this help? Is there a way to get the clustering software to 'poll' faster? I mean, this NIC stalling at boot time only lasts about 2 seconds beyond the start of corosync. But, its 30 more seconds before the nodes see each other. I see lots of parameters in the totem directive that seem interesting. Would any of them be appropriate. Andrew: Thanks for the prompt response. Regards. Mark K Vallevand "If there are no dogs in Heaven, then when I die I want to go where they went." -Will Rogers THIS COMMUNICATION MAY CONTAIN CONFIDENTIAL AND/OR OTHERWISE PROPRIETARY MATERIAL and is thus for use only by the intended recipient. If you received this in error, please contact the sender and delete the e-mail and its attachments from all computers. -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Andrew Beekhof Sent: Tuesday, September 16, 2014 08:51 PM To: linux clustering Subject: Re: [Linux-cluster] Cman (and corosync) starting before network interface is ready On 17 Sep 2014, at 7:20 am, Vallevand, Mark K wrote: > It looks like there is some odd delay in getting a network interface up and ready. So, when cman starts corosync, it can't get to the cluster. So, for a time, the node is a member of a cluster-of-one. The cluster-of-one begins starting resources. 1. enable two-node mode in cluster.conf (man page should indicate where/how) then disable no-quorum-policy=ignore 2. configure fencing 3. find a newer version of pacemaker, we're up to .12 now > A few seconds later, when the interface finally is up and ready, it takes about 30 more seconds for the cluster-of-one to finally rejoin the larger cluster. The doubly-started resources are sorted out and all ends up OK. > > Now, this is not a good thing to have these particular resources running twice. I'd really like the clustering software to behave better. But, I'm not sure what 'behave better' would be. > > Is it possible to introduce a delay into cman or corosync startup? Is that even wise? > Is there a parameter to get the clustering software to poll more often when it can't rejoin the cluster? > > Any suggestions would be welcome. > > Running Ubuntu 12.04 LTS. Pacemaker 1.1.6. Cman 3.1.7. Corosync 1.4.2. > > Regards. > Mark K Vallevand > "If there are no dogs in Heaven, then when I die I want to go where they went." > -Will Rogers > > THIS COMMUNICATION MAY CONTAIN CONFIDENTIAL AND/OR OTHERWISE PROPRIETARY MATERIAL and is thus for use only by the intended recipient. If you received this in error, please contact the sender and delete the e-mail and its attachments from all computers. 
> -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From teigland at redhat.com Wed Sep 17 14:53:13 2014 From: teigland at redhat.com (David Teigland) Date: Wed, 17 Sep 2014 09:53:13 -0500 Subject: [Linux-cluster] Cman (and corosync) starting before network interface is ready In-Reply-To: <99C8B2929B39C24493377AC7A121E21FFEA7CE9CF9@USEA-EXCH8.na.uis.unisys.com> References: <99C8B2929B39C24493377AC7A121E21FFEA7CE969D@USEA-EXCH8.na.uis.unisys.com> <52D657CE-40C3-45C6-A50D-73D83CA5930F@beekhof.net> <99C8B2929B39C24493377AC7A121E21FFEA7CE9CF9@USEA-EXCH8.na.uis.unisys.com> Message-ID: <20140917145313.GA23130@redhat.com> On Wed, Sep 17, 2014 at 09:34:50AM -0500, Vallevand, Mark K wrote: > this NIC stalling at boot time only lasts about 2 seconds beyond the > start of corosync. But, its 30 more seconds before the nodes see each > other. This is a common problem, and can often be fixed by increasing FENCED_MEMBER_DELAY in init.d/cman to the time it takes the nodes to converge. From Mark.Vallevand at UNISYS.com Wed Sep 17 15:06:34 2014 From: Mark.Vallevand at UNISYS.com (Vallevand, Mark K) Date: Wed, 17 Sep 2014 10:06:34 -0500 Subject: [Linux-cluster] Cman (and corosync) starting before network interface is ready In-Reply-To: <20140917145313.GA23130@redhat.com> References: <99C8B2929B39C24493377AC7A121E21FFEA7CE969D@USEA-EXCH8.na.uis.unisys.com> <52D657CE-40C3-45C6-A50D-73D83CA5930F@beekhof.net> <99C8B2929B39C24493377AC7A121E21FFEA7CE9CF9@USEA-EXCH8.na.uis.unisys.com> <20140917145313.GA23130@redhat.com> Message-ID: <99C8B2929B39C24493377AC7A121E21FFEA7CE9DD2@USEA-EXCH8.na.uis.unisys.com> So, set that in /etc/default/cman? It currently defaults to 45 seconds. Set it to a longer value? Not sure how that will help. There is also FENCE_JOIN_TIMEOUT, which defaults to 20 seconds. Would changing it help? But, it's easy to try things. Thanks! Regards. Mark K Vallevand "If there are no dogs in Heaven, then when I die I want to go where they went." -Will Rogers THIS COMMUNICATION MAY CONTAIN CONFIDENTIAL AND/OR OTHERWISE PROPRIETARY MATERIAL and is thus for use only by the intended recipient. If you received this in error, please contact the sender and delete the e-mail and its attachments from all computers. -----Original Message----- From: David Teigland [mailto:teigland at redhat.com] Sent: Wednesday, September 17, 2014 09:53 AM To: Vallevand, Mark K Cc: linux clustering Subject: Re: [Linux-cluster] Cman (and corosync) starting before network interface is ready On Wed, Sep 17, 2014 at 09:34:50AM -0500, Vallevand, Mark K wrote: > this NIC stalling at boot time only lasts about 2 seconds beyond the > start of corosync. But, its 30 more seconds before the nodes see each > other. This is a common problem, and can often be fixed by increasing FENCED_MEMBER_DELAY in init.d/cman to the time it takes the nodes to converge. From Mark.Vallevand at UNISYS.com Wed Sep 17 15:20:54 2014 From: Mark.Vallevand at UNISYS.com (Vallevand, Mark K) Date: Wed, 17 Sep 2014 10:20:54 -0500 Subject: [Linux-cluster] Cman (and corosync) starting before network interface is ready In-Reply-To: <99C8B2929B39C24493377AC7A121E21FFEA7CE969D@USEA-EXCH8.na.uis.unisys.com> References: <99C8B2929B39C24493377AC7A121E21FFEA7CE969D@USEA-EXCH8.na.uis.unisys.com> Message-ID: <99C8B2929B39C24493377AC7A121E21FFEA7CE9E32@USEA-EXCH8.na.uis.unisys.com> Tried replacing the switch with a crossover cable. The problem goes away. It looks like there is some odd delay in the switch. 
The NIC is configured, but it takes 4 seconds for the link to go up. Huh. We have a dedicated network for all the cluster traffic. Nothing else uses it. In the two-node case, we use a cable. In larger clusters we will use a switch. First delivery is for two-node clusters. But, I worry about that slow switch. Regards. Mark K Vallevand "If there are no dogs in Heaven, then when I die I want to go where they went." -Will Rogers THIS COMMUNICATION MAY CONTAIN CONFIDENTIAL AND/OR OTHERWISE PROPRIETARY MATERIAL and is thus for use only by the intended recipient. If you received this in error, please contact the sender and delete the e-mail and its attachments from all computers. From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Vallevand, Mark K Sent: Tuesday, September 16, 2014 04:20 PM To: linux clustering Subject: [Linux-cluster] Cman (and corosync) starting before network interface is ready It looks like there is some odd delay in getting a network interface up and ready. So, when cman starts corosync, it can't get to the cluster. So, for a time, the node is a member of a cluster-of-one. The cluster-of-one begins starting resources. A few seconds later, when the interface finally is up and ready, it takes about 30 more seconds for the cluster-of-one to finally rejoin the larger cluster. The doubly-started resources are sorted out and all ends up OK. Now, this is not a good thing to have these particular resources running twice. I'd really like the clustering software to behave better. But, I'm not sure what 'behave better' would be. Is it possible to introduce a delay into cman or corosync startup? Is that even wise? Is there a parameter to get the clustering software to poll more often when it can't rejoin the cluster? Any suggestions would be welcome. Running Ubuntu 12.04 LTS. Pacemaker 1.1.6. Cman 3.1.7. Corosync 1.4.2. Regards. Mark K Vallevand "If there are no dogs in Heaven, then when I die I want to go where they went." -Will Rogers THIS COMMUNICATION MAY CONTAIN CONFIDENTIAL AND/OR OTHERWISE PROPRIETARY MATERIAL and is thus for use only by the intended recipient. If you received this in error, please contact the sender and delete the e-mail and its attachments from all computers. -------------- next part -------------- An HTML attachment was scrubbed... URL: From Nick.Fisk at sys-pro.co.uk Wed Sep 17 15:45:16 2014 From: Nick.Fisk at sys-pro.co.uk (Nick Fisk) Date: Wed, 17 Sep 2014 15:45:16 +0000 Subject: [Linux-cluster] Wrong Variable name in iSCSILogicalUnit Message-ID: Hi, I have been trying to create a HA iSCSILogicalUnit resource and think I have come across a bug caused by a wrong variable name. I have been using the master branch from cluster labs for my iSCSILogicalUnit resource agent running on Ubuntu 14.04. Whilst the LUN and Target are correctly created by the agent when stopping the agent it was only removing the target, which cleared the LUN but left the iBlock device. This was then locking the underlying block device as it was still in use. After spedning a fair while trawling through the agent I beleive I have discovered the problem, at least the change I made has fixed it for me. In the monitor and stop actions there is a check which uses the wrong variable, OCF_RESKEY_INSTANCE instead of OCF_RESOURCE_INSTANCE. I also found a "#{" in front of one of the variables that prepares the path string for removing the LUN. 
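To illustrate the variable mix-up just described: OCF_RESOURCE_INSTANCE is exported by the cluster manager for every agent invocation, while OCF_RESKEY_INSTANCE would only exist if the agent had a parameter literally named "instance", so it normally expands to an empty string and the existence check can never fire. A small sketch, with the lun0 name and iblock_0 path made up for illustration:

OCF_RESOURCE_INSTANCE=lun0   # hypothetical resource instance name
lio_iblock=0
bad="/sys/kernel/config/target/core/iblock_${lio_iblock}/${OCF_RESKEY_INSTANCE}/udev_path"
echo "$bad"    # .../iblock_0//udev_path : empty path component, so [ -e ] never succeeds
good="/sys/kernel/config/target/core/iblock_${lio_iblock}/${OCF_RESOURCE_INSTANCE}/udev_path"
echo "$good"   # .../iblock_0/lun0/udev_path : the path the stop action should be checking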
I have also added a few more log entries to give a clearer picture of what is happening during removal, which made the debugging process much easier. Below is a Diff which seems to fix the problem for me:- +++ /usr/lib/ocf/resource.d/heartbeat/iSCSILogicalUnit 2014-09-17 16:40:23.208764599 +0100 @@ -419,12 +419,14 @@ ${initiator} ${OCF_RESKEY_lun} || exit $OCF_ERR_GENERIC fi done - lun_configfs_path="/sys/kernel/config/target/iscsi/${OCF_RESKEY_target_iqn}/tpgt_1/lun/lun_#{${OCF_RESKEY_lun}/" + lun_configfs_path="/sys/kernel/config/target/iscsi/${OCF_RESKEY_target_iqn}/tpgt_1/lun/lun_${OCF_RESKEY_lun}/" if [ -e "${lun_configfs_path}" ]; then + ocf_log info "Deleting LUN ${OCF_RESKEY_target_iqn}/${OCF_RESKEY_lun}" ocf_run lio_node --dellun=${OCF_RESKEY_target_iqn} 1 ${OCF_RESKEY_lun} || exit $OCF_ERR_GENERIC fi - block_configfs_path="/sys/kernel/config/target/core/iblock_${OCF_RESKEY_lio_iblock}/${OCF_RESKEY_INSTANCE}/udev_path" + block_configfs_path="/sys/kernel/config/target/core/iblock_${OCF_RESKEY_lio_iblock}/${OCF_RESOURCE_INSTANCE}/udev_path" if [ -e "${block_configfs_path}" ]; then + ocf_log info "Deleting iBlock Device iblock_${OCF_RESKEY_lio_iblock}/${OCF_RESOURCE_INSTANCE}" ocf_run tcm_node --freedev=iblock_${OCF_RESKEY_lio_iblock}/${OCF_RESOURCE_INSTANCE} || exit $OCF_ERR_GENERIC fi ;; @@ -478,7 +480,7 @@ [ -e ${configfs_path} ] && [ `cat ${configfs_path}` = "${OCF_RESKEY_path}" ] && return $OCF_SUCCESS # if we aren't activated, is a block device still left over? - block_configfs_path="/sys/kernel/config/target/core/iblock_${OCF_RESKEY_lio_iblock}/${OCF_RESKEY_INSTANCE}/udev_path" + block_configfs_path="/sys/kernel/config/target/core/iblock_${OCF_RESKEY_lio_iblock}/${OCF_RESOURCE_INSTANCE}/udev_path" [ -e ${block_configfs_path} ] && ocf_log warn "existing block without an active lun: ${block_configfs_path}" [ -e ${block_configfs_path} ] && return $OCF_ERR_GENERIC Nick Fisk Technical Support Engineer System Professional Ltd tel: 01825 830000 mob: 07711377522 fax: 01825 830001 mail: Nick.Fisk at sys-pro.co.uk web: www.sys-pro.co.uk IT SUPPORT SERVICES | VIRTUALISATION | STORAGE | BACKUP AND DR | IT CONSULTING Registered Office: Wilderness Barns, Wilderness Lane, Hadlow Down, East Sussex, TN22 4HU Registered in England and Wales. Company Number: 04754200 Confidentiality: This e-mail and its attachments are intended for the above named only and may be confidential. If they have come to you in error you must take no action based on them, nor must you copy or show them to anyone; please reply to this e-mail and highlight the error. Security Warning: Please note that this e-mail has been created in the knowledge that Internet e-mail is not a 100% secure communications medium. We advise that you understand and observe this lack of security when e-mailing us. Viruses: Although we have taken steps to ensure that this e-mail and attachments are free from any virus, we advise that in keeping with good computing practice the recipient should ensure they are actually virus free. Any views expressed in this e-mail message are those of the individual and not necessarily those of the company or any of its subsidiaries. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From ricks at alldigital.com Wed Sep 17 18:51:40 2014 From: ricks at alldigital.com (Rick Stevens) Date: Wed, 17 Sep 2014 11:51:40 -0700 Subject: [Linux-cluster] Cman (and corosync) starting before network interface is ready In-Reply-To: <99C8B2929B39C24493377AC7A121E21FFEA7CE9E32@USEA-EXCH8.na.uis.unisys.com> References: <99C8B2929B39C24493377AC7A121E21FFEA7CE969D@USEA-EXCH8.na.uis.unisys.com> <99C8B2929B39C24493377AC7A121E21FFEA7CE9E32@USEA-EXCH8.na.uis.unisys.com> Message-ID: <5419D83C.9020200@alldigital.com> On 09/17/2014 08:20 AM, Vallevand, Mark K issued this missive: > Tried replacing the switch with a crossover cable. The problem goes > away. It looks like there is some odd delay in the switch. The NIC is > configured, but it takes 4 seconds for the link to go up. Huh. > > We have a dedicated network for all the cluster traffic. Nothing else > uses it. In the two-node case, we use a cable. In larger clusters we > will use a switch. First delivery is for two-node clusters. But, I > worry about that slow switch. Switches have to negotiate speeds, protocols, check for conflicting MACs and several other things (depending on the switch/router). It is possible for that to take a couple of seconds. I'll bet that if you unplug the cable from the switch, then plug it back in, you'll probably notice a slight delay in the port's link LED lighting up as well. Pretty common and not necessarily indicative of a problem. ---------------------------------------------------------------------- - Rick Stevens, Systems Engineer, AllDigital ricks at alldigital.com - - AIM/Skype: therps2 ICQ: 22643734 Yahoo: origrps2 - - - - Never put off 'til tommorrow what you can forget altogether! - ---------------------------------------------------------------------- From fmdlc.unix at gmail.com Wed Sep 17 19:03:51 2014 From: fmdlc.unix at gmail.com (Facundo M. de la Cruz) Date: Wed, 17 Sep 2014 16:03:51 -0300 Subject: [Linux-cluster] Cman (and corosync) starting before network interface is ready In-Reply-To: <5419D83C.9020200@alldigital.com> References: <99C8B2929B39C24493377AC7A121E21FFEA7CE969D@USEA-EXCH8.na.uis.unisys.com> <99C8B2929B39C24493377AC7A121E21FFEA7CE9E32@USEA-EXCH8.na.uis.unisys.com> <5419D83C.9020200@alldigital.com> Message-ID: On Sep 17, 2014, at 15:51, Rick Stevens wrote: > On 09/17/2014 08:20 AM, Vallevand, Mark K issued this missive: >> Tried replacing the switch with a crossover cable. The problem goes >> away. It looks like there is some odd delay in the switch. The NIC is >> configured, but it takes 4 seconds for the link to go up. Huh. >> >> We have a dedicated network for all the cluster traffic. Nothing else >> uses it. In the two-node case, we use a cable. In larger clusters we >> will use a switch. First delivery is for two-node clusters. But, I >> worry about that slow switch. > > Switches have to negotiate speeds, protocols, check for conflicting MACs and several other things (depending on the switch/router). It is > possible for that to take a couple of seconds. > > I'll bet that if you unplug the cable from the switch, then plug it > back in, you'll probably notice a slight delay in the port's link LED > lighting up as well. Pretty common and not necessarily indicative of a > problem. > ---------------------------------------------------------------------- > - Rick Stevens, Systems Engineer, AllDigital ricks at alldigital.com - > - AIM/Skype: therps2 ICQ: 22643734 Yahoo: origrps2 - > - - > - Never put off 'til tommorrow what you can forget altogether! 
- > ---------------------------------------------------------------------- > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster Hi everyone, Just let me ask one small thing. Did you enable Spanning Tree Protocol on the interconnect switch? STP is not compatible with TOTEM RRP, it?s because STP is flapping all the time between BLOCKED / FORWARDING state on the port, then TOTEM will be not able to transmit heartbeat packages and when you get a number of four TOTEM error (an error is a time ~238 ms + overhead) the node can be fenced or can raise issue like this. Remember configure all the interconnect ports in the same multicast group too. Bests regards. -- Facundo M. de la Cruz (tty0) Information Technology Specialist Movil: +54 911 56528301 http://codigounix.blogspot.com/ http://twitter.com/_tty0 GPG fingerprint: DF2F 514A 5167 00F5 C753 BF3B D797 C8E1 5726 0789 "Programming today is a race between software engineers striving to build bigger and better idiot-proof programs, and the Universe trying to produce bigger and better idiots. So far, the Universe is winning.? - Rich Cook -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 801 bytes Desc: Message signed with OpenPGP using GPGMail URL: From Mark.Vallevand at UNISYS.com Wed Sep 17 19:07:13 2014 From: Mark.Vallevand at UNISYS.com (Vallevand, Mark K) Date: Wed, 17 Sep 2014 14:07:13 -0500 Subject: [Linux-cluster] Cman (and corosync) starting before network interface is ready In-Reply-To: <99C8B2929B39C24493377AC7A121E21FFEA7CE9CF9@USEA-EXCH8.na.uis.unisys.com> References: <99C8B2929B39C24493377AC7A121E21FFEA7CE969D@USEA-EXCH8.na.uis.unisys.com> <52D657CE-40C3-45C6-A50D-73D83CA5930F@beekhof.net> <99C8B2929B39C24493377AC7A121E21FFEA7CE9CF9@USEA-EXCH8.na.uis.unisys.com> Message-ID: <99C8B2929B39C24493377AC7A121E21FFEA7D90DF8@USEA-EXCH8.na.uis.unisys.com> Oops. In number 2, I read fencing as STONITH. My bad. I think some form of fencing is configured. My cluster.conf file has this in it: Does that configure fencing? I'm considering adding this to the cluster.conf: This raises the initial join delay when clustering starts. Default is 6 seconds. 6 seconds kind of matches what I am seeing when clustering starts and the NIC link is slow to go up. Regards. Mark K Vallevand "If there are no dogs in Heaven, then when I die I want to go where they went." -Will Rogers THIS COMMUNICATION MAY CONTAIN CONFIDENTIAL AND/OR OTHERWISE PROPRIETARY MATERIAL and is thus for use only by the intended recipient. If you received this in error, please contact the sender and delete the e-mail and its attachments from all computers. -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Vallevand, Mark K Sent: Wednesday, September 17, 2014 09:35 AM To: linux clustering Subject: Re: [Linux-cluster] Cman (and corosync) starting before network interface is ready Thanks. 1. I didn't know about two-node mode. Thanks. We are testing with two nodes and "crm configure property no-quorum-policy=ignore". When one node goes down, the other node continues clustering. This is the desired behavior. What will in cluster.conf do? 2. Yes, fencing is part of our plan, but not at this time. In the configurations we are testing, fencing is a RFPITA. 3. We could move up. We like Ubuntu 12.04 LTS because it is Long Term Support. 
But, we've upgraded packages as necessary. So, if we move to the latest stable Pacemaker, Cman and Corosync (and others?), how could this help? Is there a way to get the clustering software to 'poll' faster? I mean, this NIC stalling at boot time only lasts about 2 seconds beyond the start of corosync. But, its 30 more seconds before the nodes see each other. I see lots of parameters in the totem directive that seem interesting. Would any of them be appropriate. Andrew: Thanks for the prompt response. Regards. Mark K Vallevand "If there are no dogs in Heaven, then when I die I want to go where they went." -Will Rogers THIS COMMUNICATION MAY CONTAIN CONFIDENTIAL AND/OR OTHERWISE PROPRIETARY MATERIAL and is thus for use only by the intended recipient. If you received this in error, please contact the sender and delete the e-mail and its attachments from all computers. -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Andrew Beekhof Sent: Tuesday, September 16, 2014 08:51 PM To: linux clustering Subject: Re: [Linux-cluster] Cman (and corosync) starting before network interface is ready On 17 Sep 2014, at 7:20 am, Vallevand, Mark K wrote: > It looks like there is some odd delay in getting a network interface up and ready. So, when cman starts corosync, it can't get to the cluster. So, for a time, the node is a member of a cluster-of-one. The cluster-of-one begins starting resources. 1. enable two-node mode in cluster.conf (man page should indicate where/how) then disable no-quorum-policy=ignore 2. configure fencing 3. find a newer version of pacemaker, we're up to .12 now > A few seconds later, when the interface finally is up and ready, it takes about 30 more seconds for the cluster-of-one to finally rejoin the larger cluster. The doubly-started resources are sorted out and all ends up OK. > > Now, this is not a good thing to have these particular resources running twice. I'd really like the clustering software to behave better. But, I'm not sure what 'behave better' would be. > > Is it possible to introduce a delay into cman or corosync startup? Is that even wise? > Is there a parameter to get the clustering software to poll more often when it can't rejoin the cluster? > > Any suggestions would be welcome. > > Running Ubuntu 12.04 LTS. Pacemaker 1.1.6. Cman 3.1.7. Corosync 1.4.2. > > Regards. > Mark K Vallevand > "If there are no dogs in Heaven, then when I die I want to go where they went." > -Will Rogers > > THIS COMMUNICATION MAY CONTAIN CONFIDENTIAL AND/OR OTHERWISE PROPRIETARY MATERIAL and is thus for use only by the intended recipient. If you received this in error, please contact the sender and delete the e-mail and its attachments from all computers. 
> -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From Mark.Vallevand at UNISYS.com Wed Sep 17 19:19:11 2014 From: Mark.Vallevand at UNISYS.com (Vallevand, Mark K) Date: Wed, 17 Sep 2014 14:19:11 -0500 Subject: [Linux-cluster] Cman (and corosync) starting before network interface is ready In-Reply-To: References: <99C8B2929B39C24493377AC7A121E21FFEA7CE969D@USEA-EXCH8.na.uis.unisys.com> <99C8B2929B39C24493377AC7A121E21FFEA7CE9E32@USEA-EXCH8.na.uis.unisys.com> <5419D83C.9020200@alldigital.com> Message-ID: <99C8B2929B39C24493377AC7A121E21FFEA7D90E2C@USEA-EXCH8.na.uis.unisys.com> We will look at the STP settings on the switch. However, the switch works fine after link comes up. We suspected STP, too. Regards. Mark K Vallevand "If there are no dogs in Heaven, then when I die I want to go where they went." -Will Rogers THIS COMMUNICATION MAY CONTAIN CONFIDENTIAL AND/OR OTHERWISE PROPRIETARY MATERIAL and is thus for use only by the intended recipient. If you received this in error, please contact the sender and delete the e-mail and its attachments from all computers. -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Facundo M. de la Cruz Sent: Wednesday, September 17, 2014 02:04 PM To: linux clustering Subject: Re: [Linux-cluster] Cman (and corosync) starting before network interface is ready On Sep 17, 2014, at 15:51, Rick Stevens wrote: > On 09/17/2014 08:20 AM, Vallevand, Mark K issued this missive: >> Tried replacing the switch with a crossover cable. The problem goes >> away. It looks like there is some odd delay in the switch. The NIC is >> configured, but it takes 4 seconds for the link to go up. Huh. >> >> We have a dedicated network for all the cluster traffic. Nothing else >> uses it. In the two-node case, we use a cable. In larger clusters we >> will use a switch. First delivery is for two-node clusters. But, I >> worry about that slow switch. > > Switches have to negotiate speeds, protocols, check for conflicting MACs and several other things (depending on the switch/router). It is > possible for that to take a couple of seconds. > > I'll bet that if you unplug the cable from the switch, then plug it > back in, you'll probably notice a slight delay in the port's link LED > lighting up as well. Pretty common and not necessarily indicative of a > problem. > ---------------------------------------------------------------------- > - Rick Stevens, Systems Engineer, AllDigital ricks at alldigital.com - > - AIM/Skype: therps2 ICQ: 22643734 Yahoo: origrps2 - > - - > - Never put off 'til tommorrow what you can forget altogether! - > ---------------------------------------------------------------------- > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster Hi everyone, Just let me ask one small thing. Did you enable Spanning Tree Protocol on the interconnect switch? STP is not compatible with TOTEM RRP, it's because STP is flapping all the time between BLOCKED / FORWARDING state on the port, then TOTEM will be not able to transmit heartbeat packages and when you get a number of four TOTEM error (an error is a time ~238 ms + overhead) the node can be fenced or can raise issue like this. Remember configure all the interconnect ports in the same multicast group too. 
Bests regards. -- Facundo M. de la Cruz (tty0) Information Technology Specialist Movil: +54 911 56528301 http://codigounix.blogspot.com/ http://twitter.com/_tty0 GPG fingerprint: DF2F 514A 5167 00F5 C753 BF3B D797 C8E1 5726 0789 "Programming today is a race between software engineers striving to build bigger and better idiot-proof programs, and the Universe trying to produce bigger and better idiots. So far, the Universe is winning." - Rich Cook From Mark.Vallevand at UNISYS.com Wed Sep 17 19:57:28 2014 From: Mark.Vallevand at UNISYS.com (Vallevand, Mark K) Date: Wed, 17 Sep 2014 14:57:28 -0500 Subject: [Linux-cluster] Cman (and corosync) starting before network interface is ready In-Reply-To: References: <99C8B2929B39C24493377AC7A121E21FFEA7CE969D@USEA-EXCH8.na.uis.unisys.com> <99C8B2929B39C24493377AC7A121E21FFEA7CE9E32@USEA-EXCH8.na.uis.unisys.com> <5419D83C.9020200@alldigital.com> Message-ID: <99C8B2929B39C24493377AC7A121E21FFEA7D90F04@USEA-EXCH8.na.uis.unisys.com> Can I disable Totem RRP and use some thing else? Is there something to make things compatible with STP? Regards. Mark K Vallevand "If there are no dogs in Heaven, then when I die I want to go where they went." -Will Rogers THIS COMMUNICATION MAY CONTAIN CONFIDENTIAL AND/OR OTHERWISE PROPRIETARY MATERIAL and is thus for use only by the intended recipient. If you received this in error, please contact the sender and delete the e-mail and its attachments from all computers. -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Facundo M. de la Cruz Sent: Wednesday, September 17, 2014 02:04 PM To: linux clustering Subject: Re: [Linux-cluster] Cman (and corosync) starting before network interface is ready On Sep 17, 2014, at 15:51, Rick Stevens wrote: > On 09/17/2014 08:20 AM, Vallevand, Mark K issued this missive: >> Tried replacing the switch with a crossover cable. The problem goes >> away. It looks like there is some odd delay in the switch. The NIC is >> configured, but it takes 4 seconds for the link to go up. Huh. >> >> We have a dedicated network for all the cluster traffic. Nothing else >> uses it. In the two-node case, we use a cable. In larger clusters we >> will use a switch. First delivery is for two-node clusters. But, I >> worry about that slow switch. > > Switches have to negotiate speeds, protocols, check for conflicting MACs and several other things (depending on the switch/router). It is > possible for that to take a couple of seconds. > > I'll bet that if you unplug the cable from the switch, then plug it > back in, you'll probably notice a slight delay in the port's link LED > lighting up as well. Pretty common and not necessarily indicative of a > problem. > ---------------------------------------------------------------------- > - Rick Stevens, Systems Engineer, AllDigital ricks at alldigital.com - > - AIM/Skype: therps2 ICQ: 22643734 Yahoo: origrps2 - > - - > - Never put off 'til tommorrow what you can forget altogether! - > ---------------------------------------------------------------------- > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster Hi everyone, Just let me ask one small thing. Did you enable Spanning Tree Protocol on the interconnect switch? 
STP is not compatible with TOTEM RRP, it's because STP is flapping all the time between BLOCKED / FORWARDING state on the port, then TOTEM will be not able to transmit heartbeat packages and when you get a number of four TOTEM error (an error is a time ~238 ms + overhead) the node can be fenced or can raise issue like this. Remember configure all the interconnect ports in the same multicast group too. Bests regards. -- Facundo M. de la Cruz (tty0) Information Technology Specialist Movil: +54 911 56528301 http://codigounix.blogspot.com/ http://twitter.com/_tty0 GPG fingerprint: DF2F 514A 5167 00F5 C753 BF3B D797 C8E1 5726 0789 "Programming today is a race between software engineers striving to build bigger and better idiot-proof programs, and the Universe trying to produce bigger and better idiots. So far, the Universe is winning." - Rich Cook From Mark.Vallevand at UNISYS.com Wed Sep 17 20:35:18 2014 From: Mark.Vallevand at UNISYS.com (Vallevand, Mark K) Date: Wed, 17 Sep 2014 15:35:18 -0500 Subject: [Linux-cluster] Cman (and corosync) starting before network interface is ready In-Reply-To: <99C8B2929B39C24493377AC7A121E21FFEA7D90DF8@USEA-EXCH8.na.uis.unisys.com> References: <99C8B2929B39C24493377AC7A121E21FFEA7CE969D@USEA-EXCH8.na.uis.unisys.com> <52D657CE-40C3-45C6-A50D-73D83CA5930F@beekhof.net> <99C8B2929B39C24493377AC7A121E21FFEA7CE9CF9@USEA-EXCH8.na.uis.unisys.com> <99C8B2929B39C24493377AC7A121E21FFEA7D90DF8@USEA-EXCH8.na.uis.unisys.com> Message-ID: <99C8B2929B39C24493377AC7A121E21FFEA7D90FD9@USEA-EXCH8.na.uis.unisys.com> WooHoo. I added: in cluster.conf and I think it's working. So, what does the two_node do? And, a follow up question: What will happen if crm configure property no-quorum-policy=ignore" is set on clusters with more than 2 nodes? Should I skip that on clusters with more than two nodes? Regards. Mark K Vallevand "If there are no dogs in Heaven, then when I die I want to go where they went." -Will Rogers THIS COMMUNICATION MAY CONTAIN CONFIDENTIAL AND/OR OTHERWISE PROPRIETARY MATERIAL and is thus for use only by the intended recipient. If you received this in error, please contact the sender and delete the e-mail and its attachments from all computers. -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Vallevand, Mark K Sent: Wednesday, September 17, 2014 02:07 PM To: linux clustering Subject: Re: [Linux-cluster] Cman (and corosync) starting before network interface is ready Oops. In number 2, I read fencing as STONITH. My bad. I think some form of fencing is configured. My cluster.conf file has this in it: Does that configure fencing? I'm considering adding this to the cluster.conf: This raises the initial join delay when clustering starts. Default is 6 seconds. 6 seconds kind of matches what I am seeing when clustering starts and the NIC link is slow to go up. Regards. Mark K Vallevand "If there are no dogs in Heaven, then when I die I want to go where they went." -Will Rogers THIS COMMUNICATION MAY CONTAIN CONFIDENTIAL AND/OR OTHERWISE PROPRIETARY MATERIAL and is thus for use only by the intended recipient. If you received this in error, please contact the sender and delete the e-mail and its attachments from all computers. 
-----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Vallevand, Mark K Sent: Wednesday, September 17, 2014 09:35 AM To: linux clustering Subject: Re: [Linux-cluster] Cman (and corosync) starting before network interface is ready Thanks. 1. I didn't know about two-node mode. Thanks. We are testing with two nodes and "crm configure property no-quorum-policy=ignore". When one node goes down, the other node continues clustering. This is the desired behavior. What will in cluster.conf do? 2. Yes, fencing is part of our plan, but not at this time. In the configurations we are testing, fencing is a RFPITA. 3. We could move up. We like Ubuntu 12.04 LTS because it is Long Term Support. But, we've upgraded packages as necessary. So, if we move to the latest stable Pacemaker, Cman and Corosync (and others?), how could this help? Is there a way to get the clustering software to 'poll' faster? I mean, this NIC stalling at boot time only lasts about 2 seconds beyond the start of corosync. But, its 30 more seconds before the nodes see each other. I see lots of parameters in the totem directive that seem interesting. Would any of them be appropriate. Andrew: Thanks for the prompt response. Regards. Mark K Vallevand "If there are no dogs in Heaven, then when I die I want to go where they went." -Will Rogers THIS COMMUNICATION MAY CONTAIN CONFIDENTIAL AND/OR OTHERWISE PROPRIETARY MATERIAL and is thus for use only by the intended recipient. If you received this in error, please contact the sender and delete the e-mail and its attachments from all computers. -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Andrew Beekhof Sent: Tuesday, September 16, 2014 08:51 PM To: linux clustering Subject: Re: [Linux-cluster] Cman (and corosync) starting before network interface is ready On 17 Sep 2014, at 7:20 am, Vallevand, Mark K wrote: > It looks like there is some odd delay in getting a network interface up and ready. So, when cman starts corosync, it can't get to the cluster. So, for a time, the node is a member of a cluster-of-one. The cluster-of-one begins starting resources. 1. enable two-node mode in cluster.conf (man page should indicate where/how) then disable no-quorum-policy=ignore 2. configure fencing 3. find a newer version of pacemaker, we're up to .12 now > A few seconds later, when the interface finally is up and ready, it takes about 30 more seconds for the cluster-of-one to finally rejoin the larger cluster. The doubly-started resources are sorted out and all ends up OK. > > Now, this is not a good thing to have these particular resources running twice. I'd really like the clustering software to behave better. But, I'm not sure what 'behave better' would be. > > Is it possible to introduce a delay into cman or corosync startup? Is that even wise? > Is there a parameter to get the clustering software to poll more often when it can't rejoin the cluster? > > Any suggestions would be welcome. > > Running Ubuntu 12.04 LTS. Pacemaker 1.1.6. Cman 3.1.7. Corosync 1.4.2. > > Regards. > Mark K Vallevand > "If there are no dogs in Heaven, then when I die I want to go where they went." > -Will Rogers > > THIS COMMUNICATION MAY CONTAIN CONFIDENTIAL AND/OR OTHERWISE PROPRIETARY MATERIAL and is thus for use only by the intended recipient. If you received this in error, please contact the sender and delete the e-mail and its attachments from all computers. 
> -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From fmdlc.unix at gmail.com Wed Sep 17 20:41:27 2014 From: fmdlc.unix at gmail.com (Facundo M. de la Cruz) Date: Wed, 17 Sep 2014 17:41:27 -0300 Subject: [Linux-cluster] Cman (and corosync) starting before network interface is ready In-Reply-To: <99C8B2929B39C24493377AC7A121E21FFEA7D90F04@USEA-EXCH8.na.uis.unisys.com> References: <99C8B2929B39C24493377AC7A121E21FFEA7CE969D@USEA-EXCH8.na.uis.unisys.com> <99C8B2929B39C24493377AC7A121E21FFEA7CE9E32@USEA-EXCH8.na.uis.unisys.com> <5419D83C.9020200@alldigital.com> <99C8B2929B39C24493377AC7A121E21FFEA7D90F04@USEA-EXCH8.na.uis.unisys.com> Message-ID: <53E9B255-726F-4C6D-AA57-2D38053F9D7C@gmail.com> On Sep 17, 2014, at 16:57, Vallevand, Mark K wrote: > Can I disable Totem RRP and use some thing else? > Is there something to make things compatible with STP? > > > Regards. > Mark K Vallevand > > "If there are no dogs in Heaven, then when I die I want to go where they went." > -Will Rogers > > THIS COMMUNICATION MAY CONTAIN CONFIDENTIAL AND/OR OTHERWISE PROPRIETARY MATERIAL and is thus for use only by the intended recipient. If you received this in error, please contact the sender and delete the e-mail and its attachments from all computers. > > > -----Original Message----- > From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Facundo M. de la Cruz > Sent: Wednesday, September 17, 2014 02:04 PM > To: linux clustering > Subject: Re: [Linux-cluster] Cman (and corosync) starting before network interface is ready > > > On Sep 17, 2014, at 15:51, Rick Stevens wrote: > >> On 09/17/2014 08:20 AM, Vallevand, Mark K issued this missive: >>> Tried replacing the switch with a crossover cable. The problem goes >>> away. It looks like there is some odd delay in the switch. The NIC is >>> configured, but it takes 4 seconds for the link to go up. Huh. >>> >>> We have a dedicated network for all the cluster traffic. Nothing else >>> uses it. In the two-node case, we use a cable. In larger clusters we >>> will use a switch. First delivery is for two-node clusters. But, I >>> worry about that slow switch. >> >> Switches have to negotiate speeds, protocols, check for conflicting MACs and several other things (depending on the switch/router). It is >> possible for that to take a couple of seconds. >> >> I'll bet that if you unplug the cable from the switch, then plug it >> back in, you'll probably notice a slight delay in the port's link LED >> lighting up as well. Pretty common and not necessarily indicative of a >> problem. >> ---------------------------------------------------------------------- >> - Rick Stevens, Systems Engineer, AllDigital ricks at alldigital.com - >> - AIM/Skype: therps2 ICQ: 22643734 Yahoo: origrps2 - >> - - >> - Never put off 'til tommorrow what you can forget altogether! - >> ---------------------------------------------------------------------- >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster > > Hi everyone, > > Just let me ask one small thing. > Did you enable Spanning Tree Protocol on the interconnect switch? 
> STP is not compatible with TOTEM RRP, it's because STP is flapping all the time between BLOCKED / FORWARDING state on the port, then TOTEM will be not able to transmit heartbeat packages and when you get a number of four TOTEM error (an error is a time ~238 ms + overhead) the node can be fenced or can raise issue like this. > > Remember configure all the interconnect ports in the same multicast group too. > > Bests regards. > > -- > Facundo M. de la Cruz (tty0) > Information Technology Specialist > Movil: +54 911 56528301 > > http://codigounix.blogspot.com/ > http://twitter.com/_tty0 > > GPG fingerprint: DF2F 514A 5167 00F5 C753 BF3B D797 C8E1 5726 0789 > > "Programming today is a race between software engineers striving to build bigger and better idiot-proof programs, and the Universe trying to produce bigger and better idiots. So far, the Universe is winning." - Rich Cook > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster
TOTEM is the protocol used to negotiate the cluster membership. You are not able to disable this protocol; without it you would lose your cluster infrastructure. STP is not compatible with TOTEM, so my best advice is to just disable STP on the interconnect ports. Bests. F-. -- Facundo M. de la Cruz (tty0) Information Technology Specialist Movil: +54 911 56528301 http://codigounix.blogspot.com/ http://twitter.com/_tty0 GPG fingerprint: DF2F 514A 5167 00F5 C753 BF3B D797 C8E1 5726 0789 "Programming today is a race between software engineers striving to build bigger and better idiot-proof programs, and the Universe trying to produce bigger and better idiots. So far, the Universe is winning." - Rich Cook -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 801 bytes Desc: Message signed with OpenPGP using GPGMail URL:
From fmdlc.unix at gmail.com Wed Sep 17 20:44:06 2014 From: fmdlc.unix at gmail.com (Facundo M. de la Cruz) Date: Wed, 17 Sep 2014 17:44:06 -0300 Subject: [Linux-cluster] Cman (and corosync) starting before network interface is ready In-Reply-To: <99C8B2929B39C24493377AC7A121E21FFEA7D90FD9@USEA-EXCH8.na.uis.unisys.com> References: <99C8B2929B39C24493377AC7A121E21FFEA7CE969D@USEA-EXCH8.na.uis.unisys.com> <52D657CE-40C3-45C6-A50D-73D83CA5930F@beekhof.net> <99C8B2929B39C24493377AC7A121E21FFEA7CE9CF9@USEA-EXCH8.na.uis.unisys.com> <99C8B2929B39C24493377AC7A121E21FFEA7D90DF8@USEA-EXCH8.na.uis.unisys.com> <99C8B2929B39C24493377AC7A121E21FFEA7D90FD9@USEA-EXCH8.na.uis.unisys.com> Message-ID: On Sep 17, 2014, at 17:35, Vallevand, Mark K wrote: > WooHoo. > > I added: > > > in cluster.conf and I think it's working. > > So, what does the two_node do? > > And, a follow up question: > What will happen if crm configure property no-quorum-policy=ignore" is set on clusters with more than 2 nodes? > Should I skip that on clusters with more than two nodes? > > > Regards. > Mark K Vallevand > > "If there are no dogs in Heaven, then when I die I want to go where they went." > -Will Rogers > > THIS COMMUNICATION MAY CONTAIN CONFIDENTIAL AND/OR OTHERWISE PROPRIETARY MATERIAL and is thus for use only by the intended recipient. If you received this in error, please contact the sender and delete the e-mail and its attachments from all computers.
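A quick way to see what the two_node setting quoted above actually changes is to read the quorum arithmetic back from cman once both nodes are up; the field names below are the usual cman_tool status ones:

cman_tool status | egrep 'Nodes|Expected votes|Total votes|Node votes|Quorum|Flags'
# With two_node="1" the Flags line should include "2node" and Quorum should stay at 1,
# so a lone surviving node keeps running instead of losing quorum.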
> > > -----Original Message----- > From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Vallevand, Mark K > Sent: Wednesday, September 17, 2014 02:07 PM > To: linux clustering > Subject: Re: [Linux-cluster] Cman (and corosync) starting before network interface is ready > > Oops. In number 2, I read fencing as STONITH. My bad. > I think some form of fencing is configured. > My cluster.conf file has this in it: > > > > Does that configure fencing? > > I'm considering adding this to the cluster.conf: > > > This raises the initial join delay when clustering starts. Default is 6 > seconds. 6 seconds kind of matches what I am seeing when clustering starts > and the NIC link is slow to go up. > > > Regards. > Mark K Vallevand > > "If there are no dogs in Heaven, then when I die I want to go where they went." > -Will Rogers > > THIS COMMUNICATION MAY CONTAIN CONFIDENTIAL AND/OR OTHERWISE PROPRIETARY MATERIAL and is thus for use only by the intended recipient. If you received this in error, please contact the sender and delete the e-mail and its attachments from all computers. > > > -----Original Message----- > From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Vallevand, Mark K > Sent: Wednesday, September 17, 2014 09:35 AM > To: linux clustering > Subject: Re: [Linux-cluster] Cman (and corosync) starting before network interface is ready > > Thanks. > > 1. I didn't know about two-node mode. Thanks. We are testing with two nodes and "crm configure property no-quorum-policy=ignore". When one node goes down, the other node continues clustering. This is the desired behavior. What will in cluster.conf do? > 2. Yes, fencing is part of our plan, but not at this time. In the configurations we are testing, fencing is a RFPITA. > 3. We could move up. We like Ubuntu 12.04 LTS because it is Long Term Support. But, we've upgraded packages as necessary. So, if we move to the latest stable Pacemaker, Cman and Corosync (and others?), how could this help? > > Is there a way to get the clustering software to 'poll' faster? I mean, this NIC stalling at boot time only lasts about 2 seconds beyond the start of corosync. But, its 30 more seconds before the nodes see each other. I see lots of parameters in the totem directive that seem interesting. Would any of them be appropriate. > > Andrew: Thanks for the prompt response. > > > Regards. > Mark K Vallevand > > "If there are no dogs in Heaven, then when I die I want to go where they went." > -Will Rogers > > THIS COMMUNICATION MAY CONTAIN CONFIDENTIAL AND/OR OTHERWISE PROPRIETARY MATERIAL and is thus for use only by the intended recipient. If you received this in error, please contact the sender and delete the e-mail and its attachments from all computers. > > > -----Original Message----- > From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Andrew Beekhof > Sent: Tuesday, September 16, 2014 08:51 PM > To: linux clustering > Subject: Re: [Linux-cluster] Cman (and corosync) starting before network interface is ready > > > On 17 Sep 2014, at 7:20 am, Vallevand, Mark K wrote: > >> It looks like there is some odd delay in getting a network interface up and ready. So, when cman starts corosync, it can't get to the cluster. So, for a time, the node is a member of a cluster-of-one. The cluster-of-one begins starting resources. > > 1. 
enable two-node mode in cluster.conf (man page should indicate where/how) then disable no-quorum-policy=ignore > 2. configure fencing > 3. find a newer version of pacemaker, we're up to .12 now > >> A few seconds later, when the interface finally is up and ready, it takes about 30 more seconds for the cluster-of-one to finally rejoin the larger cluster. The doubly-started resources are sorted out and all ends up OK. >> >> Now, this is not a good thing to have these particular resources running twice. I'd really like the clustering software to behave better. But, I'm not sure what 'behave better' would be. >> >> Is it possible to introduce a delay into cman or corosync startup? Is that even wise? >> Is there a parameter to get the clustering software to poll more often when it can't rejoin the cluster? >> >> Any suggestions would be welcome. >> >> Running Ubuntu 12.04 LTS. Pacemaker 1.1.6. Cman 3.1.7. Corosync 1.4.2. >> >> Regards. >> Mark K Vallevand >> "If there are no dogs in Heaven, then when I die I want to go where they went." >> -Will Rogers >> >> THIS COMMUNICATION MAY CONTAIN CONFIDENTIAL AND/OR OTHERWISE PROPRIETARY MATERIAL and is thus for use only by the intended recipient. If you received this in error, please contact the sender and delete the e-mail and its attachments from all computers. >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster The option two_nodes="1" tells the cluster manager to continue operating with only one vote. This option requires that the expected_votes="" attribute be set to 1, because is you lost one cluster node, you have the another node running yet. Normally, expected_votes is set automatically to the total sum of the defined cluster nodes' votes (which itself is a default of 1). Regards. -- Facundo M. de la Cruz (tty0) Information Technology Specialist Movil: +54 911 56528301 http://codigounix.blogspot.com/ http://twitter.com/_tty0 GPG fingerprint: DF2F 514A 5167 00F5 C753 BF3B D797 C8E1 5726 0789 "Programming today is a race between software engineers striving to build bigger and better idiot-proof programs, and the Universe trying to produce bigger and better idiots. So far, the Universe is winning.? - Rich Cook -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 801 bytes Desc: Message signed with OpenPGP using GPGMail URL: From andrew at beekhof.net Thu Sep 18 01:35:25 2014 From: andrew at beekhof.net (Andrew Beekhof) Date: Thu, 18 Sep 2014 11:35:25 +1000 Subject: [Linux-cluster] Cman (and corosync) starting before network interface is ready In-Reply-To: <99C8B2929B39C24493377AC7A121E21FFEA7CE9CF9@USEA-EXCH8.na.uis.unisys.com> References: <99C8B2929B39C24493377AC7A121E21FFEA7CE969D@USEA-EXCH8.na.uis.unisys.com> <52D657CE-40C3-45C6-A50D-73D83CA5930F@beekhof.net> <99C8B2929B39C24493377AC7A121E21FFEA7CE9CF9@USEA-EXCH8.na.uis.unisys.com> Message-ID: On 18 Sep 2014, at 12:34 am, Vallevand, Mark K wrote: > Thanks. > > 1. I didn't know about two-node mode. Thanks. 
We are testing with two nodes and "crm configure property no-quorum-policy=ignore". When one node goes down, the other node continues clustering. This is the desired behavior. What will in cluster.conf do? I was all set to be a smart-ass and say 'man cluster.conf', but the joke is on me as my colleagues do not appear to have documented it anywhere. Chrissie: Can you elaborate on the details here please? (Short version, it should do what you want) > 2. Yes, fencing is part of our plan, but not at this time. In the configurations we are testing, fencing is a RFPITA. > 3. We could move up. We like Ubuntu 12.04 LTS because it is Long Term Support. But, we've upgraded packages as necessary. So, if we move to the latest stable Pacemaker, Cman and Corosync (and others?), how could this help? Well you might get 3+ years of bug fixes and performance improvements :-) > > Is there a way to get the clustering software to 'poll' faster? I mean, this NIC stalling at boot time only lasts about 2 seconds beyond the start of corosync. But, its 30 more seconds before the nodes see each other. I see lots of parameters in the totem directive that seem interesting. Would any of them be appropriate. Is there not a way to tell upstart not to start the cluster until the network is up? > > Andrew: Thanks for the prompt response. > > > Regards. > Mark K Vallevand > > "If there are no dogs in Heaven, then when I die I want to go where they went." > -Will Rogers > > THIS COMMUNICATION MAY CONTAIN CONFIDENTIAL AND/OR OTHERWISE PROPRIETARY MATERIAL and is thus for use only by the intended recipient. If you received this in error, please contact the sender and delete the e-mail and its attachments from all computers. > > > -----Original Message----- > From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Andrew Beekhof > Sent: Tuesday, September 16, 2014 08:51 PM > To: linux clustering > Subject: Re: [Linux-cluster] Cman (and corosync) starting before network interface is ready > > > On 17 Sep 2014, at 7:20 am, Vallevand, Mark K wrote: > >> It looks like there is some odd delay in getting a network interface up and ready. So, when cman starts corosync, it can't get to the cluster. So, for a time, the node is a member of a cluster-of-one. The cluster-of-one begins starting resources. > > 1. enable two-node mode in cluster.conf (man page should indicate where/how) then disable no-quorum-policy=ignore > 2. configure fencing > 3. find a newer version of pacemaker, we're up to .12 now > >> A few seconds later, when the interface finally is up and ready, it takes about 30 more seconds for the cluster-of-one to finally rejoin the larger cluster. The doubly-started resources are sorted out and all ends up OK. >> >> Now, this is not a good thing to have these particular resources running twice. I'd really like the clustering software to behave better. But, I'm not sure what 'behave better' would be. >> >> Is it possible to introduce a delay into cman or corosync startup? Is that even wise? >> Is there a parameter to get the clustering software to poll more often when it can't rejoin the cluster? >> >> Any suggestions would be welcome. >> >> Running Ubuntu 12.04 LTS. Pacemaker 1.1.6. Cman 3.1.7. Corosync 1.4.2. >> >> Regards. >> Mark K Vallevand >> "If there are no dogs in Heaven, then when I die I want to go where they went." >> -Will Rogers >> >> THIS COMMUNICATION MAY CONTAIN CONFIDENTIAL AND/OR OTHERWISE PROPRIETARY MATERIAL and is thus for use only by the intended recipient. 
If you received this in error, please contact the sender and delete the e-mail and its attachments from all computers. >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 841 bytes Desc: Message signed with OpenPGP using GPGMail URL: From ccaulfie at redhat.com Thu Sep 18 08:18:46 2014 From: ccaulfie at redhat.com (Christine Caulfield) Date: Thu, 18 Sep 2014 09:18:46 +0100 Subject: [Linux-cluster] Cman (and corosync) starting before network interface is ready In-Reply-To: References: <99C8B2929B39C24493377AC7A121E21FFEA7CE969D@USEA-EXCH8.na.uis.unisys.com> <52D657CE-40C3-45C6-A50D-73D83CA5930F@beekhof.net> <99C8B2929B39C24493377AC7A121E21FFEA7CE9CF9@USEA-EXCH8.na.uis.unisys.com> Message-ID: <541A9566.8060004@redhat.com> On 18/09/14 02:35, Andrew Beekhof wrote: > > On 18 Sep 2014, at 12:34 am, Vallevand, Mark K wrote: > >> Thanks. >> >> 1. I didn't know about two-node mode. Thanks. We are testing with two nodes and "crm configure property no-quorum-policy=ignore". When one node goes down, the other node continues clustering. This is the desired behavior. What will in cluster.conf do? > > I was all set to be a smart-ass and say 'man cluster.conf', but the joke is on me as my colleagues do not appear to have documented it anywhere. > Chrissie: Can you elaborate on the details here please? > it's documented in the cman(5) man page. The entries in cluster.conf only cover the general parts that are not specific to any subsystem. So corosync items are documented in the corosync man page and cman ones in the cman man page etc. Chrissie > (Short version, it should do what you want) > >> 2. Yes, fencing is part of our plan, but not at this time. In the configurations we are testing, fencing is a RFPITA. >> 3. We could move up. We like Ubuntu 12.04 LTS because it is Long Term Support. But, we've upgraded packages as necessary. So, if we move to the latest stable Pacemaker, Cman and Corosync (and others?), how could this help? > > Well you might get 3+ years of bug fixes and performance improvements :-) > >> >> Is there a way to get the clustering software to 'poll' faster? I mean, this NIC stalling at boot time only lasts about 2 seconds beyond the start of corosync. But, its 30 more seconds before the nodes see each other. I see lots of parameters in the totem directive that seem interesting. Would any of them be appropriate. > > Is there not a way to tell upstart not to start the cluster until the network is up? > >> >> Andrew: Thanks for the prompt response. >> >> >> Regards. >> Mark K Vallevand >> >> "If there are no dogs in Heaven, then when I die I want to go where they went." >> -Will Rogers >> >> THIS COMMUNICATION MAY CONTAIN CONFIDENTIAL AND/OR OTHERWISE PROPRIETARY MATERIAL and is thus for use only by the intended recipient. If you received this in error, please contact the sender and delete the e-mail and its attachments from all computers. 
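On the question quoted above about telling upstart not to start the cluster until the network is up: one low-tech workaround, sketched here only as an illustration (the interface name eth1, the 30-second cap, and the wrapper-script approach are assumptions, not something proposed in this thread), is to wait for carrier on the interconnect NIC before the cman init script runs:

    #!/bin/sh
    # Wait up to 30 seconds for link on the cluster interconnect (eth1 assumed)
    # so corosync does not start on an interface that has no carrier yet.
    i=0
    while [ "$(cat /sys/class/net/eth1/carrier 2>/dev/null)" != "1" ] && [ $i -lt 30 ]; do
        sleep 1
        i=$((i+1))
    done
    service cman start

This does not replace raising post_join_delay (discussed further down); it only narrows the window in which corosync comes up on a dead link.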
>> >> >> -----Original Message----- >> From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Andrew Beekhof >> Sent: Tuesday, September 16, 2014 08:51 PM >> To: linux clustering >> Subject: Re: [Linux-cluster] Cman (and corosync) starting before network interface is ready >> >> >> On 17 Sep 2014, at 7:20 am, Vallevand, Mark K wrote: >> >>> It looks like there is some odd delay in getting a network interface up and ready. So, when cman starts corosync, it can't get to the cluster. So, for a time, the node is a member of a cluster-of-one. The cluster-of-one begins starting resources. >> >> 1. enable two-node mode in cluster.conf (man page should indicate where/how) then disable no-quorum-policy=ignore >> 2. configure fencing >> 3. find a newer version of pacemaker, we're up to .12 now >> >>> A few seconds later, when the interface finally is up and ready, it takes about 30 more seconds for the cluster-of-one to finally rejoin the larger cluster. The doubly-started resources are sorted out and all ends up OK. >>> >>> Now, this is not a good thing to have these particular resources running twice. I'd really like the clustering software to behave better. But, I'm not sure what 'behave better' would be. >>> >>> Is it possible to introduce a delay into cman or corosync startup? Is that even wise? >>> Is there a parameter to get the clustering software to poll more often when it can't rejoin the cluster? >>> >>> Any suggestions would be welcome. >>> >>> Running Ubuntu 12.04 LTS. Pacemaker 1.1.6. Cman 3.1.7. Corosync 1.4.2. >>> >>> Regards. >>> Mark K Vallevand >>> "If there are no dogs in Heaven, then when I die I want to go where they went." >>> -Will Rogers >>> >>> THIS COMMUNICATION MAY CONTAIN CONFIDENTIAL AND/OR OTHERWISE PROPRIETARY MATERIAL and is thus for use only by the intended recipient. If you received this in error, please contact the sender and delete the e-mail and its attachments from all computers. >>> -- >>> Linux-cluster mailing list >>> Linux-cluster at redhat.com >>> https://www.redhat.com/mailman/listinfo/linux-cluster >> >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster > From andrew at beekhof.net Thu Sep 18 08:29:13 2014 From: andrew at beekhof.net (Andrew Beekhof) Date: Thu, 18 Sep 2014 18:29:13 +1000 Subject: [Linux-cluster] Cman (and corosync) starting before network interface is ready In-Reply-To: <541A9566.8060004@redhat.com> References: <99C8B2929B39C24493377AC7A121E21FFEA7CE969D@USEA-EXCH8.na.uis.unisys.com> <52D657CE-40C3-45C6-A50D-73D83CA5930F@beekhof.net> <99C8B2929B39C24493377AC7A121E21FFEA7CE9CF9@USEA-EXCH8.na.uis.unisys.com> <541A9566.8060004@redhat.com> Message-ID: <13486205-A1B5-4135-B61B-98326500BECC@beekhof.net> On 18 Sep 2014, at 6:18 pm, Christine Caulfield wrote: > On 18/09/14 02:35, Andrew Beekhof wrote: >> >> On 18 Sep 2014, at 12:34 am, Vallevand, Mark K wrote: >> >>> Thanks. >>> >>> 1. I didn't know about two-node mode. Thanks. We are testing with two nodes and "crm configure property no-quorum-policy=ignore". When one node goes down, the other node continues clustering. This is the desired behavior. What will in cluster.conf do? >> >> I was all set to be a smart-ass and say 'man cluster.conf', but the joke is on me as my colleagues do not appear to have documented it anywhere. >> Chrissie: Can you elaborate on the details here please? >> > > it's documented in the cman(5) man page. 
The entries in cluster.conf only cover the general parts that are not specific to any subsystem. So corosync items are documented in the corosync man page and cman ones in the cman man page etc. Ah! Good to know. Two node clusters Ordinarily, the loss of quorum after one out of two nodes fails will prevent the remaining node from continuing (if both nodes have one vote.) Special configuration options can be set to allow the one remaining node to continue operating if the other fails. To do this only two nodes, each with one vote, can be defined in cluster.conf. The two_node and expected_votes values must then be set to 1 in the cman section as follows. One thing thats not clear to me is what happens when a single node comes up and can only see itself. Does it get quorum or is it like wait-for-all in corosync2? > > Chrissie > > >> (Short version, it should do what you want) >> >>> 2. Yes, fencing is part of our plan, but not at this time. In the configurations we are testing, fencing is a RFPITA. >>> 3. We could move up. We like Ubuntu 12.04 LTS because it is Long Term Support. But, we've upgraded packages as necessary. So, if we move to the latest stable Pacemaker, Cman and Corosync (and others?), how could this help? >> >> Well you might get 3+ years of bug fixes and performance improvements :-) >> >>> >>> Is there a way to get the clustering software to 'poll' faster? I mean, this NIC stalling at boot time only lasts about 2 seconds beyond the start of corosync. But, its 30 more seconds before the nodes see each other. I see lots of parameters in the totem directive that seem interesting. Would any of them be appropriate. >> >> Is there not a way to tell upstart not to start the cluster until the network is up? >> >>> >>> Andrew: Thanks for the prompt response. >>> >>> >>> Regards. >>> Mark K Vallevand >>> >>> "If there are no dogs in Heaven, then when I die I want to go where they went." >>> -Will Rogers >>> >>> THIS COMMUNICATION MAY CONTAIN CONFIDENTIAL AND/OR OTHERWISE PROPRIETARY MATERIAL and is thus for use only by the intended recipient. If you received this in error, please contact the sender and delete the e-mail and its attachments from all computers. >>> >>> >>> -----Original Message----- >>> From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Andrew Beekhof >>> Sent: Tuesday, September 16, 2014 08:51 PM >>> To: linux clustering >>> Subject: Re: [Linux-cluster] Cman (and corosync) starting before network interface is ready >>> >>> >>> On 17 Sep 2014, at 7:20 am, Vallevand, Mark K wrote: >>> >>>> It looks like there is some odd delay in getting a network interface up and ready. So, when cman starts corosync, it can't get to the cluster. So, for a time, the node is a member of a cluster-of-one. The cluster-of-one begins starting resources. >>> >>> 1. enable two-node mode in cluster.conf (man page should indicate where/how) then disable no-quorum-policy=ignore >>> 2. configure fencing >>> 3. find a newer version of pacemaker, we're up to .12 now >>> >>>> A few seconds later, when the interface finally is up and ready, it takes about 30 more seconds for the cluster-of-one to finally rejoin the larger cluster. The doubly-started resources are sorted out and all ends up OK. >>>> >>>> Now, this is not a good thing to have these particular resources running twice. I'd really like the clustering software to behave better. But, I'm not sure what 'behave better' would be. 
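A quick way to watch the symptom described above during testing, assuming the stock cman and pacemaker command-line tools shipped with these packages, is to keep an eye on membership and resource state on the rebooted node:

    watch -n 1 'cman_tool nodes; echo; crm_mon -1'

While the node is still a cluster-of-one, only the local node shows as a member; once the totem ring forms, the peer appears and pacemaker sorts out the doubly-started resources, as described above.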
>>>> >>>> Is it possible to introduce a delay into cman or corosync startup? Is that even wise? >>>> Is there a parameter to get the clustering software to poll more often when it can't rejoin the cluster? >>>> >>>> Any suggestions would be welcome. >>>> >>>> Running Ubuntu 12.04 LTS. Pacemaker 1.1.6. Cman 3.1.7. Corosync 1.4.2. >>>> >>>> Regards. >>>> Mark K Vallevand >>>> "If there are no dogs in Heaven, then when I die I want to go where they went." >>>> -Will Rogers >>>> >>>> THIS COMMUNICATION MAY CONTAIN CONFIDENTIAL AND/OR OTHERWISE PROPRIETARY MATERIAL and is thus for use only by the intended recipient. If you received this in error, please contact the sender and delete the e-mail and its attachments from all computers. >>>> -- >>>> Linux-cluster mailing list >>>> Linux-cluster at redhat.com >>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>> >>> >>> -- >>> Linux-cluster mailing list >>> Linux-cluster at redhat.com >>> https://www.redhat.com/mailman/listinfo/linux-cluster >> > -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 841 bytes Desc: Message signed with OpenPGP using GPGMail URL: From ccaulfie at redhat.com Thu Sep 18 08:33:27 2014 From: ccaulfie at redhat.com (Christine Caulfield) Date: Thu, 18 Sep 2014 09:33:27 +0100 Subject: [Linux-cluster] Cman (and corosync) starting before network interface is ready In-Reply-To: <13486205-A1B5-4135-B61B-98326500BECC@beekhof.net> References: <99C8B2929B39C24493377AC7A121E21FFEA7CE969D@USEA-EXCH8.na.uis.unisys.com> <52D657CE-40C3-45C6-A50D-73D83CA5930F@beekhof.net> <99C8B2929B39C24493377AC7A121E21FFEA7CE9CF9@USEA-EXCH8.na.uis.unisys.com> <541A9566.8060004@redhat.com> <13486205-A1B5-4135-B61B-98326500BECC@beekhof.net> Message-ID: <541A98D7.8090306@redhat.com> On 18/09/14 09:29, Andrew Beekhof wrote: > > On 18 Sep 2014, at 6:18 pm, Christine Caulfield wrote: > >> On 18/09/14 02:35, Andrew Beekhof wrote: >>> >>> On 18 Sep 2014, at 12:34 am, Vallevand, Mark K wrote: >>> >>>> Thanks. >>>> >>>> 1. I didn't know about two-node mode. Thanks. We are testing with two nodes and "crm configure property no-quorum-policy=ignore". When one node goes down, the other node continues clustering. This is the desired behavior. What will in cluster.conf do? >>> >>> I was all set to be a smart-ass and say 'man cluster.conf', but the joke is on me as my colleagues do not appear to have documented it anywhere. >>> Chrissie: Can you elaborate on the details here please? >>> >> >> it's documented in the cman(5) man page. The entries in cluster.conf only cover the general parts that are not specific to any subsystem. So corosync items are documented in the corosync man page and cman ones in the cman man page etc. > > Ah! Good to know. > > Two node clusters > Ordinarily, the loss of quorum after one out of two nodes fails will prevent the remaining node from continuing (if both nodes have one vote.) Special configuration options can be set to allow the one remaining node to continue operating if the other > fails. To do this only two nodes, each with one vote, can be defined in cluster.conf. The two_node and expected_votes values must then be set to 1 in the cman section as follows. > > > > > One thing thats not clear to me is what happens when a single node comes up and can only see itself. > Does it get quorum or is it like wait-for-all in corosync2? > There's no wait_for_all in cman. 
The first node up will attempt (after fence_join_delay) the other node in an attempt to stop a split brain. This is one of several reasons why we insist that the fencing is on a separate network to heartbeat on a two_node cluster. Chrissie >> >> Chrissie >> >> >>> (Short version, it should do what you want) >>> >>>> 2. Yes, fencing is part of our plan, but not at this time. In the configurations we are testing, fencing is a RFPITA. >>>> 3. We could move up. We like Ubuntu 12.04 LTS because it is Long Term Support. But, we've upgraded packages as necessary. So, if we move to the latest stable Pacemaker, Cman and Corosync (and others?), how could this help? >>> >>> Well you might get 3+ years of bug fixes and performance improvements :-) >>> >>>> >>>> Is there a way to get the clustering software to 'poll' faster? I mean, this NIC stalling at boot time only lasts about 2 seconds beyond the start of corosync. But, its 30 more seconds before the nodes see each other. I see lots of parameters in the totem directive that seem interesting. Would any of them be appropriate. >>> >>> Is there not a way to tell upstart not to start the cluster until the network is up? >>> >>>> >>>> Andrew: Thanks for the prompt response. >>>> >>>> >>>> Regards. >>>> Mark K Vallevand >>>> >>>> "If there are no dogs in Heaven, then when I die I want to go where they went." >>>> -Will Rogers >>>> >>>> THIS COMMUNICATION MAY CONTAIN CONFIDENTIAL AND/OR OTHERWISE PROPRIETARY MATERIAL and is thus for use only by the intended recipient. If you received this in error, please contact the sender and delete the e-mail and its attachments from all computers. >>>> >>>> >>>> -----Original Message----- >>>> From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Andrew Beekhof >>>> Sent: Tuesday, September 16, 2014 08:51 PM >>>> To: linux clustering >>>> Subject: Re: [Linux-cluster] Cman (and corosync) starting before network interface is ready >>>> >>>> >>>> On 17 Sep 2014, at 7:20 am, Vallevand, Mark K wrote: >>>> >>>>> It looks like there is some odd delay in getting a network interface up and ready. So, when cman starts corosync, it can't get to the cluster. So, for a time, the node is a member of a cluster-of-one. The cluster-of-one begins starting resources. >>>> >>>> 1. enable two-node mode in cluster.conf (man page should indicate where/how) then disable no-quorum-policy=ignore >>>> 2. configure fencing >>>> 3. find a newer version of pacemaker, we're up to .12 now >>>> >>>>> A few seconds later, when the interface finally is up and ready, it takes about 30 more seconds for the cluster-of-one to finally rejoin the larger cluster. The doubly-started resources are sorted out and all ends up OK. >>>>> >>>>> Now, this is not a good thing to have these particular resources running twice. I'd really like the clustering software to behave better. But, I'm not sure what 'behave better' would be. >>>>> >>>>> Is it possible to introduce a delay into cman or corosync startup? Is that even wise? >>>>> Is there a parameter to get the clustering software to poll more often when it can't rejoin the cluster? >>>>> >>>>> Any suggestions would be welcome. >>>>> >>>>> Running Ubuntu 12.04 LTS. Pacemaker 1.1.6. Cman 3.1.7. Corosync 1.4.2. >>>>> >>>>> Regards. >>>>> Mark K Vallevand >>>>> "If there are no dogs in Heaven, then when I die I want to go where they went." 
>>>>> -Will Rogers >>>>> >>>>> THIS COMMUNICATION MAY CONTAIN CONFIDENTIAL AND/OR OTHERWISE PROPRIETARY MATERIAL and is thus for use only by the intended recipient. If you received this in error, please contact the sender and delete the e-mail and its attachments from all computers. >>>>> -- >>>>> Linux-cluster mailing list >>>>> Linux-cluster at redhat.com >>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>> >>>> >>>> -- >>>> Linux-cluster mailing list >>>> Linux-cluster at redhat.com >>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>> >> > From andrew at beekhof.net Thu Sep 18 10:21:53 2014 From: andrew at beekhof.net (Andrew Beekhof) Date: Thu, 18 Sep 2014 20:21:53 +1000 Subject: [Linux-cluster] Cman (and corosync) starting before network interface is ready In-Reply-To: <541A98D7.8090306@redhat.com> References: <99C8B2929B39C24493377AC7A121E21FFEA7CE969D@USEA-EXCH8.na.uis.unisys.com> <52D657CE-40C3-45C6-A50D-73D83CA5930F@beekhof.net> <99C8B2929B39C24493377AC7A121E21FFEA7CE9CF9@USEA-EXCH8.na.uis.unisys.com> <541A9566.8060004@redhat.com> <13486205-A1B5-4135-B61B-98326500BECC@beekhof.net> <541A98D7.8090306@redhat.com> Message-ID: That doesn't sound much different to no-quorum-policy=ignore So I guess it won't help here Sent from my iPad > On 18 Sep 2014, at 6:33 pm, Christine Caulfield wrote: > >> On 18/09/14 09:29, Andrew Beekhof wrote: >> >>> On 18 Sep 2014, at 6:18 pm, Christine Caulfield wrote: >>> >>>> On 18/09/14 02:35, Andrew Beekhof wrote: >>>> >>>>> On 18 Sep 2014, at 12:34 am, Vallevand, Mark K wrote: >>>>> >>>>> Thanks. >>>>> >>>>> 1. I didn't know about two-node mode. Thanks. We are testing with two nodes and "crm configure property no-quorum-policy=ignore". When one node goes down, the other node continues clustering. This is the desired behavior. What will in cluster.conf do? >>>> >>>> I was all set to be a smart-ass and say 'man cluster.conf', but the joke is on me as my colleagues do not appear to have documented it anywhere. >>>> Chrissie: Can you elaborate on the details here please? >>>> >>> >>> it's documented in the cman(5) man page. The entries in cluster.conf only cover the general parts that are not specific to any subsystem. So corosync items are documented in the corosync man page and cman ones in the cman man page etc. >> >> Ah! Good to know. >> >> Two node clusters >> Ordinarily, the loss of quorum after one out of two nodes fails will prevent the remaining node from continuing (if both nodes have one vote.) Special configuration options can be set to allow the one remaining node to continue operating if the other >> fails. To do this only two nodes, each with one vote, can be defined in cluster.conf. The two_node and expected_votes values must then be set to 1 in the cman section as follows. >> >> >> >> >> One thing thats not clear to me is what happens when a single node comes up and can only see itself. >> Does it get quorum or is it like wait-for-all in corosync2? >> > > > There's no wait_for_all in cman. The first node up will attempt (after fence_join_delay) the other node in an attempt to stop a split brain. > > This is one of several reasons why we insist that the fencing is on a separate network to heartbeat on a two_node cluster. > > > Chrissie > >>> >>> Chrissie >>> >>> >>>> (Short version, it should do what you want) >>>> >>>>> 2. Yes, fencing is part of our plan, but not at this time. In the configurations we are testing, fencing is a RFPITA. >>>>> 3. We could move up. We like Ubuntu 12.04 LTS because it is Long Term Support. 
But, we've upgraded packages as necessary. So, if we move to the latest stable Pacemaker, Cman and Corosync (and others?), how could this help? >>>> >>>> Well you might get 3+ years of bug fixes and performance improvements :-) >>>> >>>>> >>>>> Is there a way to get the clustering software to 'poll' faster? I mean, this NIC stalling at boot time only lasts about 2 seconds beyond the start of corosync. But, its 30 more seconds before the nodes see each other. I see lots of parameters in the totem directive that seem interesting. Would any of them be appropriate. >>>> >>>> Is there not a way to tell upstart not to start the cluster until the network is up? >>>> >>>>> >>>>> Andrew: Thanks for the prompt response. >>>>> >>>>> >>>>> Regards. >>>>> Mark K Vallevand >>>>> >>>>> "If there are no dogs in Heaven, then when I die I want to go where they went." >>>>> -Will Rogers >>>>> >>>>> THIS COMMUNICATION MAY CONTAIN CONFIDENTIAL AND/OR OTHERWISE PROPRIETARY MATERIAL and is thus for use only by the intended recipient. If you received this in error, please contact the sender and delete the e-mail and its attachments from all computers. >>>>> >>>>> >>>>> -----Original Message----- >>>>> From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Andrew Beekhof >>>>> Sent: Tuesday, September 16, 2014 08:51 PM >>>>> To: linux clustering >>>>> Subject: Re: [Linux-cluster] Cman (and corosync) starting before network interface is ready >>>>> >>>>> >>>>>> On 17 Sep 2014, at 7:20 am, Vallevand, Mark K wrote: >>>>>> >>>>>> It looks like there is some odd delay in getting a network interface up and ready. So, when cman starts corosync, it can't get to the cluster. So, for a time, the node is a member of a cluster-of-one. The cluster-of-one begins starting resources. >>>>> >>>>> 1. enable two-node mode in cluster.conf (man page should indicate where/how) then disable no-quorum-policy=ignore >>>>> 2. configure fencing >>>>> 3. find a newer version of pacemaker, we're up to .12 now >>>>> >>>>>> A few seconds later, when the interface finally is up and ready, it takes about 30 more seconds for the cluster-of-one to finally rejoin the larger cluster. The doubly-started resources are sorted out and all ends up OK. >>>>>> >>>>>> Now, this is not a good thing to have these particular resources running twice. I'd really like the clustering software to behave better. But, I'm not sure what 'behave better' would be. >>>>>> >>>>>> Is it possible to introduce a delay into cman or corosync startup? Is that even wise? >>>>>> Is there a parameter to get the clustering software to poll more often when it can't rejoin the cluster? >>>>>> >>>>>> Any suggestions would be welcome. >>>>>> >>>>>> Running Ubuntu 12.04 LTS. Pacemaker 1.1.6. Cman 3.1.7. Corosync 1.4.2. >>>>>> >>>>>> Regards. >>>>>> Mark K Vallevand >>>>>> "If there are no dogs in Heaven, then when I die I want to go where they went." >>>>>> -Will Rogers >>>>>> >>>>>> THIS COMMUNICATION MAY CONTAIN CONFIDENTIAL AND/OR OTHERWISE PROPRIETARY MATERIAL and is thus for use only by the intended recipient. If you received this in error, please contact the sender and delete the e-mail and its attachments from all computers. 
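On the totem parameters mentioned above: with cman they are set in cluster.conf rather than in corosync.conf, since cman generates the corosync configuration from cluster.conf. A purely illustrative fragment, with example values rather than recommendations (see cman(5) for the attributes that are actually honoured), would be:

    <totem token="10000" consensus="12000"/>

Shorter timeouts make the membership react faster, but they also make spurious fencing more likely on a link that is still negotiating, so this is probably not the right fix for a slow switch port.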
>>>>>> -- >>>>>> Linux-cluster mailing list >>>>>> Linux-cluster at redhat.com >>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>> >>>>> >>>>> -- >>>>> Linux-cluster mailing list >>>>> Linux-cluster at redhat.com >>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>> >>> >> > From Mark.Vallevand at UNISYS.com Thu Sep 18 13:09:41 2014 From: Mark.Vallevand at UNISYS.com (Vallevand, Mark K) Date: Thu, 18 Sep 2014 08:09:41 -0500 Subject: [Linux-cluster] Cman (and corosync) starting before network interface is ready In-Reply-To: <541A98D7.8090306@redhat.com> References: <99C8B2929B39C24493377AC7A121E21FFEA7CE969D@USEA-EXCH8.na.uis.unisys.com> <52D657CE-40C3-45C6-A50D-73D83CA5930F@beekhof.net> <99C8B2929B39C24493377AC7A121E21FFEA7CE9CF9@USEA-EXCH8.na.uis.unisys.com> <541A9566.8060004@redhat.com> <13486205-A1B5-4135-B61B-98326500BECC@beekhof.net> <541A98D7.8090306@redhat.com> Message-ID: <99C8B2929B39C24493377AC7A121E21FFEA7D915A6@USEA-EXCH8.na.uis.unisys.com> Hmmm. I'm still curious what two_node exactly does. In my testing, the clustering software comes up before the network is completely ready. (Why? That's another day.) With just no-quorum-policy=ignore, regardless of the fence_join_delay value, the rebooted node fences the other node and starts up all split-brain. It takes about 30 seconds or so after the network is ready for the split brain to be detected. With no-quorum-policy=ignore and two_node="1" expected_votes="1", regardless of the fence_join_delay value, the rebooted node fences the other node, but as soon as the network is ready the other node joins the network and there is no split-brain. I'm happy that things are working, but I'm still curious for some idea about what two_node does. Regards. Mark K Vallevand "If there are no dogs in Heaven, then when I die I want to go where they went." -Will Rogers THIS COMMUNICATION MAY CONTAIN CONFIDENTIAL AND/OR OTHERWISE PROPRIETARY MATERIAL and is thus for use only by the intended recipient. If you received this in error, please contact the sender and delete the e-mail and its attachments from all computers. -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Christine Caulfield Sent: Thursday, September 18, 2014 03:33 AM To: Andrew Beekhof Cc: linux clustering Subject: Re: [Linux-cluster] Cman (and corosync) starting before network interface is ready On 18/09/14 09:29, Andrew Beekhof wrote: > > On 18 Sep 2014, at 6:18 pm, Christine Caulfield wrote: > >> On 18/09/14 02:35, Andrew Beekhof wrote: >>> >>> On 18 Sep 2014, at 12:34 am, Vallevand, Mark K wrote: >>> >>>> Thanks. >>>> >>>> 1. I didn't know about two-node mode. Thanks. We are testing with two nodes and "crm configure property no-quorum-policy=ignore". When one node goes down, the other node continues clustering. This is the desired behavior. What will in cluster.conf do? >>> >>> I was all set to be a smart-ass and say 'man cluster.conf', but the joke is on me as my colleagues do not appear to have documented it anywhere. >>> Chrissie: Can you elaborate on the details here please? >>> >> >> it's documented in the cman(5) man page. The entries in cluster.conf only cover the general parts that are not specific to any subsystem. So corosync items are documented in the corosync man page and cman ones in the cman man page etc. > > Ah! Good to know. 
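One way to see what two_node actually changes on a running node, assuming the stock cman tools (field names may differ slightly between versions), is to look at the vote arithmetic:

    cman_tool status | grep -iE 'expected votes|total votes|quorum|flags'

With two_node="1" the quorum value should stay at 1 and the flags line should report the two-node special case, which is why a lone node keeps running; without it, quorum for a two-node cluster is 2 and a single node is inquorate.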
> > Two node clusters > Ordinarily, the loss of quorum after one out of two nodes fails will prevent the remaining node from continuing (if both nodes have one vote.) Special configuration options can be set to allow the one remaining node to continue operating if the other > fails. To do this only two nodes, each with one vote, can be defined in cluster.conf. The two_node and expected_votes values must then be set to 1 in the cman section as follows. > > > > > One thing thats not clear to me is what happens when a single node comes up and can only see itself. > Does it get quorum or is it like wait-for-all in corosync2? > There's no wait_for_all in cman. The first node up will attempt (after fence_join_delay) the other node in an attempt to stop a split brain. This is one of several reasons why we insist that the fencing is on a separate network to heartbeat on a two_node cluster. Chrissie >> >> Chrissie >> >> >>> (Short version, it should do what you want) >>> >>>> 2. Yes, fencing is part of our plan, but not at this time. In the configurations we are testing, fencing is a RFPITA. >>>> 3. We could move up. We like Ubuntu 12.04 LTS because it is Long Term Support. But, we've upgraded packages as necessary. So, if we move to the latest stable Pacemaker, Cman and Corosync (and others?), how could this help? >>> >>> Well you might get 3+ years of bug fixes and performance improvements :-) >>> >>>> >>>> Is there a way to get the clustering software to 'poll' faster? I mean, this NIC stalling at boot time only lasts about 2 seconds beyond the start of corosync. But, its 30 more seconds before the nodes see each other. I see lots of parameters in the totem directive that seem interesting. Would any of them be appropriate. >>> >>> Is there not a way to tell upstart not to start the cluster until the network is up? >>> >>>> >>>> Andrew: Thanks for the prompt response. >>>> >>>> >>>> Regards. >>>> Mark K Vallevand >>>> >>>> "If there are no dogs in Heaven, then when I die I want to go where they went." >>>> -Will Rogers >>>> >>>> THIS COMMUNICATION MAY CONTAIN CONFIDENTIAL AND/OR OTHERWISE PROPRIETARY MATERIAL and is thus for use only by the intended recipient. If you received this in error, please contact the sender and delete the e-mail and its attachments from all computers. >>>> >>>> >>>> -----Original Message----- >>>> From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Andrew Beekhof >>>> Sent: Tuesday, September 16, 2014 08:51 PM >>>> To: linux clustering >>>> Subject: Re: [Linux-cluster] Cman (and corosync) starting before network interface is ready >>>> >>>> >>>> On 17 Sep 2014, at 7:20 am, Vallevand, Mark K wrote: >>>> >>>>> It looks like there is some odd delay in getting a network interface up and ready. So, when cman starts corosync, it can't get to the cluster. So, for a time, the node is a member of a cluster-of-one. The cluster-of-one begins starting resources. >>>> >>>> 1. enable two-node mode in cluster.conf (man page should indicate where/how) then disable no-quorum-policy=ignore >>>> 2. configure fencing >>>> 3. find a newer version of pacemaker, we're up to .12 now >>>> >>>>> A few seconds later, when the interface finally is up and ready, it takes about 30 more seconds for the cluster-of-one to finally rejoin the larger cluster. The doubly-started resources are sorted out and all ends up OK. >>>>> >>>>> Now, this is not a good thing to have these particular resources running twice. 
I'd really like the clustering software to behave better. But, I'm not sure what 'behave better' would be. >>>>> >>>>> Is it possible to introduce a delay into cman or corosync startup? Is that even wise? >>>>> Is there a parameter to get the clustering software to poll more often when it can't rejoin the cluster? >>>>> >>>>> Any suggestions would be welcome. >>>>> >>>>> Running Ubuntu 12.04 LTS. Pacemaker 1.1.6. Cman 3.1.7. Corosync 1.4.2. >>>>> >>>>> Regards. >>>>> Mark K Vallevand >>>>> "If there are no dogs in Heaven, then when I die I want to go where they went." >>>>> -Will Rogers >>>>> >>>>> THIS COMMUNICATION MAY CONTAIN CONFIDENTIAL AND/OR OTHERWISE PROPRIETARY MATERIAL and is thus for use only by the intended recipient. If you received this in error, please contact the sender and delete the e-mail and its attachments from all computers. >>>>> -- >>>>> Linux-cluster mailing list >>>>> Linux-cluster at redhat.com >>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>> >>>> >>>> -- >>>> Linux-cluster mailing list >>>> Linux-cluster at redhat.com >>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>> >> > -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From Mark.Vallevand at UNISYS.com Thu Sep 18 13:09:34 2014 From: Mark.Vallevand at UNISYS.com (Vallevand, Mark K) Date: Thu, 18 Sep 2014 08:09:34 -0500 Subject: [Linux-cluster] Cman (and corosync) starting before network interface is ready In-Reply-To: <13486205-A1B5-4135-B61B-98326500BECC@beekhof.net> References: <99C8B2929B39C24493377AC7A121E21FFEA7CE969D@USEA-EXCH8.na.uis.unisys.com> <52D657CE-40C3-45C6-A50D-73D83CA5930F@beekhof.net> <99C8B2929B39C24493377AC7A121E21FFEA7CE9CF9@USEA-EXCH8.na.uis.unisys.com> <541A9566.8060004@redhat.com> <13486205-A1B5-4135-B61B-98326500BECC@beekhof.net> Message-ID: <99C8B2929B39C24493377AC7A121E21FFEA7D915A5@USEA-EXCH8.na.uis.unisys.com> I wish I had seen this in the Pacemaker Explained or Clusters From Scratch tutorials. Still, I'm used to reading man pages to ferret out all kinds of missing details from tutorials. I missed . Yes, we were investigating a change to delay clustering startup Again, many thanks. Regards. Mark K Vallevand "If there are no dogs in Heaven, then when I die I want to go where they went." -Will Rogers THIS COMMUNICATION MAY CONTAIN CONFIDENTIAL AND/OR OTHERWISE PROPRIETARY MATERIAL and is thus for use only by the intended recipient. If you received this in error, please contact the sender and delete the e-mail and its attachments from all computers. -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Andrew Beekhof Sent: Thursday, September 18, 2014 03:29 AM To: Christine Caulfield Cc: linux clustering Subject: Re: [Linux-cluster] Cman (and corosync) starting before network interface is ready On 18 Sep 2014, at 6:18 pm, Christine Caulfield wrote: > On 18/09/14 02:35, Andrew Beekhof wrote: >> >> On 18 Sep 2014, at 12:34 am, Vallevand, Mark K wrote: >> >>> Thanks. >>> >>> 1. I didn't know about two-node mode. Thanks. We are testing with two nodes and "crm configure property no-quorum-policy=ignore". When one node goes down, the other node continues clustering. This is the desired behavior. What will in cluster.conf do? >> >> I was all set to be a smart-ass and say 'man cluster.conf', but the joke is on me as my colleagues do not appear to have documented it anywhere. >> Chrissie: Can you elaborate on the details here please? 
>> > > it's documented in the cman(5) man page. The entries in cluster.conf only cover the general parts that are not specific to any subsystem. So corosync items are documented in the corosync man page and cman ones in the cman man page etc. Ah! Good to know. Two node clusters Ordinarily, the loss of quorum after one out of two nodes fails will prevent the remaining node from continuing (if both nodes have one vote.) Special configuration options can be set to allow the one remaining node to continue operating if the other fails. To do this only two nodes, each with one vote, can be defined in cluster.conf. The two_node and expected_votes values must then be set to 1 in the cman section as follows. One thing thats not clear to me is what happens when a single node comes up and can only see itself. Does it get quorum or is it like wait-for-all in corosync2? > > Chrissie > > >> (Short version, it should do what you want) >> >>> 2. Yes, fencing is part of our plan, but not at this time. In the configurations we are testing, fencing is a RFPITA. >>> 3. We could move up. We like Ubuntu 12.04 LTS because it is Long Term Support. But, we've upgraded packages as necessary. So, if we move to the latest stable Pacemaker, Cman and Corosync (and others?), how could this help? >> >> Well you might get 3+ years of bug fixes and performance improvements :-) >> >>> >>> Is there a way to get the clustering software to 'poll' faster? I mean, this NIC stalling at boot time only lasts about 2 seconds beyond the start of corosync. But, its 30 more seconds before the nodes see each other. I see lots of parameters in the totem directive that seem interesting. Would any of them be appropriate. >> >> Is there not a way to tell upstart not to start the cluster until the network is up? >> >>> >>> Andrew: Thanks for the prompt response. >>> >>> >>> Regards. >>> Mark K Vallevand >>> >>> "If there are no dogs in Heaven, then when I die I want to go where they went." >>> -Will Rogers >>> >>> THIS COMMUNICATION MAY CONTAIN CONFIDENTIAL AND/OR OTHERWISE PROPRIETARY MATERIAL and is thus for use only by the intended recipient. If you received this in error, please contact the sender and delete the e-mail and its attachments from all computers. >>> >>> >>> -----Original Message----- >>> From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Andrew Beekhof >>> Sent: Tuesday, September 16, 2014 08:51 PM >>> To: linux clustering >>> Subject: Re: [Linux-cluster] Cman (and corosync) starting before network interface is ready >>> >>> >>> On 17 Sep 2014, at 7:20 am, Vallevand, Mark K wrote: >>> >>>> It looks like there is some odd delay in getting a network interface up and ready. So, when cman starts corosync, it can't get to the cluster. So, for a time, the node is a member of a cluster-of-one. The cluster-of-one begins starting resources. >>> >>> 1. enable two-node mode in cluster.conf (man page should indicate where/how) then disable no-quorum-policy=ignore >>> 2. configure fencing >>> 3. find a newer version of pacemaker, we're up to .12 now >>> >>>> A few seconds later, when the interface finally is up and ready, it takes about 30 more seconds for the cluster-of-one to finally rejoin the larger cluster. The doubly-started resources are sorted out and all ends up OK. >>>> >>>> Now, this is not a good thing to have these particular resources running twice. I'd really like the clustering software to behave better. But, I'm not sure what 'behave better' would be. 
>>>> >>>> Is it possible to introduce a delay into cman or corosync startup? Is that even wise? >>>> Is there a parameter to get the clustering software to poll more often when it can't rejoin the cluster? >>>> >>>> Any suggestions would be welcome. >>>> >>>> Running Ubuntu 12.04 LTS. Pacemaker 1.1.6. Cman 3.1.7. Corosync 1.4.2. >>>> >>>> Regards. >>>> Mark K Vallevand >>>> "If there are no dogs in Heaven, then when I die I want to go where they went." >>>> -Will Rogers >>>> >>>> THIS COMMUNICATION MAY CONTAIN CONFIDENTIAL AND/OR OTHERWISE PROPRIETARY MATERIAL and is thus for use only by the intended recipient. If you received this in error, please contact the sender and delete the e-mail and its attachments from all computers. >>>> -- >>>> Linux-cluster mailing list >>>> Linux-cluster at redhat.com >>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>> >>> >>> -- >>> Linux-cluster mailing list >>> Linux-cluster at redhat.com >>> https://www.redhat.com/mailman/listinfo/linux-cluster >> > From ccaulfie at redhat.com Thu Sep 18 13:25:37 2014 From: ccaulfie at redhat.com (Christine Caulfield) Date: Thu, 18 Sep 2014 14:25:37 +0100 Subject: [Linux-cluster] Cman (and corosync) starting before network interface is ready In-Reply-To: <99C8B2929B39C24493377AC7A121E21FFEA7D915A6@USEA-EXCH8.na.uis.unisys.com> References: <99C8B2929B39C24493377AC7A121E21FFEA7CE969D@USEA-EXCH8.na.uis.unisys.com> <52D657CE-40C3-45C6-A50D-73D83CA5930F@beekhof.net> <99C8B2929B39C24493377AC7A121E21FFEA7CE9CF9@USEA-EXCH8.na.uis.unisys.com> <541A9566.8060004@redhat.com> <13486205-A1B5-4135-B61B-98326500BECC@beekhof.net> <541A98D7.8090306@redhat.com> <99C8B2929B39C24493377AC7A121E21FFEA7D915A6@USEA-EXCH8.na.uis.unisys.com> Message-ID: <541ADD51.9020201@redhat.com> On 18/09/14 14:09, Vallevand, Mark K wrote: > Hmmm. I'm still curious what two_node exactly does. > > In my testing, the clustering software comes up before the network is completely ready. (Why? That's another day.) > > With just no-quorum-policy=ignore, regardless of the fence_join_delay value, the rebooted node fences the other node and starts up all split-brain. It takes about 30 seconds or so after the network is ready for the split brain to be detected. > > With no-quorum-policy=ignore and two_node="1" expected_votes="1", regardless of the fence_join_delay value, the rebooted node fences the other node, but as soon as the network is ready the other node joins the network and there is no split-brain. > > I'm happy that things are working, but I'm still curious for some idea about what two_node does. > > two_node is simply to allow a 2 node cluster to remain quorate when one node is unavailable - it's a special case that allows the cluster to remain running when quorum is 1. It requires hardware fencing to make sure that one node is fenced and can't do any harm to the remaining node. It's nothing more complicated than that. If is set (and this is not directly part of two_node, but useful to know for all clusters) then the other node will not be fenced for x seconds after the first node starts up, which should take care of your fence trouble. Sorry, I misnamed the parameter in my first email. Chrissie > Regards. > Mark K Vallevand > > "If there are no dogs in Heaven, then when I die I want to go where they went." > -Will Rogers > > THIS COMMUNICATION MAY CONTAIN CONFIDENTIAL AND/OR OTHERWISE PROPRIETARY MATERIAL and is thus for use only by the intended recipient. 
If you received this in error, please contact the sender and delete the e-mail and its attachments from all computers. > > > -----Original Message----- > From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Christine Caulfield > Sent: Thursday, September 18, 2014 03:33 AM > To: Andrew Beekhof > Cc: linux clustering > Subject: Re: [Linux-cluster] Cman (and corosync) starting before network interface is ready > > On 18/09/14 09:29, Andrew Beekhof wrote: >> >> On 18 Sep 2014, at 6:18 pm, Christine Caulfield wrote: >> >>> On 18/09/14 02:35, Andrew Beekhof wrote: >>>> >>>> On 18 Sep 2014, at 12:34 am, Vallevand, Mark K wrote: >>>> >>>>> Thanks. >>>>> >>>>> 1. I didn't know about two-node mode. Thanks. We are testing with two nodes and "crm configure property no-quorum-policy=ignore". When one node goes down, the other node continues clustering. This is the desired behavior. What will in cluster.conf do? >>>> >>>> I was all set to be a smart-ass and say 'man cluster.conf', but the joke is on me as my colleagues do not appear to have documented it anywhere. >>>> Chrissie: Can you elaborate on the details here please? >>>> >>> >>> it's documented in the cman(5) man page. The entries in cluster.conf only cover the general parts that are not specific to any subsystem. So corosync items are documented in the corosync man page and cman ones in the cman man page etc. >> >> Ah! Good to know. >> >> Two node clusters >> Ordinarily, the loss of quorum after one out of two nodes fails will prevent the remaining node from continuing (if both nodes have one vote.) Special configuration options can be set to allow the one remaining node to continue operating if the other >> fails. To do this only two nodes, each with one vote, can be defined in cluster.conf. The two_node and expected_votes values must then be set to 1 in the cman section as follows. >> >> >> >> >> One thing thats not clear to me is what happens when a single node comes up and can only see itself. >> Does it get quorum or is it like wait-for-all in corosync2? >> > > > There's no wait_for_all in cman. The first node up will attempt (after > fence_join_delay) the other node in an attempt to stop a split brain. > > This is one of several reasons why we insist that the fencing is on a > separate network to heartbeat on a two_node cluster. > > > Chrissie > >>> >>> Chrissie >>> >>> >>>> (Short version, it should do what you want) >>>> >>>>> 2. Yes, fencing is part of our plan, but not at this time. In the configurations we are testing, fencing is a RFPITA. >>>>> 3. We could move up. We like Ubuntu 12.04 LTS because it is Long Term Support. But, we've upgraded packages as necessary. So, if we move to the latest stable Pacemaker, Cman and Corosync (and others?), how could this help? >>>> >>>> Well you might get 3+ years of bug fixes and performance improvements :-) >>>> >>>>> >>>>> Is there a way to get the clustering software to 'poll' faster? I mean, this NIC stalling at boot time only lasts about 2 seconds beyond the start of corosync. But, its 30 more seconds before the nodes see each other. I see lots of parameters in the totem directive that seem interesting. Would any of them be appropriate. >>>> >>>> Is there not a way to tell upstart not to start the cluster until the network is up? >>>> >>>>> >>>>> Andrew: Thanks for the prompt response. >>>>> >>>>> >>>>> Regards. >>>>> Mark K Vallevand >>>>> >>>>> "If there are no dogs in Heaven, then when I die I want to go where they went." 
>>>>> -Will Rogers >>>>> >>>>> THIS COMMUNICATION MAY CONTAIN CONFIDENTIAL AND/OR OTHERWISE PROPRIETARY MATERIAL and is thus for use only by the intended recipient. If you received this in error, please contact the sender and delete the e-mail and its attachments from all computers. >>>>> >>>>> >>>>> -----Original Message----- >>>>> From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Andrew Beekhof >>>>> Sent: Tuesday, September 16, 2014 08:51 PM >>>>> To: linux clustering >>>>> Subject: Re: [Linux-cluster] Cman (and corosync) starting before network interface is ready >>>>> >>>>> >>>>> On 17 Sep 2014, at 7:20 am, Vallevand, Mark K wrote: >>>>> >>>>>> It looks like there is some odd delay in getting a network interface up and ready. So, when cman starts corosync, it can't get to the cluster. So, for a time, the node is a member of a cluster-of-one. The cluster-of-one begins starting resources. >>>>> >>>>> 1. enable two-node mode in cluster.conf (man page should indicate where/how) then disable no-quorum-policy=ignore >>>>> 2. configure fencing >>>>> 3. find a newer version of pacemaker, we're up to .12 now >>>>> >>>>>> A few seconds later, when the interface finally is up and ready, it takes about 30 more seconds for the cluster-of-one to finally rejoin the larger cluster. The doubly-started resources are sorted out and all ends up OK. >>>>>> >>>>>> Now, this is not a good thing to have these particular resources running twice. I'd really like the clustering software to behave better. But, I'm not sure what 'behave better' would be. >>>>>> >>>>>> Is it possible to introduce a delay into cman or corosync startup? Is that even wise? >>>>>> Is there a parameter to get the clustering software to poll more often when it can't rejoin the cluster? >>>>>> >>>>>> Any suggestions would be welcome. >>>>>> >>>>>> Running Ubuntu 12.04 LTS. Pacemaker 1.1.6. Cman 3.1.7. Corosync 1.4.2. >>>>>> >>>>>> Regards. >>>>>> Mark K Vallevand >>>>>> "If there are no dogs in Heaven, then when I die I want to go where they went." >>>>>> -Will Rogers >>>>>> >>>>>> THIS COMMUNICATION MAY CONTAIN CONFIDENTIAL AND/OR OTHERWISE PROPRIETARY MATERIAL and is thus for use only by the intended recipient. If you received this in error, please contact the sender and delete the e-mail and its attachments from all computers. >>>>>> -- >>>>>> Linux-cluster mailing list >>>>>> Linux-cluster at redhat.com >>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>> >>>>> >>>>> -- >>>>> Linux-cluster mailing list >>>>> Linux-cluster at redhat.com >>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>> >>> >> > From Mark.Vallevand at UNISYS.com Thu Sep 18 13:39:47 2014 From: Mark.Vallevand at UNISYS.com (Vallevand, Mark K) Date: Thu, 18 Sep 2014 08:39:47 -0500 Subject: [Linux-cluster] Cman (and corosync) starting before network interface is ready In-Reply-To: <541ADD51.9020201@redhat.com> References: <99C8B2929B39C24493377AC7A121E21FFEA7CE969D@USEA-EXCH8.na.uis.unisys.com> <52D657CE-40C3-45C6-A50D-73D83CA5930F@beekhof.net> <99C8B2929B39C24493377AC7A121E21FFEA7CE9CF9@USEA-EXCH8.na.uis.unisys.com> <541A9566.8060004@redhat.com> <13486205-A1B5-4135-B61B-98326500BECC@beekhof.net> <541A98D7.8090306@redhat.com> <99C8B2929B39C24493377AC7A121E21FFEA7D915A6@USEA-EXCH8.na.uis.unisys.com> <541ADD51.9020201@redhat.com> Message-ID: <99C8B2929B39C24493377AC7A121E21FFEA7D91662@USEA-EXCH8.na.uis.unisys.com> Thanks! 
Looking back at my changes, it seems like I needed both two_node set and post_join_delay set to a larger value. Regards. Mark K Vallevand "If there are no dogs in Heaven, then when I die I want to go where they went." -Will Rogers THIS COMMUNICATION MAY CONTAIN CONFIDENTIAL AND/OR OTHERWISE PROPRIETARY MATERIAL and is thus for use only by the intended recipient. If you received this in error, please contact the sender and delete the e-mail and its attachments from all computers. -----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Christine Caulfield Sent: Thursday, September 18, 2014 08:26 AM To: linux-cluster at redhat.com Subject: Re: [Linux-cluster] Cman (and corosync) starting before network interface is ready On 18/09/14 14:09, Vallevand, Mark K wrote: > Hmmm. I'm still curious what two_node exactly does. > > In my testing, the clustering software comes up before the network is completely ready. (Why? That's another day.) > > With just no-quorum-policy=ignore, regardless of the fence_join_delay value, the rebooted node fences the other node and starts up all split-brain. It takes about 30 seconds or so after the network is ready for the split brain to be detected. > > With no-quorum-policy=ignore and two_node="1" expected_votes="1", regardless of the fence_join_delay value, the rebooted node fences the other node, but as soon as the network is ready the other node joins the network and there is no split-brain. > > I'm happy that things are working, but I'm still curious for some idea about what two_node does. > > two_node is simply to allow a 2 node cluster to remain quorate when one node is unavailable - it's a special case that allows the cluster to remain running when quorum is 1. It requires hardware fencing to make sure that one node is fenced and can't do any harm to the remaining node. It's nothing more complicated than that. If is set (and this is not directly part of two_node, but useful to know for all clusters) then the other node will not be fenced for x seconds after the first node starts up, which should take care of your fence trouble. Sorry, I misnamed the parameter in my first email. Chrissie > Regards. > Mark K Vallevand > > "If there are no dogs in Heaven, then when I die I want to go where they went." > -Will Rogers > > THIS COMMUNICATION MAY CONTAIN CONFIDENTIAL AND/OR OTHERWISE PROPRIETARY MATERIAL and is thus for use only by the intended recipient. If you received this in error, please contact the sender and delete the e-mail and its attachments from all computers. > > > -----Original Message----- > From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Christine Caulfield > Sent: Thursday, September 18, 2014 03:33 AM > To: Andrew Beekhof > Cc: linux clustering > Subject: Re: [Linux-cluster] Cman (and corosync) starting before network interface is ready > > On 18/09/14 09:29, Andrew Beekhof wrote: >> >> On 18 Sep 2014, at 6:18 pm, Christine Caulfield wrote: >> >>> On 18/09/14 02:35, Andrew Beekhof wrote: >>>> >>>> On 18 Sep 2014, at 12:34 am, Vallevand, Mark K wrote: >>>> >>>>> Thanks. >>>>> >>>>> 1. I didn't know about two-node mode. Thanks. We are testing with two nodes and "crm configure property no-quorum-policy=ignore". When one node goes down, the other node continues clustering. This is the desired behavior. What will in cluster.conf do? 
>>>> >>>> I was all set to be a smart-ass and say 'man cluster.conf', but the joke is on me as my colleagues do not appear to have documented it anywhere. >>>> Chrissie: Can you elaborate on the details here please? >>>> >>> >>> it's documented in the cman(5) man page. The entries in cluster.conf only cover the general parts that are not specific to any subsystem. So corosync items are documented in the corosync man page and cman ones in the cman man page etc. >> >> Ah! Good to know. >> >> Two node clusters >> Ordinarily, the loss of quorum after one out of two nodes fails will prevent the remaining node from continuing (if both nodes have one vote.) Special configuration options can be set to allow the one remaining node to continue operating if the other >> fails. To do this only two nodes, each with one vote, can be defined in cluster.conf. The two_node and expected_votes values must then be set to 1 in the cman section as follows. >> >> >> >> >> One thing thats not clear to me is what happens when a single node comes up and can only see itself. >> Does it get quorum or is it like wait-for-all in corosync2? >> > > > There's no wait_for_all in cman. The first node up will attempt (after > fence_join_delay) the other node in an attempt to stop a split brain. > > This is one of several reasons why we insist that the fencing is on a > separate network to heartbeat on a two_node cluster. > > > Chrissie > >>> >>> Chrissie >>> >>> >>>> (Short version, it should do what you want) >>>> >>>>> 2. Yes, fencing is part of our plan, but not at this time. In the configurations we are testing, fencing is a RFPITA. >>>>> 3. We could move up. We like Ubuntu 12.04 LTS because it is Long Term Support. But, we've upgraded packages as necessary. So, if we move to the latest stable Pacemaker, Cman and Corosync (and others?), how could this help? >>>> >>>> Well you might get 3+ years of bug fixes and performance improvements :-) >>>> >>>>> >>>>> Is there a way to get the clustering software to 'poll' faster? I mean, this NIC stalling at boot time only lasts about 2 seconds beyond the start of corosync. But, its 30 more seconds before the nodes see each other. I see lots of parameters in the totem directive that seem interesting. Would any of them be appropriate. >>>> >>>> Is there not a way to tell upstart not to start the cluster until the network is up? >>>> >>>>> >>>>> Andrew: Thanks for the prompt response. >>>>> >>>>> >>>>> Regards. >>>>> Mark K Vallevand >>>>> >>>>> "If there are no dogs in Heaven, then when I die I want to go where they went." >>>>> -Will Rogers >>>>> >>>>> THIS COMMUNICATION MAY CONTAIN CONFIDENTIAL AND/OR OTHERWISE PROPRIETARY MATERIAL and is thus for use only by the intended recipient. If you received this in error, please contact the sender and delete the e-mail and its attachments from all computers. >>>>> >>>>> >>>>> -----Original Message----- >>>>> From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Andrew Beekhof >>>>> Sent: Tuesday, September 16, 2014 08:51 PM >>>>> To: linux clustering >>>>> Subject: Re: [Linux-cluster] Cman (and corosync) starting before network interface is ready >>>>> >>>>> >>>>> On 17 Sep 2014, at 7:20 am, Vallevand, Mark K wrote: >>>>> >>>>>> It looks like there is some odd delay in getting a network interface up and ready. So, when cman starts corosync, it can't get to the cluster. So, for a time, the node is a member of a cluster-of-one. The cluster-of-one begins starting resources. >>>>> >>>>> 1. 
enable two-node mode in cluster.conf (man page should indicate where/how) then disable no-quorum-policy=ignore >>>>> 2. configure fencing >>>>> 3. find a newer version of pacemaker, we're up to .12 now >>>>> >>>>>> A few seconds later, when the interface finally is up and ready, it takes about 30 more seconds for the cluster-of-one to finally rejoin the larger cluster. The doubly-started resources are sorted out and all ends up OK. >>>>>> >>>>>> Now, this is not a good thing to have these particular resources running twice. I'd really like the clustering software to behave better. But, I'm not sure what 'behave better' would be. >>>>>> >>>>>> Is it possible to introduce a delay into cman or corosync startup? Is that even wise? >>>>>> Is there a parameter to get the clustering software to poll more often when it can't rejoin the cluster? >>>>>> >>>>>> Any suggestions would be welcome. >>>>>> >>>>>> Running Ubuntu 12.04 LTS. Pacemaker 1.1.6. Cman 3.1.7. Corosync 1.4.2. >>>>>> >>>>>> Regards. >>>>>> Mark K Vallevand >>>>>> "If there are no dogs in Heaven, then when I die I want to go where they went." >>>>>> -Will Rogers >>>>>> >>>>>> THIS COMMUNICATION MAY CONTAIN CONFIDENTIAL AND/OR OTHERWISE PROPRIETARY MATERIAL and is thus for use only by the intended recipient. If you received this in error, please contact the sender and delete the e-mail and its attachments from all computers. >>>>>> -- >>>>>> Linux-cluster mailing list >>>>>> Linux-cluster at redhat.com >>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>> >>>>> >>>>> -- >>>>> Linux-cluster mailing list >>>>> Linux-cluster at redhat.com >>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>> >>> >> > -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster From fmdlc.unix at gmail.com Thu Sep 18 13:47:28 2014 From: fmdlc.unix at gmail.com (Facundo M. de la Cruz) Date: Thu, 18 Sep 2014 10:47:28 -0300 Subject: [Linux-cluster] Cman (and corosync) starting before network interface is ready In-Reply-To: <541ADD51.9020201@redhat.com> References: <99C8B2929B39C24493377AC7A121E21FFEA7CE969D@USEA-EXCH8.na.uis.unisys.com> <52D657CE-40C3-45C6-A50D-73D83CA5930F@beekhof.net> <99C8B2929B39C24493377AC7A121E21FFEA7CE9CF9@USEA-EXCH8.na.uis.unisys.com> <541A9566.8060004@redhat.com> <13486205-A1B5-4135-B61B-98326500BECC@beekhof.net> <541A98D7.8090306@redhat.com> <99C8B2929B39C24493377AC7A121E21FFEA7D915A6@USEA-EXCH8.na.uis.unisys.com> <541ADD51.9020201@redhat.com> Message-ID: <1EB11906-6312-4D7D-B8B3-FE6F18562469@gmail.com> On Sep 18, 2014, at 10:25, Christine Caulfield wrote: > On 18/09/14 14:09, Vallevand, Mark K wrote: >> Hmmm. I'm still curious what two_node exactly does. >> >> In my testing, the clustering software comes up before the network is completely ready. (Why? That's another day.) >> >> With just no-quorum-policy=ignore, regardless of the fence_join_delay value, the rebooted node fences the other node and starts up all split-brain. It takes about 30 seconds or so after the network is ready for the split brain to be detected. >> >> With no-quorum-policy=ignore and two_node="1" expected_votes="1", regardless of the fence_join_delay value, the rebooted node fences the other node, but as soon as the network is ready the other node joins the network and there is no split-brain. >> >> I'm happy that things are working, but I'm still curious for some idea about what two_node does. 
>> >> > > two_node is simply to allow a 2 node cluster to remain quorate when one node is unavailable - it's a special case that allows the cluster to remain running when quorum is 1. It requires hardware fencing to make sure that one node is fenced and can't do any harm to the remaining node. It's nothing more complicated than that. > > If is set (and this is not directly part of two_node, but useful to know for all clusters) then the other node will not be fenced for x seconds after the first node starts up, which should take care of your fence trouble. > > Sorry, I misnamed the parameter in my first email. > > Chrissie > >> Regards. >> Mark K Vallevand >> >> "If there are no dogs in Heaven, then when I die I want to go where they went." >> -Will Rogers >> >> THIS COMMUNICATION MAY CONTAIN CONFIDENTIAL AND/OR OTHERWISE PROPRIETARY MATERIAL and is thus for use only by the intended recipient. If you received this in error, please contact the sender and delete the e-mail and its attachments from all computers. >> >> >> -----Original Message----- >> From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Christine Caulfield >> Sent: Thursday, September 18, 2014 03:33 AM >> To: Andrew Beekhof >> Cc: linux clustering >> Subject: Re: [Linux-cluster] Cman (and corosync) starting before network interface is ready >> >> On 18/09/14 09:29, Andrew Beekhof wrote: >>> >>> On 18 Sep 2014, at 6:18 pm, Christine Caulfield wrote: >>> >>>> On 18/09/14 02:35, Andrew Beekhof wrote: >>>>> >>>>> On 18 Sep 2014, at 12:34 am, Vallevand, Mark K wrote: >>>>> >>>>>> Thanks. >>>>>> >>>>>> 1. I didn't know about two-node mode. Thanks. We are testing with two nodes and "crm configure property no-quorum-policy=ignore". When one node goes down, the other node continues clustering. This is the desired behavior. What will in cluster.conf do? >>>>> >>>>> I was all set to be a smart-ass and say 'man cluster.conf', but the joke is on me as my colleagues do not appear to have documented it anywhere. >>>>> Chrissie: Can you elaborate on the details here please? >>>>> >>>> >>>> it's documented in the cman(5) man page. The entries in cluster.conf only cover the general parts that are not specific to any subsystem. So corosync items are documented in the corosync man page and cman ones in the cman man page etc. >>> >>> Ah! Good to know. >>> >>> Two node clusters >>> Ordinarily, the loss of quorum after one out of two nodes fails will prevent the remaining node from continuing (if both nodes have one vote.) Special configuration options can be set to allow the one remaining node to continue operating if the other >>> fails. To do this only two nodes, each with one vote, can be defined in cluster.conf. The two_node and expected_votes values must then be set to 1 in the cman section as follows. >>> >>> >>> >>> >>> One thing thats not clear to me is what happens when a single node comes up and can only see itself. >>> Does it get quorum or is it like wait-for-all in corosync2? >>> >> >> >> There's no wait_for_all in cman. The first node up will attempt (after >> fence_join_delay) the other node in an attempt to stop a split brain. >> >> This is one of several reasons why we insist that the fencing is on a >> separate network to heartbeat on a two_node cluster. >> >> >> Chrissie >> >>>> >>>> Chrissie >>>> >>>> >>>>> (Short version, it should do what you want) >>>>> >>>>>> 2. Yes, fencing is part of our plan, but not at this time. In the configurations we are testing, fencing is a RFPITA. 
>>>>>> 3. We could move up. We like Ubuntu 12.04 LTS because it is Long Term Support. But, we've upgraded packages as necessary. So, if we move to the latest stable Pacemaker, Cman and Corosync (and others?), how could this help? >>>>> >>>>> Well you might get 3+ years of bug fixes and performance improvements :-) >>>>> >>>>>> >>>>>> Is there a way to get the clustering software to 'poll' faster? I mean, this NIC stalling at boot time only lasts about 2 seconds beyond the start of corosync. But, its 30 more seconds before the nodes see each other. I see lots of parameters in the totem directive that seem interesting. Would any of them be appropriate. >>>>> >>>>> Is there not a way to tell upstart not to start the cluster until the network is up? >>>>> >>>>>> >>>>>> Andrew: Thanks for the prompt response. >>>>>> >>>>>> >>>>>> Regards. >>>>>> Mark K Vallevand >>>>>> >>>>>> "If there are no dogs in Heaven, then when I die I want to go where they went." >>>>>> -Will Rogers >>>>>> >>>>>> THIS COMMUNICATION MAY CONTAIN CONFIDENTIAL AND/OR OTHERWISE PROPRIETARY MATERIAL and is thus for use only by the intended recipient. If you received this in error, please contact the sender and delete the e-mail and its attachments from all computers. >>>>>> >>>>>> >>>>>> -----Original Message----- >>>>>> From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Andrew Beekhof >>>>>> Sent: Tuesday, September 16, 2014 08:51 PM >>>>>> To: linux clustering >>>>>> Subject: Re: [Linux-cluster] Cman (and corosync) starting before network interface is ready >>>>>> >>>>>> >>>>>> On 17 Sep 2014, at 7:20 am, Vallevand, Mark K wrote: >>>>>> >>>>>>> It looks like there is some odd delay in getting a network interface up and ready. So, when cman starts corosync, it can't get to the cluster. So, for a time, the node is a member of a cluster-of-one. The cluster-of-one begins starting resources. >>>>>> >>>>>> 1. enable two-node mode in cluster.conf (man page should indicate where/how) then disable no-quorum-policy=ignore >>>>>> 2. configure fencing >>>>>> 3. find a newer version of pacemaker, we're up to .12 now >>>>>> >>>>>>> A few seconds later, when the interface finally is up and ready, it takes about 30 more seconds for the cluster-of-one to finally rejoin the larger cluster. The doubly-started resources are sorted out and all ends up OK. >>>>>>> >>>>>>> Now, this is not a good thing to have these particular resources running twice. I'd really like the clustering software to behave better. But, I'm not sure what 'behave better' would be. >>>>>>> >>>>>>> Is it possible to introduce a delay into cman or corosync startup? Is that even wise? >>>>>>> Is there a parameter to get the clustering software to poll more often when it can't rejoin the cluster? >>>>>>> >>>>>>> Any suggestions would be welcome. >>>>>>> >>>>>>> Running Ubuntu 12.04 LTS. Pacemaker 1.1.6. Cman 3.1.7. Corosync 1.4.2. >>>>>>> >>>>>>> Regards. >>>>>>> Mark K Vallevand >>>>>>> "If there are no dogs in Heaven, then when I die I want to go where they went." >>>>>>> -Will Rogers >>>>>>> >>>>>>> THIS COMMUNICATION MAY CONTAIN CONFIDENTIAL AND/OR OTHERWISE PROPRIETARY MATERIAL and is thus for use only by the intended recipient. If you received this in error, please contact the sender and delete the e-mail and its attachments from all computers. 
>>>>>>> -- >>>>>>> Linux-cluster mailing list >>>>>>> Linux-cluster at redhat.com >>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>> >>>>>> >>>>>> -- >>>>>> Linux-cluster mailing list >>>>>> Linux-cluster at redhat.com >>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>> >>>> >>> >> > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster

Hi, let me add something more: The "fence_join_delay" option is used to avoid a condition called "dual fencing", which can leave your cluster entirely powered down. "Dual fencing" can occur if you have a two-node cluster with no quorum and you are using fencing through IPMI agents: both nodes would try to execute fencing actions against each other using the IPMI interface. "fence_join_delay" works by adding a countdown so that the remaining node does not execute fencing actions for X seconds; only the first node can send IPMI commands to the other one.

Best regards.

-- Facundo M. de la Cruz (tty0) Information Technology Specialist Movil: +54 911 56528301 http://codigounix.blogspot.com/ http://twitter.com/_tty0 GPG fingerprint: DF2F 514A 5167 00F5 C753 BF3B D797 C8E1 5726 0789 "Programming today is a race between software engineers striving to build bigger and better idiot-proof programs, and the Universe trying to produce bigger and better idiots. So far, the Universe is winning." - Rich Cook

-------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 801 bytes Desc: Message signed with OpenPGP using GPGMail URL:

From Mark.Vallevand at UNISYS.com Thu Sep 18 13:59:44 2014 From: Mark.Vallevand at UNISYS.com (Vallevand, Mark K) Date: Thu, 18 Sep 2014 08:59:44 -0500 Subject: [Linux-cluster] Cman (and corosync) starting before network interface is ready In-Reply-To: <1EB11906-6312-4D7D-B8B3-FE6F18562469@gmail.com> References: <99C8B2929B39C24493377AC7A121E21FFEA7CE969D@USEA-EXCH8.na.uis.unisys.com> <52D657CE-40C3-45C6-A50D-73D83CA5930F@beekhof.net> <99C8B2929B39C24493377AC7A121E21FFEA7CE9CF9@USEA-EXCH8.na.uis.unisys.com> <541A9566.8060004@redhat.com> <13486205-A1B5-4135-B61B-98326500BECC@beekhof.net> <541A98D7.8090306@redhat.com> <99C8B2929B39C24493377AC7A121E21FFEA7D915A6@USEA-EXCH8.na.uis.unisys.com> <541ADD51.9020201@redhat.com> <1EB11906-6312-4D7D-B8B3-FE6F18562469@gmail.com> Message-ID: <99C8B2929B39C24493377AC7A121E21FFEA7D916D5@USEA-EXCH8.na.uis.unisys.com>

Thanks! This is good to know. We are looking at fencing via IPMI in a future release.

Regards. Mark K Vallevand "If there are no dogs in Heaven, then when I die I want to go where they went." -Will Rogers

THIS COMMUNICATION MAY CONTAIN CONFIDENTIAL AND/OR OTHERWISE PROPRIETARY MATERIAL and is thus for use only by the intended recipient. If you received this in error, please contact the sender and delete the e-mail and its attachments from all computers.

-----Original Message----- From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Facundo M. de la Cruz Sent: Thursday, September 18, 2014 08:47 AM To: linux clustering Subject: Re: [Linux-cluster] Cman (and corosync) starting before network interface is ready

On Sep 18, 2014, at 10:25, Christine Caulfield wrote: > On 18/09/14 14:09, Vallevand, Mark K wrote: >> Hmmm. I'm still curious what two_node exactly does. >> >> In my testing, the clustering software comes up before the network is completely ready. (Why? That's another day.)
>> >> With just no-quorum-policy=ignore, regardless of the fence_join_delay value, the rebooted node fences the other node and starts up all split-brain. It takes about 30 seconds or so after the network is ready for the split brain to be detected. >> >> With no-quorum-policy=ignore and two_node="1" expected_votes="1", regardless of the fence_join_delay value, the rebooted node fences the other node, but as soon as the network is ready the other node joins the network and there is no split-brain. >> >> I'm happy that things are working, but I'm still curious for some idea about what two_node does. >> >> > > two_node is simply to allow a 2 node cluster to remain quorate when one node is unavailable - it's a special case that allows the cluster to remain running when quorum is 1. It requires hardware fencing to make sure that one node is fenced and can't do any harm to the remaining node. It's nothing more complicated than that. > > If is set (and this is not directly part of two_node, but useful to know for all clusters) then the other node will not be fenced for x seconds after the first node starts up, which should take care of your fence trouble. > > Sorry, I misnamed the parameter in my first email. > > Chrissie > >> Regards. >> Mark K Vallevand >> >> "If there are no dogs in Heaven, then when I die I want to go where they went." >> -Will Rogers >> >> THIS COMMUNICATION MAY CONTAIN CONFIDENTIAL AND/OR OTHERWISE PROPRIETARY MATERIAL and is thus for use only by the intended recipient. If you received this in error, please contact the sender and delete the e-mail and its attachments from all computers. >> >> >> -----Original Message----- >> From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Christine Caulfield >> Sent: Thursday, September 18, 2014 03:33 AM >> To: Andrew Beekhof >> Cc: linux clustering >> Subject: Re: [Linux-cluster] Cman (and corosync) starting before network interface is ready >> >> On 18/09/14 09:29, Andrew Beekhof wrote: >>> >>> On 18 Sep 2014, at 6:18 pm, Christine Caulfield wrote: >>> >>>> On 18/09/14 02:35, Andrew Beekhof wrote: >>>>> >>>>> On 18 Sep 2014, at 12:34 am, Vallevand, Mark K wrote: >>>>> >>>>>> Thanks. >>>>>> >>>>>> 1. I didn't know about two-node mode. Thanks. We are testing with two nodes and "crm configure property no-quorum-policy=ignore". When one node goes down, the other node continues clustering. This is the desired behavior. What will in cluster.conf do? >>>>> >>>>> I was all set to be a smart-ass and say 'man cluster.conf', but the joke is on me as my colleagues do not appear to have documented it anywhere. >>>>> Chrissie: Can you elaborate on the details here please? >>>>> >>>> >>>> it's documented in the cman(5) man page. The entries in cluster.conf only cover the general parts that are not specific to any subsystem. So corosync items are documented in the corosync man page and cman ones in the cman man page etc. >>> >>> Ah! Good to know. >>> >>> Two node clusters >>> Ordinarily, the loss of quorum after one out of two nodes fails will prevent the remaining node from continuing (if both nodes have one vote.) Special configuration options can be set to allow the one remaining node to continue operating if the other >>> fails. To do this only two nodes, each with one vote, can be defined in cluster.conf. The two_node and expected_votes values must then be set to 1 in the cman section as follows. 
>>> >>> >>> >>> >>> One thing thats not clear to me is what happens when a single node comes up and can only see itself. >>> Does it get quorum or is it like wait-for-all in corosync2? >>> >> >> >> There's no wait_for_all in cman. The first node up will attempt (after >> fence_join_delay) the other node in an attempt to stop a split brain. >> >> This is one of several reasons why we insist that the fencing is on a >> separate network to heartbeat on a two_node cluster. >> >> >> Chrissie >> >>>> >>>> Chrissie >>>> >>>> >>>>> (Short version, it should do what you want) >>>>> >>>>>> 2. Yes, fencing is part of our plan, but not at this time. In the configurations we are testing, fencing is a RFPITA. >>>>>> 3. We could move up. We like Ubuntu 12.04 LTS because it is Long Term Support. But, we've upgraded packages as necessary. So, if we move to the latest stable Pacemaker, Cman and Corosync (and others?), how could this help? >>>>> >>>>> Well you might get 3+ years of bug fixes and performance improvements :-) >>>>> >>>>>> >>>>>> Is there a way to get the clustering software to 'poll' faster? I mean, this NIC stalling at boot time only lasts about 2 seconds beyond the start of corosync. But, its 30 more seconds before the nodes see each other. I see lots of parameters in the totem directive that seem interesting. Would any of them be appropriate. >>>>> >>>>> Is there not a way to tell upstart not to start the cluster until the network is up? >>>>> >>>>>> >>>>>> Andrew: Thanks for the prompt response. >>>>>> >>>>>> >>>>>> Regards. >>>>>> Mark K Vallevand >>>>>> >>>>>> "If there are no dogs in Heaven, then when I die I want to go where they went." >>>>>> -Will Rogers >>>>>> >>>>>> THIS COMMUNICATION MAY CONTAIN CONFIDENTIAL AND/OR OTHERWISE PROPRIETARY MATERIAL and is thus for use only by the intended recipient. If you received this in error, please contact the sender and delete the e-mail and its attachments from all computers. >>>>>> >>>>>> >>>>>> -----Original Message----- >>>>>> From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Andrew Beekhof >>>>>> Sent: Tuesday, September 16, 2014 08:51 PM >>>>>> To: linux clustering >>>>>> Subject: Re: [Linux-cluster] Cman (and corosync) starting before network interface is ready >>>>>> >>>>>> >>>>>> On 17 Sep 2014, at 7:20 am, Vallevand, Mark K wrote: >>>>>> >>>>>>> It looks like there is some odd delay in getting a network interface up and ready. So, when cman starts corosync, it can't get to the cluster. So, for a time, the node is a member of a cluster-of-one. The cluster-of-one begins starting resources. >>>>>> >>>>>> 1. enable two-node mode in cluster.conf (man page should indicate where/how) then disable no-quorum-policy=ignore >>>>>> 2. configure fencing >>>>>> 3. find a newer version of pacemaker, we're up to .12 now >>>>>> >>>>>>> A few seconds later, when the interface finally is up and ready, it takes about 30 more seconds for the cluster-of-one to finally rejoin the larger cluster. The doubly-started resources are sorted out and all ends up OK. >>>>>>> >>>>>>> Now, this is not a good thing to have these particular resources running twice. I'd really like the clustering software to behave better. But, I'm not sure what 'behave better' would be. >>>>>>> >>>>>>> Is it possible to introduce a delay into cman or corosync startup? Is that even wise? >>>>>>> Is there a parameter to get the clustering software to poll more often when it can't rejoin the cluster? 
>>>>>>> >>>>>>> Any suggestions would be welcome. >>>>>>> >>>>>>> Running Ubuntu 12.04 LTS. Pacemaker 1.1.6. Cman 3.1.7. Corosync 1.4.2. >>>>>>> >>>>>>> Regards. >>>>>>> Mark K Vallevand >>>>>>> "If there are no dogs in Heaven, then when I die I want to go where they went." >>>>>>> -Will Rogers >>>>>>> >>>>>>> THIS COMMUNICATION MAY CONTAIN CONFIDENTIAL AND/OR OTHERWISE PROPRIETARY MATERIAL and is thus for use only by the intended recipient. If you received this in error, please contact the sender and delete the e-mail and its attachments from all computers. >>>>>>> -- >>>>>>> Linux-cluster mailing list >>>>>>> Linux-cluster at redhat.com >>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>>> >>>>>> >>>>>> -- >>>>>> Linux-cluster mailing list >>>>>> Linux-cluster at redhat.com >>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster >>>>> >>>> >>> >> > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster

Hi, let me add something more: The "fence_join_delay" option is used to avoid a condition called "dual fencing", which can leave your cluster entirely powered down. "Dual fencing" can occur if you have a two-node cluster with no quorum and you are using fencing through IPMI agents: both nodes would try to execute fencing actions against each other using the IPMI interface. "fence_join_delay" works by adding a countdown so that the remaining node does not execute fencing actions for X seconds; only the first node can send IPMI commands to the other one.

Best regards.

-- Facundo M. de la Cruz (tty0) Information Technology Specialist Movil: +54 911 56528301 http://codigounix.blogspot.com/ http://twitter.com/_tty0 GPG fingerprint: DF2F 514A 5167 00F5 C753 BF3B D797 C8E1 5726 0789 "Programming today is a race between software engineers striving to build bigger and better idiot-proof programs, and the Universe trying to produce bigger and better idiots. So far, the Universe is winning." - Rich Cook

From lipson12 at yahoo.com Sun Sep 21 06:06:03 2014 From: lipson12 at yahoo.com (Kaisar Ahmed Khan) Date: Sat, 20 Sep 2014 23:06:03 -0700 Subject: [Linux-cluster] GFS2 mount problem Message-ID: <1411279563.11775.YahooMailNeo@web141202.mail.bf1.yahoo.com>

Dear All, I have been experiencing a problem for a long time with GFS2 in a three node cluster.

A short brief of my scenario: all three nodes run as KVM guests on a single host, and storage is accessed over iSCSI on all three nodes. One 50GB LUN is initiated on all three nodes and configured with a GFS2 file system. The GFS2 file system is mounted on all three nodes persistently via fstab.

The problem is: when I reboot or fence any machine, I find the GFS2 file system is not mounted. It gets mounted after running # mount -a.

What is the possible cause of this problem?

Thanks Kaisar

-------------- next part -------------- An HTML attachment was scrubbed... URL:

From vijaykakkars at gmail.com Sun Sep 21 13:53:36 2014 From: vijaykakkars at gmail.com (Vijay Kakkar) Date: Sun, 21 Sep 2014 19:23:36 +0530 Subject: [Linux-cluster] GFS2 mount problem In-Reply-To: <1411279563.11775.YahooMailNeo@web141202.mail.bf1.yahoo.com> References: <1411279563.11775.YahooMailNeo@web141202.mail.bf1.yahoo.com> Message-ID:

Hi, Can you share the mount point information of /etc/fstab ?

On Sun, Sep 21, 2014 at 11:36 AM, Kaisar Ahmed Khan wrote: > > Dear All, > > I have been experiencing a problem for long time in GFS2 with three node > cluster. > > Short brief about my scenario > All three nodes in a Host with KVM technology.
storage accessing by iSCSI > on all three nodes. > One 50GB LUN initiated on all three nodes , and configured GFS2 file > system . > GFS file system mounted at all three nodes persistently by fstab. > > Problem is: > When I reboot/ fence any machine , I found GFS2 file system not mounted . > it got mounted after applying # mount -a Command . > > What possible cause of this problem. ? > > Thanks > Kaisar > > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster >

-- Regards, *Vijay Kakkar - RHC{E,SS,VA,DS,A,I,X}* Techgrills Systems Pvt. Ltd. E4,3rd Floor, South EX Part I, New Delhi,110049 011-46521313 | +919999103657 Singapore: +6593480537 Australia: +61426044312 http://lnkd.in/bnj2VUU http://www.facebook.com/techgrills

-------------- next part -------------- An HTML attachment was scrubbed... URL:

From fmdlc.unix at gmail.com Sun Sep 21 15:29:32 2014 From: fmdlc.unix at gmail.com (Facundo M. de la Cruz) Date: Sun, 21 Sep 2014 12:29:32 -0300 Subject: [Linux-cluster] GFS2 mount problem In-Reply-To: References: <1411279563.11775.YahooMailNeo@web141202.mail.bf1.yahoo.com> Message-ID: <4DD3E16E-22B0-4370-A226-617A8E94300D@gmail.com>

Hi, It looks like latency in the clvmd/IP resource initialization, or the node didn't join the fencing domain. While the cluster is booting, the operating system tries to mount the filesystem but can't find it, because the clvmd or IP service hasn't started yet or there is some latency while configuring fencing. Just check the boot logs (/var/log/boot.log, /var/log/messages and dmesg) for errors or information. If you configured clvmd as a cluster resource, you can try (just for testing) starting clvmd through chkconfig instead of as a cluster resource, so you reduce the number of services starting with the cluster. Check your fstab too.

Bests

Sent from my iPhone

> On Sep 21, 2014, at 10:53, Vijay Kakkar wrote: > > Hi, > > Can you share the mount point information of /etc/fstab ? > >> On Sun, Sep 21, 2014 at 11:36 AM, Kaisar Ahmed Khan wrote: >> >> Dear All, >> >> I have been experiencing a problem for long time in GFS2 with three node cluster. >> >> Short brief about my scenario >> All three nodes in a Host with KVM technology. storage accessing by iSCSI on all three nodes. >> One 50GB LUN initiated on all three nodes , and configured GFS2 file system . >> GFS file system mounted at all three nodes persistently by fstab. >> >> Problem is: >> When I reboot/ fence any machine , I found GFS2 file system not mounted . it got mounted after applying # mount ?a Command . >> >> What possible cause of this problem. ? >> >> Thanks >> Kaisar >> >> >> >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster > > > > -- > Regards, > > Vijay Kakkar - RHC{E,SS,VA,DS,A,I,X} > > Techgrills Systems Pvt. Ltd. > E4,3rd Floor, > South EX Part I, > New Delhi,110049 > 011-46521313 | +919999103657 > Singapore: +6593480537 > Australia: +61426044312 > http://lnkd.in/bnj2VUU > http://www.facebook.com/techgrills > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster

-------------- next part -------------- An HTML attachment was scrubbed... URL:

From fmdlc.unix at gmail.com Sun Sep 21 15:49:18 2014 From: fmdlc.unix at gmail.com (Facundo M.
de la Cruz) Date: Sun, 21 Sep 2014 12:49:18 -0300 Subject: Re: [Linux-cluster] GFS2 mount problem In-Reply-To: References: <1411279563.11775.YahooMailNeo@web141202.mail.bf1.yahoo.com> Message-ID: <20FA64F1-345E-459E-BCCE-0F6687FA44AC@gmail.com>

Hi again, Sorry, but did you add the _netdev option to the /etc/fstab file?

Bests.

-- Facundo M. de la Cruz (tty0) Information Technology Specialist Movil: +54 911 56528301 http://www.codigounix.com.ar/ GPG fingerprint: DF2F 514A 5167 00F5 C753 BF3B D797 C8E1 5726 0789 "Programming today is a race between software engineers striving to build bigger and better idiot-proof programs, and the Universe trying to produce bigger and better idiots. So far, the Universe is winning." - Rich Cook

On Sep 21, 2014, at 10:53, Vijay Kakkar wrote: > Hi, > > Can you share the mount point information of /etc/fstab ?
> > On Sun, Sep 21, 2014 at 11:36 AM, Kaisar Ahmed Khan > wrote: >> >> >> Dear All, >> >> I have been experiencing a problem for long time in GFS2 with three node >> cluster. >> >> Short brief about my scenario >> All three nodes in a Host with KVM technology. storage accessing by iSCSI >> on all three nodes. >> One 50GB LUN initiated on all three nodes , and configured GFS2 file >> system . >> GFS file system mounted at all three nodes persistently by fstab. >> >> Problem is: >> When I reboot/ fence any machine , I found GFS2 file system not mounted . >> it got mounted after applying # mount ?a Command . >> >> What possible cause of this problem. ? >> >> Thanks >> Kaisar >> >> >> >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster > > > > > -- > Regards, > > Vijay Kakkar - RHC{E,SS,VA,DS,A,I,X} > > Techgrills Systems Pvt. Ltd. > E4,3rd Floor, > South EX Part I, > New Delhi,110049 > 011-46521313 | +919999103657 > Singapore: +6593480537 > Australia: +61426044312 > http://lnkd.in/bnj2VUU > http://www.facebook.com/techgrills > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -- esta es mi vida e me la vivo hasta que dios quiera From wferi at niif.hu Mon Sep 22 08:24:37 2014 From: wferi at niif.hu (Ferenc Wagner) Date: Mon, 22 Sep 2014 10:24:37 +0200 Subject: [Linux-cluster] ordering scores and kinds Message-ID: <87mw9s2hbu.fsf@lant.ki.iif.hu> Hi, http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-resource-ordering.html says that optional ordering is achieved by setting the "kind" attribute to "Optional". However, the next section http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/_advisory_ordering.html says that advisory ordering is achieved by setting the "score" attribute to 0. Is there any difference between an optional and an advisory ordering constraint? How do nonzero score values influence cluster behaviour, if at all? Or is the kind attribute intended to replace all score settings on ordering constraints? -- Thanks, Feri. From lipson12 at yahoo.com Mon Sep 22 09:23:36 2014 From: lipson12 at yahoo.com (Kaisar Ahmed Khan) Date: Mon, 22 Sep 2014 02:23:36 -0700 Subject: [Linux-cluster] GFS2 mount problem In-Reply-To: References: <1411279563.11775.YahooMailNeo@web141202.mail.bf1.yahoo.com> <20FA64F1-345E-459E-BCCE-0F6687FA44AC@gmail.com> Message-ID: <1411377816.33666.YahooMailNeo@web141204.mail.bf1.yahoo.com> /dev/vg1/lv1 /gfsha gfs2 quota=on,acl 0 0 Regards, kaisar On Monday, September 22, 2014 2:29 AM, emmanuel segura wrote: i think too, using the _netdev resolve the issue 2014-09-21 17:49 GMT+02:00 Facundo M. de la Cruz : > Hi again, > > Sorry, but do you added the _netdev option to /etc/fstab file?. > > Bests. > -- > Facundo M. de la Cruz (tty0) > Information Technology Specialist > Movil: +54 911 56528301 > > http://www.codigounix.com.ar/ > > GPG fingerprint: DF2F 514A 5167 00F5 C753 BF3B D797 C8E1 5726 0789 > > "Programming today is a race between software engineers striving to build > bigger and better idiot-proof programs, and the Universe trying to produce > bigger and better idiots. So far, the Universe is winning.? - Rich Cook > > On Sep 21, 2014, at 10:53, Vijay Kakkar wrote: > > Hi, > > Can you share the mount point information of /etc/fstab ? 
> > On Sun, Sep 21, 2014 at 11:36 AM, Kaisar Ahmed Khan > wrote: >> >> >> Dear All, >> >> I have been experiencing a problem for long time in GFS2 with three node >> cluster. >> >> Short brief about my scenario >> All three nodes in a Host with KVM technology. storage accessing by iSCSI >> on all three nodes. >> One 50GB LUN initiated on all three nodes , and configured GFS2 file >> system . >> GFS file system mounted at all three nodes persistently by fstab. >> >> Problem is: >> When I reboot/ fence any machine , I found GFS2 file system not mounted . >> it got mounted after applying # mount ?a Command . >> >> What possible cause of this problem. ? >> >> Thanks >> Kaisar >> >> >> >> >> -- >> Linux-cluster mailing list >> Linux-cluster at redhat.com >> https://www.redhat.com/mailman/listinfo/linux-cluster > > > > > -- > Regards, > > Vijay Kakkar - RHC{E,SS,VA,DS,A,I,X} > > Techgrills Systems Pvt. Ltd. > E4,3rd Floor, > South EX Part I, > New Delhi,110049 > 011-46521313 | +919999103657 > Singapore: +6593480537 > Australia: +61426044312 > http://lnkd.in/bnj2VUU > http://www.facebook.com/techgrills > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -- esta es mi vida e me la vivo hasta que dios quiera -- Linux-cluster mailing list Linux-cluster at redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- An HTML attachment was scrubbed... URL: From lists at alteeve.ca Mon Sep 22 18:25:18 2014 From: lists at alteeve.ca (Digimer) Date: Mon, 22 Sep 2014 14:25:18 -0400 Subject: [Linux-cluster] [Pacemaker] [RFC] Organizing HA Summit 2015 In-Reply-To: <540D853F.3090109@redhat.com> References: <540D853F.3090109@redhat.com> Message-ID: <5420698E.7050206@alteeve.ca> On 08/09/14 06:30 AM, Fabio M. Di Nitto wrote: > All, > > it's been almost 6 years since we had a face to face meeting for all > developers and vendors involved in Linux HA. > > I'd like to try and organize a new event and piggy-back with DevConf in > Brno [1]. > > DevConf will start Friday the 6th of Feb 2015 in Red Hat Brno offices. > > My suggestion would be to have a 2 days dedicated HA summit the 4th and > the 5th of February. > > The goal for this meeting is to, beside to get to know each other and > all social aspect of those events, tune the directions of the various HA > projects and explore common areas of improvements. > > I am also very open to the idea of extending to 3 days, 1 one dedicated > to customers/users and 2 dedicated to developers, by starting the 3rd. > > Thoughts? > > Fabio > > PS Please hit reply all or include me in CC just to make sure I'll see > an answer :) > > [1] http://devconf.cz/ How is this looking? -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? 
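A minimal sketch of the fstab change suggested in the GFS2 thread above, assuming the same device and mount point Kaisar posted (/dev/vg1/lv1 on /gfsha) and the RHEL 6-style init scripts discussed there: adding _netdev marks the filesystem as one that needs the network (and the cluster iSCSI/clvmd services), so the early boot-time mount -a skips it and it is mounted later, once the network and cluster services are up:

   /dev/vg1/lv1   /gfsha   gfs2   _netdev,quota=on,acl   0 0
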
From andrew at beekhof.net Mon Sep 29 07:14:37 2014 From: andrew at beekhof.net (Andrew Beekhof) Date: Mon, 29 Sep 2014 17:14:37 +1000 Subject: [Linux-cluster] ordering scores and kinds In-Reply-To: <87mw9s2hbu.fsf@lant.ki.iif.hu> References: <87mw9s2hbu.fsf@lant.ki.iif.hu> Message-ID: <20426E00-7F09-4F54-B9AE-1ADDDC3CBDDC@beekhof.net> On 22 Sep 2014, at 6:24 pm, Ferenc Wagner wrote: > Hi, > > http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-resource-ordering.html > says that optional ordering is achieved by setting the "kind" attribute > to "Optional". However, the next section > http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/_advisory_ordering.html > says that advisory ordering is achieved by setting the "score" attribute > to 0. Is there any difference between an optional and an advisory > ordering constraint? No. kind=optional is the newer syntax that was intended to be more human friendly > How do nonzero score values influence cluster > behaviour, if at all? score > 0 is equivalent to kind=mandatory > Or is the kind attribute intended to replace all > score settings on ordering constraints? yes > -- > Thanks, > Feri. > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 841 bytes Desc: Message signed with OpenPGP using GPGMail URL: From wferi at niif.hu Tue Sep 30 19:23:17 2014 From: wferi at niif.hu (Ferenc Wagner) Date: Tue, 30 Sep 2014 21:23:17 +0200 Subject: [Linux-cluster] ordering scores and kinds In-Reply-To: <20426E00-7F09-4F54-B9AE-1ADDDC3CBDDC@beekhof.net> (Andrew Beekhof's message of "Mon, 29 Sep 2014 17:14:37 +1000") References: <87mw9s2hbu.fsf@lant.ki.iif.hu> <20426E00-7F09-4F54-B9AE-1ADDDC3CBDDC@beekhof.net> Message-ID: <87fvf8ewuy.fsf@lant.ki.iif.hu> Andrew Beekhof writes: > On 22 Sep 2014, at 6:24 pm, Ferenc Wagner wrote: > >> http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-resource-ordering.html >> says that optional ordering is achieved by setting the "kind" attribute >> to "Optional". However, the next section >> http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/_advisory_ordering.html >> says that advisory ordering is achieved by setting the "score" attribute >> to 0. Is there any difference between an optional and an advisory >> ordering constraint? > > No. kind=optional is the newer syntax that was intended to be more > human friendly > >> How do nonzero score values influence cluster behaviour, if at all? > > score > 0 is equivalent to kind=mandatory > >> Or is the kind attribute intended to replace all score settings on >> ordering constraints? > > yes Great, thanks! Please consider adding this info to the documentation (even knowing the history can be comforting, as the old syntax will never vanish from the internet). And please also specify what is the default kind if both of the kind and score attributes are missing. -- Thanks, Feri.
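To make the score/kind equivalence in the last exchange concrete, a minimal sketch of the two constraint forms (the id and the resource names rscA and rscB are placeholders, not taken from the thread): the older score-based syntax

   <rsc_order id="order-a-then-b" first="rscA" then="rscB" score="0"/>

expresses the same advisory ordering as the newer

   <rsc_order id="order-a-then-b" first="rscA" then="rscB" kind="Optional"/>

and, per Andrew's reply, any positive score behaves like kind="Mandatory".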