From mgrac at redhat.com Mon Apr 6 17:27:12 2015
From: mgrac at redhat.com (Marek "marx" Grac)
Date: Mon, 06 Apr 2015 19:27:12 +0200
Subject: [Linux-cluster] fence-agents-4.0.17 stable release
Message-ID: <5522C1F0.4010401@redhat.com>

Welcome to the fence-agents 4.0.17 release.

This release includes several bugfixes and features:

* HP iLO2 with firmware 2.27 has a broken implementation of TLS negotiation, and SSLv3 is disabled by default (POODLE attack). The option --tls1.0 (tls1.0 on stdin) was added to force the use of TLS v1.0, which allows users to use that firmware with fence agents.

* Fence agent for AMT: the password was not put correctly into the environment.

* Fix the login process on bladecenter, where 'last login' can occur in the message of the day and mislead the fence agent.

* The cipher for fence_ipmilan was previously set to 0. This turned out to be a poor default value, so the default value (3) of ipmitool is used instead.

The Git repository can be found at https://github.com/ClusterLabs/fence-agents/

The new source tarball can be downloaded here:
https://github.com/ClusterLabs/fence-agents/archive/v4.0.17.tar.gz

To report bugs or issues: https://bugzilla.redhat.com/

Would you like to meet the cluster team or members of its community? Join us on IRC (irc.freenode.net #linux-cluster) and share your experience with other system administrators or power users.

Thanks and congratulations to all the people who contributed to this milestone.

m,

From anprice at redhat.com Tue Apr 7 17:03:41 2015
From: anprice at redhat.com (Andrew Price)
Date: Tue, 07 Apr 2015 18:03:41 +0100
Subject: [Linux-cluster] gfs2-utils 3.1.8 released
Message-ID: <55240DED.5010608@redhat.com>

Hi,

I am happy to announce the 3.1.8 release of gfs2-utils. This release includes the following visible changes:

* Performance improvements in fsck.gfs2, mkfs.gfs2 and gfs2_edit savemeta.
* Better checking of journals, the jindex, system inodes and inode 'goal' values in fsck.gfs2.
* gfs2_jadd and gfs2_grow are now separate programs instead of symlinks to mkfs.gfs2.
* Improved test suite and related documentation.
* No longer clobbers the configure script's --sbindir option.
* No longer depends on perl.
* Various minor bug fixes and enhancements.

See below for a complete list of changes.
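As a quick orientation to the tools named in the list above, a minimal usage sketch; the device, mount point and lock table name are placeholders and are not taken from this announcement:

  # Create a clustered GFS2 filesystem with three journals (one per mounting node).
  mkfs.gfs2 -p lock_dlm -t mycluster:share0 -j 3 /dev/vg_san/lv_share0

  # gfs2_grow and gfs2_jadd are now standalone programs rather than
  # symlinks to mkfs.gfs2; both operate on a mounted filesystem.
  gfs2_grow /mnt/share0        # grow into space added to the underlying device
  gfs2_jadd -j 1 /mnt/share0   # add one more journal, e.g. for an extra node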
The source tarball is available from:

https://fedorahosted.org/released/gfs2-utils/gfs2-utils-3.1.8.tar.gz

Please test, and report bugs against the gfs2-utils component of Fedora rawhide:

https://bugzilla.redhat.com/enter_bug.cgi?product=Fedora&component=gfs2-utils&version=rawhide

Regards,
Andy

Changes since version 3.1.7:

Abhi Das (6):
      fsck.gfs2: fix broken i_goal values in inodes
      gfs2_convert: use correct i_goal values instead of zeros for inodes
      tests: test for incorrect inode i_goal values
      mkfs.gfs2: addendum to fix broken i_goal values in inodes
      gfs2_utils: more gfs2_convert i_goal fixes
      gfs2-utils: more fsck.gfs2 i_goal fixes

Andrew Price (58):
      gfs2-utils tests: Build unit tests with consistent cpp flags
      libgfs2: Move old rgrp layout functions into fsck.gfs2
      gfs2-utils build: Add test coverage option
      fsck.gfs2: Fix memory leak in pass2
      gfs2_convert: Fix potential memory leaks in adjust_inode
      gfs2_edit: Fix signed value used as array index in print_ld_blks
      gfs2_edit: Set umask before calling mkstemp in savemetaopen()
      gfs2_edit: Fix use-after-free in find_wrap_pt
      libgfs2: Clean up broken rgrp length check
      libgfs2: Remove superfluous NULL check from gfs2_rgrp_free
      libgfs2: Fail fd comparison if the fds are negative
      libgfs2: Fix check for O_RDONLY
      fsck.gfs2: Remove dead code from scan_inode_list
      mkfs.gfs2: Terminate lockproto and locktable strings explicitly
      libgfs2: Add generic field assignment and print functions
      gfs2_edit: Use metadata description to print and assign fields
      gfs2l: Switch to lgfs2_field_assign
      libgfs2: Remove device_name from struct gfs2_sbd
      libgfs2: Remove path_name from struct gfs2_sbd
      libgfs2: metafs_path improvements
      gfs2_grow: Don't use PATH_MAX in main_grow
      gfs2_jadd: Don't use fixed size buffers for paths
      libgfs2: Remove orig_journals from struct gfs2_sbd
      gfs2l: Check unchecked returns in openfs
      gfs2-utils configure: Fix exit with failure condition
      gfs2-utils configure: Remove checks for non-existent -W flags
      gfs2_convert: Don't use a fixed sized buffer for device path
      gfs2_edit: Add bounds checking for the journalN keyword
      libgfs2: Make find_good_lh and jhead_scan static
      Build gfs2_grow, gfs2_jadd and mkfs.gfs2 separately
      gfs2-utils: Honour --sbindir
      gfs2-utils configure: Use AC_HELP_STRING in help messages
      fsck.gfs2: Improve reporting of pass timings
      mkfs.gfs2: Revert default resource group size
      gfs2-utils tests: Add keywords to tests
      gfs2-utils tests: Shorten TESTSUITEFLAGS to TOPTS
      gfs2-utils tests: Improve docs
      gfs2-utils tests: Skip unit tests if check is not found
      gfs2-utils tests: Document usage of convenience macros
      fsck.gfs2: Fix 'initializer element is not constant' build error
      fsck.gfs2: Simplify bad_journalname
      gfs2-utils build: Add a configure script summary
      mkfs.gfs2: Remove unused declarations
      gfs2-utils/tests: Fix unit tests for older check libraries
      fsck.gfs2: Fix memory leaks in pass1_process_rgrp
      libgfs2: Use the correct parent for rgrp tree insertion
      libgfs2: Remove some obsolete function declarations
      gfs2-utils: Move metafs handling into gfs2/mkfs/
      gfs2_grow/jadd: Use a matching context mount option in mount_gfs2_meta
      gfs2_edit savemeta: Don't read rgrps twice
      fsck.gfs2: Fetch directory inodes early in pass2()
      libgfs2: Remove some unused data structures
      gfs2-utils: Tidy up Makefile.am files
      gfs2-utils build: Remove superfluous passive header checks
      gfs2-utils: Consolidate some "bad constants" strings
      gfs2-utils: Update translation template
      libgfs2: Fix potential NULL deref in linked_leaf_search()
      gfs2_grow: Put back the definition of FALLOC_FL_KEEP_SIZE

Bob Peterson (15):
      fsck.gfs2: Detect and correct corrupt journals
      fsck.gfs2: Change basic dentry checks for too long of file names
      fsck.gfs2: Print out block number when pass3 finds a bad directory
      fsck.gfs2: Adjust when hash table is doubled
      fsck.gfs2: Revise "undo" processing
      fsck.gfs2: remove duplicate designation during undo
      fsck.gfs2: Fix a use-after-free in pass2
      fsck.gfs2: fix double-free bug
      fsck.gfs2: Reprocess nodes if anything changed
      fsck.gfs2: Rebuild system files if they don't have the SYS bit set
      fsck.gfs2: Check the integrity of the journal index
      fsck.gfs2: rgrp block count reform
      fsck.gfs2: Change block_map to match bitmap
      fsck.gfs2: Fix journal sequence number reporting problem
      fsck.gfs2: Fix coverity error in pass4.c

From daniel.dehennin at baby-gnu.org Wed Apr 1 12:47:30 2015
From: daniel.dehennin at baby-gnu.org (Daniel Dehennin)
Date: Wed, 01 Apr 2015 14:47:30 +0200
Subject: [Linux-cluster] [ClusterLabs] dlm_controld and fencing issue
Message-ID: <87h9srlv48.fsf@hati.baby-gnu.org>

Hello,

On a 4-node OpenNebula cluster, running Ubuntu Trusty 14.04.2, with:

- corosync 2.3.3-1ubuntu1
- pacemaker 1.1.10+git20130802-1ubuntu2.3
- dlm 4.0.1-0ubuntu1

Here is the node list with their IDs, to follow the logs:

- 1084811137 nebula1
- 1084811138 nebula2
- 1084811139 nebula3
- 1084811140 nebula4 (the actual DC)

I have an issue where fencing is working but dlm always waits for fencing; I needed to manually run "dlm_tool fence_ack 1084811138" this morning. Here are the logs:

Apr 1 01:29:29 nebula4 dlm_controld[6737]: 50759 fence status 1084811138 receive 1 from 1084811137 walltime 1427844569 local 50759
Apr 1 01:29:29 nebula4 kernel: [50799.162381] dlm: closing connection to node 1084811138
Apr 1 01:29:29 nebula4 dlm_controld[6737]: 50759 fence status 1084811138 receive 1 from 1084811139 walltime 1427844569 local 50759
Apr 1 01:29:29 nebula4 dlm_controld[6737]: 50759 fence request 1084811138 pid 44527 nodedown time 1427844569 fence_all dlm_stonith
Apr 1 01:29:29 nebula4 dlm_controld[6737]: 50759 fence result 1084811138 pid 44527 result 1 exit status
Apr 1 01:29:29 nebula4 dlm_controld[6737]: 50759 fence status 1084811138 receive 1 from 1084811140 walltime 1427844569 local 50759
Apr 1 01:29:29 nebula4 dlm_controld[6737]: 50759 fence request 1084811138 no actor
[...]
Apr 1 01:30:25 nebula4 dlm_controld[6737]: 50815 datastores wait for fencing
Apr 1 01:30:25 nebula4 dlm_controld[6737]: 50815 clvmd wait for fencing

The stonith actually worked:

Apr 1 01:29:30 nebula4 stonith-ng[6486]: notice: handle_request: Client crmd.6490.2707e557 wants to fence (reboot) 'nebula2' with device '(any)'
Apr 1 01:29:30 nebula4 stonith-ng[6486]: notice: initiate_remote_stonith_op: Initiating remote operation reboot for nebula2: 39eaf3a2-d7e0-417d-8a01-d2f373973d6b (0)
Apr 1 01:29:30 nebula4 stonith-ng[6486]: notice: can_fence_host_with_device: stonith-nebula1-IPMILAN can not fence nebula2: static-list
Apr 1 01:29:30 nebula4 stonith-ng[6486]: notice: can_fence_host_with_device: stonith-nebula2-IPMILAN can fence nebula2: static-list
Apr 1 01:29:30 nebula4 stonith-ng[6486]: notice: can_fence_host_with_device: stonith-one-frontend can not fence nebula2: static-list
Apr 1 01:29:30 nebula4 stonith-ng[6486]: notice: can_fence_host_with_device: stonith-nebula3-IPMILAN can not fence nebula2: static-list
Apr 1 01:29:32 nebula4 stonith-ng[6486]: notice: remote_op_done: Operation reboot of nebula2 by nebula3 for crmd.6490 at nebula4.39eaf3a2: OK

I attach the logs of the DC nebula4 from around 01:29:03, where everything worked fine (Got 4 replies, expecting: 4), to a little bit after.

To me, it looks like:

- dlm asks for fencing directly at 01:29:29; the node was fenced, since it had garbage in its /var/log/syslog exactly at 01:29:29, plus its uptime, but dlm did not get a good response

- pacemaker fences nebula2 at 01:29:30 because it's not part of the cluster anymore (since 01:29:26 [TOTEM ] ... Members left: 1084811138). This fencing works.

Do you have any idea?

Regards.
-- 
Daniel Dehennin
Retrieve my GPG key: gpg --recv-keys 0xCC1E9E5B7A6FE2DF
Fingerprint: 3E69 014E 5C23 50E8 9ED6 2AAD CC1E 9E5B 7A6F E2DF

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: nebula2-down-2015-01-04.log
URL: 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 342 bytes
Desc: not available
URL: 

From lists at alteeve.ca Wed Apr 8 00:45:13 2015
From: lists at alteeve.ca (Digimer)
Date: Tue, 07 Apr 2015 20:45:13 -0400
Subject: [Linux-cluster] gfs2-utils 3.1.8 released
In-Reply-To: <55240DED.5010608@redhat.com>
References: <55240DED.5010608@redhat.com>
Message-ID: <55247A19.1000206@alteeve.ca>

Hi Andrew,

Congrats!!

Want to add the cluster labs mailing list to your list of release announcement locations?

digimer

On 07/04/15 01:03 PM, Andrew Price wrote:
> Hi,
>
> I am happy to announce the 3.1.8 release of gfs2-utils. This release
> includes the following visible changes:
>
> * Performance improvements in fsck.gfs2, mkfs.gfs2 and gfs2_edit
>   savemeta.
> * Better checking of journals, the jindex, system inodes and inode
>   'goal' values in fsck.gfs2
> * gfs2_jadd and gfs2_grow are now separate programs instead of
>   symlinks to mkfs.gfs2.
> * Improved test suite and related documentation.
> * No longer clobbers the configure script's --sbindir option.
> * No longer depends on perl.
> * Various minor bug fixes and enhancements.
>
> See below for a complete list of changes.
The source tarball is > available from: > https://fedorahosted.org/released/gfs2-utils/gfs2-utils-3.1.8.tar.gz > > Please test, and report bugs against the gfs2-utils component of Fedora > rawhide: > > https://bugzilla.redhat.com/enter_bug.cgi?product=Fedora&component=gfs2-utils&version=rawhide > > > Regards, > Andy > > Changes since version 3.1.7: > > Abhi Das (6): > fsck.gfs2: fix broken i_goal values in inodes > gfs2_convert: use correct i_goal values instead of zeros for inodes > tests: test for incorrect inode i_goal values > mkfs.gfs2: addendum to fix broken i_goal values in inodes > gfs2_utils: more gfs2_convert i_goal fixes > gfs2-utils: more fsck.gfs2 i_goal fixes > > Andrew Price (58): > gfs2-utils tests: Build unit tests with consistent cpp flags > libgfs2: Move old rgrp layout functions into fsck.gfs2 > gfs2-utils build: Add test coverage option > fsck.gfs2: Fix memory leak in pass2 > gfs2_convert: Fix potential memory leaks in adjust_inode > gfs2_edit: Fix signed value used as array index in print_ld_blks > gfs2_edit: Set umask before calling mkstemp in savemetaopen() > gfs2_edit: Fix use-after-free in find_wrap_pt > libgfs2: Clean up broken rgrp length check > libgfs2: Remove superfluous NULL check from gfs2_rgrp_free > libgfs2: Fail fd comparison if the fds are negative > libgfs2: Fix check for O_RDONLY > fsck.gfs2: Remove dead code from scan_inode_list > mkfs.gfs2: Terminate lockproto and locktable strings explicitly > libgfs2: Add generic field assignment and print functions > gfs2_edit: Use metadata description to print and assign fields > gfs2l: Switch to lgfs2_field_assign > libgfs2: Remove device_name from struct gfs2_sbd > libgfs2: Remove path_name from struct gfs2_sbd > libgfs2: metafs_path improvements > gfs2_grow: Don't use PATH_MAX in main_grow > gfs2_jadd: Don't use fixed size buffers for paths > libgfs2: Remove orig_journals from struct gfs2_sbd > gfs2l: Check unchecked returns in openfs > gfs2-utils configure: Fix exit with failure condition > gfs2-utils configure: Remove checks for non-existent -W flags > gfs2_convert: Don't use a fixed sized buffer for device path > gfs2_edit: Add bounds checking for the journalN keyword > libgfs2: Make find_good_lh and jhead_scan static > Build gfs2_grow, gfs2_jadd and mkfs.gfs2 separately > gfs2-utils: Honour --sbindir > gfs2-utils configure: Use AC_HELP_STRING in help messages > fsck.gfs2: Improve reporting of pass timings > mkfs.gfs2: Revert default resource group size > gfs2-utils tests: Add keywords to tests > gfs2-utils tests: Shorten TESTSUITEFLAGS to TOPTS > gfs2-utils tests: Improve docs > gfs2-utils tests: Skip unit tests if check is not found > gfs2-utils tests: Document usage of convenience macros > fsck.gfs2: Fix 'initializer element is not constant' build error > fsck.gfs2: Simplify bad_journalname > gfs2-utils build: Add a configure script summary > mkfs.gfs2: Remove unused declarations > gfs2-utils/tests: Fix unit tests for older check libraries > fsck.gfs2: Fix memory leaks in pass1_process_rgrp > libgfs2: Use the correct parent for rgrp tree insertion > libgfs2: Remove some obsolete function declarations > gfs2-utils: Move metafs handling into gfs2/mkfs/ > gfs2_grow/jadd: Use a matching context mount option in > mount_gfs2_meta > gfs2_edit savemeta: Don't read rgrps twice > fsck.gfs2: Fetch directory inodes early in pass2() > libgfs2: Remove some unused data structures > gfs2-utils: Tidy up Makefile.am files > gfs2-utils build: Remove superfluous passive header checks > gfs2-utils: Consolidate some "bad 
constants" strings > gfs2-utils: Update translation template > libgfs2: Fix potential NULL deref in linked_leaf_search() > gfs2_grow: Put back the definition of FALLOC_FL_KEEP_SIZE > > Bob Peterson (15): > fsck.gfs2: Detect and correct corrupt journals > fsck.gfs2: Change basic dentry checks for too long of file names > fsck.gfs2: Print out block number when pass3 finds a bad directory > fsck.gfs2: Adjust when hash table is doubled > fsck.gfs2: Revise "undo" processing > fsck.gfs2: remove duplicate designation during undo > fsck.gfs2: Fix a use-after-free in pass2 > fsck.gfs2: fix double-free bug > fsck.gfs2: Reprocess nodes if anything changed > fsck.gfs2: Rebuild system files if they don't have the SYS bit set > fsck.gfs2: Check the integrity of the journal index > fsck.gfs2: rgrp block count reform > fsck.gfs2: Change block_map to match bitmap > fsck.gfs2: Fix journal sequence number reporting problem > fsck.gfs2: Fix coverity error in pass4.c > -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? From anprice at redhat.com Wed Apr 8 02:09:37 2015 From: anprice at redhat.com (Andrew Price) Date: Wed, 08 Apr 2015 03:09:37 +0100 Subject: [Linux-cluster] gfs2-utils 3.1.8 released In-Reply-To: <55247A19.1000206@alteeve.ca> References: <55240DED.5010608@redhat.com> <55247A19.1000206@alteeve.ca> Message-ID: <55248DE1.4040003@redhat.com> On 08/04/15 01:45, Digimer wrote: > Hi Andrew, > > Congrats!! > > Want to add the cluster labs mailing list to your list of release > announcement locations? > > digimer That's a great idea, I will. I haven't subscribed to the Cluster Labs list yet but I'm just about to :) Thanks, Andy > > On 07/04/15 01:03 PM, Andrew Price wrote: >> Hi, >> >> I am happy to announce the 3.1.8 release of gfs2-utils. This release >> includes the following visible changes: >> >> * Performance improvements in fsck.gfs2, mkfs.gfs2 and gfs2_edit >> savemeta. >> * Better checking of journals, the jindex, system inodes and inode >> 'goal' values in fsck.gfs2 >> * gfs2_jadd and gfs2_grow are now separate programs instead of >> symlinks to mkfs.gfs2. >> * Improved test suite and related documentation. >> * No longer clobbers the configure script's --sbindir option. >> * No longer depends on perl. >> * Various minor bug fixes and enhancements. >> >> See below for a complete list of changes. 
The source tarball is >> available from: >> https://fedorahosted.org/released/gfs2-utils/gfs2-utils-3.1.8.tar.gz >> >> Please test, and report bugs against the gfs2-utils component of Fedora >> rawhide: >> >> https://bugzilla.redhat.com/enter_bug.cgi?product=Fedora&component=gfs2-utils&version=rawhide >> >> >> Regards, >> Andy >> >> Changes since version 3.1.7: >> >> Abhi Das (6): >> fsck.gfs2: fix broken i_goal values in inodes >> gfs2_convert: use correct i_goal values instead of zeros for inodes >> tests: test for incorrect inode i_goal values >> mkfs.gfs2: addendum to fix broken i_goal values in inodes >> gfs2_utils: more gfs2_convert i_goal fixes >> gfs2-utils: more fsck.gfs2 i_goal fixes >> >> Andrew Price (58): >> gfs2-utils tests: Build unit tests with consistent cpp flags >> libgfs2: Move old rgrp layout functions into fsck.gfs2 >> gfs2-utils build: Add test coverage option >> fsck.gfs2: Fix memory leak in pass2 >> gfs2_convert: Fix potential memory leaks in adjust_inode >> gfs2_edit: Fix signed value used as array index in print_ld_blks >> gfs2_edit: Set umask before calling mkstemp in savemetaopen() >> gfs2_edit: Fix use-after-free in find_wrap_pt >> libgfs2: Clean up broken rgrp length check >> libgfs2: Remove superfluous NULL check from gfs2_rgrp_free >> libgfs2: Fail fd comparison if the fds are negative >> libgfs2: Fix check for O_RDONLY >> fsck.gfs2: Remove dead code from scan_inode_list >> mkfs.gfs2: Terminate lockproto and locktable strings explicitly >> libgfs2: Add generic field assignment and print functions >> gfs2_edit: Use metadata description to print and assign fields >> gfs2l: Switch to lgfs2_field_assign >> libgfs2: Remove device_name from struct gfs2_sbd >> libgfs2: Remove path_name from struct gfs2_sbd >> libgfs2: metafs_path improvements >> gfs2_grow: Don't use PATH_MAX in main_grow >> gfs2_jadd: Don't use fixed size buffers for paths >> libgfs2: Remove orig_journals from struct gfs2_sbd >> gfs2l: Check unchecked returns in openfs >> gfs2-utils configure: Fix exit with failure condition >> gfs2-utils configure: Remove checks for non-existent -W flags >> gfs2_convert: Don't use a fixed sized buffer for device path >> gfs2_edit: Add bounds checking for the journalN keyword >> libgfs2: Make find_good_lh and jhead_scan static >> Build gfs2_grow, gfs2_jadd and mkfs.gfs2 separately >> gfs2-utils: Honour --sbindir >> gfs2-utils configure: Use AC_HELP_STRING in help messages >> fsck.gfs2: Improve reporting of pass timings >> mkfs.gfs2: Revert default resource group size >> gfs2-utils tests: Add keywords to tests >> gfs2-utils tests: Shorten TESTSUITEFLAGS to TOPTS >> gfs2-utils tests: Improve docs >> gfs2-utils tests: Skip unit tests if check is not found >> gfs2-utils tests: Document usage of convenience macros >> fsck.gfs2: Fix 'initializer element is not constant' build error >> fsck.gfs2: Simplify bad_journalname >> gfs2-utils build: Add a configure script summary >> mkfs.gfs2: Remove unused declarations >> gfs2-utils/tests: Fix unit tests for older check libraries >> fsck.gfs2: Fix memory leaks in pass1_process_rgrp >> libgfs2: Use the correct parent for rgrp tree insertion >> libgfs2: Remove some obsolete function declarations >> gfs2-utils: Move metafs handling into gfs2/mkfs/ >> gfs2_grow/jadd: Use a matching context mount option in >> mount_gfs2_meta >> gfs2_edit savemeta: Don't read rgrps twice >> fsck.gfs2: Fetch directory inodes early in pass2() >> libgfs2: Remove some unused data structures >> gfs2-utils: Tidy up Makefile.am files >> gfs2-utils build: 
Remove superfluous passive header checks >> gfs2-utils: Consolidate some "bad constants" strings >> gfs2-utils: Update translation template >> libgfs2: Fix potential NULL deref in linked_leaf_search() >> gfs2_grow: Put back the definition of FALLOC_FL_KEEP_SIZE >> >> Bob Peterson (15): >> fsck.gfs2: Detect and correct corrupt journals >> fsck.gfs2: Change basic dentry checks for too long of file names >> fsck.gfs2: Print out block number when pass3 finds a bad directory >> fsck.gfs2: Adjust when hash table is doubled >> fsck.gfs2: Revise "undo" processing >> fsck.gfs2: remove duplicate designation during undo >> fsck.gfs2: Fix a use-after-free in pass2 >> fsck.gfs2: fix double-free bug >> fsck.gfs2: Reprocess nodes if anything changed >> fsck.gfs2: Rebuild system files if they don't have the SYS bit set >> fsck.gfs2: Check the integrity of the journal index >> fsck.gfs2: rgrp block count reform >> fsck.gfs2: Change block_map to match bitmap >> fsck.gfs2: Fix journal sequence number reporting problem >> fsck.gfs2: Fix coverity error in pass4.c >> > > From andrew at beekhof.net Mon Apr 13 03:19:37 2015 From: andrew at beekhof.net (Andrew Beekhof) Date: Mon, 13 Apr 2015 13:19:37 +1000 Subject: [Linux-cluster] [ClusterLabs] dlm_controld and fencing issue In-Reply-To: <87h9srlv48.fsf@hati.baby-gnu.org> References: <87h9srlv48.fsf@hati.baby-gnu.org> Message-ID: <1A2FDA6D-1295-448C-99D2-F8BF8EC5C5E1@beekhof.net> > On 1 Apr 2015, at 11:47 pm, Daniel Dehennin wrote: > > Hello, > > On a 4 nodes OpenNebula cluster, running Ubuntu Trusty 14.04.2, with: > > - corosync 2.3.3-1ubuntu1 > - pacemaker 1.1.10+git20130802-1ubuntu2.3 > - dlm 4.0.1-0ubuntu1 > > Here is the node list with their IDs, to follow the logs: > > - 1084811137 nebula1 > - 1084811138 nebula2 > - 1084811139 nebula3 > - 1084811140 nebula4 (the actual DC) > > I have an issue where fencing is working but dlm always wait for > fencing, I needed to manually run ?dlm_tool fence_ack 1084811138? this > morning, here are the logs: > > Apr 1 01:29:29 nebula4 dlm_controld[6737]: 50759 fence status 1084811138 receive 1 from 1084811137 walltime 1427844569 local 50759 > Apr 1 01:29:29 nebula4 kernel: [50799.162381] dlm: closing connection to node 1084811138 > Apr 1 01:29:29 nebula4 dlm_controld[6737]: 50759 fence status 1084811138 receive 1 from 1084811139 walltime 1427844569 local 50759 > Apr 1 01:29:29 nebula4 dlm_controld[6737]: 50759 fence request 1084811138 pid 44527 nodedown time 1427844569 fence_all dlm_stonith > Apr 1 01:29:29 nebula4 dlm_controld[6737]: 50759 fence result 1084811138 pid 44527 result 1 exit status > Apr 1 01:29:29 nebula4 dlm_controld[6737]: 50759 fence status 1084811138 receive 1 from 1084811140 walltime 1427844569 local 50759 > Apr 1 01:29:29 nebula4 dlm_controld[6737]: 50759 fence request 1084811138 no actor > [...] 
> Apr 1 01:30:25 nebula4 dlm_controld[6737]: 50815 datastores wait for fencing > Apr 1 01:30:25 nebula4 dlm_controld[6737]: 50815 clvmd wait for fencing > > > The stonith actually worked: > > Apr 1 01:29:30 nebula4 stonith-ng[6486]: notice: handle_request: Client crmd.6490.2707e557 wants to fence (reboot) 'nebula2' with device '(any)' > Apr 1 01:29:30 nebula4 stonith-ng[6486]: notice: initiate_remote_stonith_op: Initiating remote operation reboot for nebula2: 39eaf3a2-d7e0-417d-8a01-d2f373973d6b (0) > Apr 1 01:29:30 nebula4 stonith-ng[6486]: notice: can_fence_host_with_device: stonith-nebula1-IPMILAN can not fence nebula2: static-list > Apr 1 01:29:30 nebula4 stonith-ng[6486]: notice: can_fence_host_with_device: stonith-nebula2-IPMILAN can fence nebula2: static-list > Apr 1 01:29:30 nebula4 stonith-ng[6486]: notice: can_fence_host_with_device: stonith-one-frontend can not fence nebula2: static-list > Apr 1 01:29:30 nebula4 stonith-ng[6486]: notice: can_fence_host_with_device: stonith-nebula3-IPMILAN can not fence nebula2: static-list > Apr 1 01:29:32 nebula4 stonith-ng[6486]: notice: remote_op_done: Operation reboot of nebula2 by nebula3 for crmd.6490 at nebula4.39eaf3a2: OK > > I attache the logs of the DC nebula4 around from 01:29:03, where > everything worked fine (Got 4 replies, expecting: 4) to a little bit > after. > > To me, it looks like: > > - dlm ask for fencing directly at 01:29:29, the node was fenced since it > had garbage in its /var/log/syslog exactely at 01:29.29, plus its > uptime, but did not get a good response > > - pacemaker fence nebula2 at 01:29:30 because it's not part of the > cluster anymore (since 01:29:26 [TOTEM ] ... Members left: 1084811138) > This fencing works. > > Do you have any idea? There were two important fixes in this area since 1.1.10 + David Vossel (1 year, 1 month ago) 054fedf: Fix: stonith_api_time_helper now returns when the most recent fencing operation completed + Andrew Beekhof (1 year, 1 month ago) d9921e5: Fix: Fencing: Pass the correct options when looking up the history by node name Whether 'pacemaker 1.1.10+git20130802-1ubuntu2.3? includes them is anybody?s guess. > > Regards. > -- > Daniel Dehennin > R?cup?rer ma clef GPG: gpg --recv-keys 0xCC1E9E5B7A6FE2DF > Fingerprint: 3E69 014E 5C23 50E8 9ED6 2AAD CC1E 9E5B 7A6F E2DF > > Apr 1 01:29:03 nebula4 lvm[6759]: Waiting for next pre command > Apr 1 01:29:03 nebula4 lvm[6759]: read on PIPE 12: 4 bytes: status: 0 > Apr 1 01:29:03 nebula4 lvm[6759]: background routine status was 0, sock_client=0x218eab0 > Apr 1 01:29:03 nebula4 lvm[6759]: Send local reply > Apr 1 01:29:03 nebula4 lvm[6759]: Read on local socket 5, len = 31 > Apr 1 01:29:03 nebula4 lvm[6759]: check_all_clvmds_running > Apr 1 01:29:03 nebula4 lvm[6759]: down_callback. node 1084811137, state = 3 > Apr 1 01:29:03 nebula4 lvm[6759]: down_callback. node 1084811139, state = 3 > Apr 1 01:29:03 nebula4 lvm[6759]: down_callback. node 1084811138, state = 3 > Apr 1 01:29:03 nebula4 lvm[6759]: down_callback. node 1084811140, state = 3 > Apr 1 01:29:03 nebula4 lvm[6759]: Got pre command condition... 
> Apr 1 01:29:03 nebula4 lvm[6759]: Writing status 0 down pipe 13 > Apr 1 01:29:03 nebula4 lvm[6759]: Waiting to do post command - state = 0 > Apr 1 01:29:03 nebula4 lvm[6759]: read on PIPE 12: 4 bytes: status: 0 > Apr 1 01:29:03 nebula4 lvm[6759]: background routine status was 0, sock_client=0x218eab0 > Apr 1 01:29:03 nebula4 lvm[6759]: distribute command: XID = 43973, flags=0x0 () > Apr 1 01:29:03 nebula4 lvm[6759]: num_nodes = 4 > Apr 1 01:29:03 nebula4 lvm[6759]: add_to_lvmqueue: cmd=0x218f100. client=0x218eab0, msg=0x218ebc0, len=31, csid=(nil), xid=43973 > Apr 1 01:29:03 nebula4 lvm[6759]: Sending message to all cluster nodes > Apr 1 01:29:03 nebula4 lvm[6759]: process_work_item: local > Apr 1 01:29:03 nebula4 lvm[6759]: process_local_command: SYNC_NAMES (0x2d) msg=0x218ed00, msglen =31, client=0x218eab0 > Apr 1 01:29:03 nebula4 lvm[6759]: Syncing device names > Apr 1 01:29:03 nebula4 lvm[6759]: Reply from node 40a8e784: 0 bytes > Apr 1 01:29:03 nebula4 lvm[6759]: Got 1 replies, expecting: 4 > Apr 1 01:29:03 nebula4 lvm[6759]: LVM thread waiting for work > Apr 1 01:29:03 nebula4 lvm[6759]: 1084811140 got message from nodeid 1084811140 for 0. len 31 > Apr 1 01:29:03 nebula4 lvm[6759]: 1084811140 got message from nodeid 1084811138 for 1084811140. len 18 > Apr 1 01:29:03 nebula4 lvm[6759]: Reply from node 40a8e782: 0 bytes > Apr 1 01:29:03 nebula4 lvm[6759]: Got 2 replies, expecting: 4 > Apr 1 01:29:03 nebula4 lvm[6759]: 1084811140 got message from nodeid 1084811139 for 1084811140. len 18 > Apr 1 01:29:03 nebula4 lvm[6759]: Reply from node 40a8e783: 0 bytes > Apr 1 01:29:03 nebula4 lvm[6759]: Got 3 replies, expecting: 4 > Apr 1 01:29:03 nebula4 lvm[6759]: 1084811140 got message from nodeid 1084811137 for 1084811140. len 18 > Apr 1 01:29:03 nebula4 lvm[6759]: Reply from node 40a8e781: 0 bytes > Apr 1 01:29:03 nebula4 lvm[6759]: Got 4 replies, expecting: 4 > Apr 1 01:29:03 nebula4 lvm[6759]: Got post command condition... > Apr 1 01:29:03 nebula4 lvm[6759]: Waiting for next pre command > Apr 1 01:29:03 nebula4 lvm[6759]: read on PIPE 12: 4 bytes: status: 0 > Apr 1 01:29:03 nebula4 lvm[6759]: background routine status was 0, sock_client=0x218eab0 > Apr 1 01:29:03 nebula4 lvm[6759]: Send local reply > Apr 1 01:29:03 nebula4 lvm[6759]: Read on local socket 5, len = 30 > Apr 1 01:29:03 nebula4 lvm[6759]: Got pre command condition... > Apr 1 01:29:03 nebula4 lvm[6759]: doing PRE command LOCK_VG 'V_vg-one-2' at 6 (client=0x218eab0) > Apr 1 01:29:03 nebula4 lvm[6759]: unlock_resource: V_vg-one-2 lockid: 1 > Apr 1 01:29:03 nebula4 lvm[6759]: Writing status 0 down pipe 13 > Apr 1 01:29:03 nebula4 lvm[6759]: Waiting to do post command - state = 0 > Apr 1 01:29:03 nebula4 lvm[6759]: read on PIPE 12: 4 bytes: status: 0 > Apr 1 01:29:03 nebula4 lvm[6759]: background routine status was 0, sock_client=0x218eab0 > Apr 1 01:29:03 nebula4 lvm[6759]: distribute command: XID = 43974, flags=0x1 (LOCAL) > Apr 1 01:29:03 nebula4 lvm[6759]: add_to_lvmqueue: cmd=0x218ed00. 
client=0x218eab0, msg=0x218ebc0, len=30, csid=(nil), xid=43974 > Apr 1 01:29:03 nebula4 lvm[6759]: process_work_item: local > Apr 1 01:29:03 nebula4 lvm[6759]: process_local_command: LOCK_VG (0x33) msg=0x218ed40, msglen =30, client=0x218eab0 > Apr 1 01:29:03 nebula4 lvm[6759]: do_lock_vg: resource 'V_vg-one-2', cmd = 0x6 LCK_VG (UNLOCK|VG), flags = 0x0 ( ), critical_section = 0 > Apr 1 01:29:03 nebula4 lvm[6759]: Invalidating cached metadata for VG vg-one-2 > Apr 1 01:29:03 nebula4 lvm[6759]: Reply from node 40a8e784: 0 bytes > Apr 1 01:29:03 nebula4 lvm[6759]: Got 1 replies, expecting: 1 > Apr 1 01:29:03 nebula4 lvm[6759]: LVM thread waiting for work > Apr 1 01:29:03 nebula4 lvm[6759]: Got post command condition... > Apr 1 01:29:03 nebula4 lvm[6759]: Waiting for next pre command > Apr 1 01:29:03 nebula4 lvm[6759]: read on PIPE 12: 4 bytes: status: 0 > Apr 1 01:29:03 nebula4 lvm[6759]: background routine status was 0, sock_client=0x218eab0 > Apr 1 01:29:03 nebula4 lvm[6759]: Send local reply > Apr 1 01:29:03 nebula4 lvm[6759]: Read on local socket 5, len = 0 > Apr 1 01:29:03 nebula4 lvm[6759]: EOF on local socket: inprogress=0 > Apr 1 01:29:03 nebula4 lvm[6759]: Waiting for child thread > Apr 1 01:29:03 nebula4 lvm[6759]: Got pre command condition... > Apr 1 01:29:03 nebula4 lvm[6759]: Subthread finished > Apr 1 01:29:03 nebula4 lvm[6759]: Joined child thread > Apr 1 01:29:03 nebula4 lvm[6759]: ret == 0, errno = 0. removing client > Apr 1 01:29:03 nebula4 lvm[6759]: add_to_lvmqueue: cmd=0x218ebc0. client=0x218eab0, msg=(nil), len=0, csid=(nil), xid=43974 > Apr 1 01:29:03 nebula4 lvm[6759]: process_work_item: free fd -1 > Apr 1 01:29:03 nebula4 lvm[6759]: LVM thread waiting for work > Apr 1 01:29:16 nebula4 lvm[6759]: 1084811140 got message from nodeid 1084811138 for 0. len 31 > Apr 1 01:29:16 nebula4 lvm[6759]: add_to_lvmqueue: cmd=0x218eea0. client=0x6a1d60, msg=0x7fff260dfcac, len=31, csid=0x7fff260de67c, xid=0 > Apr 1 01:29:16 nebula4 lvm[6759]: process_work_item: remote > Apr 1 01:29:16 nebula4 lvm[6759]: process_remote_command SYNC_NAMES (0x2d) for clientid 0x5000000 XID 39602 on node 40a8e782 > Apr 1 01:29:16 nebula4 lvm[6759]: Syncing device names > Apr 1 01:29:16 nebula4 lvm[6759]: LVM thread waiting for work > Apr 1 01:29:16 nebula4 lvm[6759]: 1084811140 got message from nodeid 1084811137 for 1084811138. len 18 > Apr 1 01:29:16 nebula4 lvm[6759]: 1084811140 got message from nodeid 1084811139 for 1084811138. len 18 > Apr 1 01:29:16 nebula4 lvm[6759]: 1084811140 got message from nodeid 1084811140 for 1084811138. len 18 > Apr 1 01:29:16 nebula4 lvm[6759]: 1084811140 got message from nodeid 1084811137 for 0. len 31 > Apr 1 01:29:16 nebula4 lvm[6759]: add_to_lvmqueue: cmd=0x218eab0. client=0x6a1d60, msg=0x7fff260dfcac, len=31, csid=0x7fff260de67c, xid=0 > Apr 1 01:29:16 nebula4 lvm[6759]: process_work_item: remote > Apr 1 01:29:16 nebula4 lvm[6759]: process_remote_command SYNC_NAMES (0x2d) for clientid 0x5000000 XID 44354 on node 40a8e781 > Apr 1 01:29:16 nebula4 lvm[6759]: Syncing device names > Apr 1 01:29:16 nebula4 lvm[6759]: LVM thread waiting for work > Apr 1 01:29:16 nebula4 lvm[6759]: 1084811140 got message from nodeid 1084811140 for 1084811137. len 18 > Apr 1 01:29:16 nebula4 lvm[6759]: 1084811140 got message from nodeid 1084811138 for 1084811137. len 18 > Apr 1 01:29:16 nebula4 lvm[6759]: 1084811140 got message from nodeid 1084811139 for 1084811137. len 18 > Apr 1 01:29:16 nebula4 lvm[6759]: 1084811140 got message from nodeid 1084811138 for 0. 
len 31 > Apr 1 01:29:16 nebula4 lvm[6759]: add_to_lvmqueue: cmd=0x218eea0. client=0x6a1d60, msg=0x7fff260dfcac, len=31, csid=0x7fff260de67c, xid=0 > Apr 1 01:29:16 nebula4 lvm[6759]: process_work_item: remote > Apr 1 01:29:16 nebula4 lvm[6759]: process_remote_command SYNC_NAMES (0x2d) for clientid 0x5000000 XID 39605 on node 40a8e782 > Apr 1 01:29:16 nebula4 lvm[6759]: Syncing device names > Apr 1 01:29:16 nebula4 lvm[6759]: LVM thread waiting for work > Apr 1 01:29:16 nebula4 lvm[6759]: 1084811140 got message from nodeid 1084811137 for 1084811138. len 18 > Apr 1 01:29:16 nebula4 lvm[6759]: 1084811140 got message from nodeid 1084811139 for 1084811138. len 18 > Apr 1 01:29:16 nebula4 lvm[6759]: 1084811140 got message from nodeid 1084811140 for 1084811138. len 18 > Apr 1 01:29:16 nebula4 lvm[6759]: 1084811140 got message from nodeid 1084811137 for 0. len 31 > Apr 1 01:29:16 nebula4 lvm[6759]: add_to_lvmqueue: cmd=0x218eab0. client=0x6a1d60, msg=0x7fff260dfcac, len=31, csid=0x7fff260de67c, xid=0 > Apr 1 01:29:16 nebula4 lvm[6759]: process_work_item: remote > Apr 1 01:29:16 nebula4 lvm[6759]: process_remote_command SYNC_NAMES (0x2d) for clientid 0x5000000 XID 44357 on node 40a8e781 > Apr 1 01:29:16 nebula4 lvm[6759]: Syncing device names > Apr 1 01:29:16 nebula4 lvm[6759]: LVM thread waiting for work > Apr 1 01:29:16 nebula4 lvm[6759]: 1084811140 got message from nodeid 1084811140 for 1084811137. len 18 > Apr 1 01:29:16 nebula4 lvm[6759]: 1084811140 got message from nodeid 1084811138 for 1084811137. len 18 > Apr 1 01:29:16 nebula4 lvm[6759]: 1084811140 got message from nodeid 1084811139 for 1084811137. len 18 > Apr 1 01:29:16 nebula4 lvm[6759]: 1084811140 got message from nodeid 1084811138 for 0. len 31 > Apr 1 01:29:16 nebula4 lvm[6759]: add_to_lvmqueue: cmd=0x218eea0. client=0x6a1d60, msg=0x7fff260dfcac, len=31, csid=0x7fff260de67c, xid=0 > Apr 1 01:29:16 nebula4 lvm[6759]: process_work_item: remote > Apr 1 01:29:16 nebula4 lvm[6759]: process_remote_command SYNC_NAMES (0x2d) for clientid 0x5000000 XID 39608 on node 40a8e782 > Apr 1 01:29:16 nebula4 lvm[6759]: Syncing device names > Apr 1 01:29:16 nebula4 lvm[6759]: LVM thread waiting for work > Apr 1 01:29:16 nebula4 lvm[6759]: 1084811140 got message from nodeid 1084811137 for 1084811138. len 18 > Apr 1 01:29:16 nebula4 lvm[6759]: 1084811140 got message from nodeid 1084811137 for 0. len 31 > Apr 1 01:29:16 nebula4 lvm[6759]: add_to_lvmqueue: cmd=0x218eab0. client=0x6a1d60, msg=0x7fff260dfcac, len=31, csid=0x7fff260de67c, xid=0 > Apr 1 01:29:16 nebula4 lvm[6759]: process_work_item: remote > Apr 1 01:29:16 nebula4 lvm[6759]: process_remote_command SYNC_NAMES (0x2d) for clientid 0x5000000 XID 44360 on node 40a8e781 > Apr 1 01:29:16 nebula4 lvm[6759]: Syncing device names > Apr 1 01:29:16 nebula4 lvm[6759]: LVM thread waiting for work > Apr 1 01:29:16 nebula4 lvm[6759]: 1084811140 got message from nodeid 1084811139 for 1084811138. len 18 > Apr 1 01:29:16 nebula4 lvm[6759]: 1084811140 got message from nodeid 1084811140 for 1084811138. len 18 > Apr 1 01:29:16 nebula4 lvm[6759]: 1084811140 got message from nodeid 1084811140 for 1084811137. len 18 > Apr 1 01:29:16 nebula4 lvm[6759]: 1084811140 got message from nodeid 1084811138 for 1084811137. len 18 > Apr 1 01:29:16 nebula4 lvm[6759]: 1084811140 got message from nodeid 1084811139 for 1084811137. len 18 > Apr 1 01:29:16 nebula4 lvm[6759]: 1084811140 got message from nodeid 1084811137 for 0. len 31 > Apr 1 01:29:16 nebula4 lvm[6759]: add_to_lvmqueue: cmd=0x218eea0. 
client=0x6a1d60, msg=0x7fff260dfcac, len=31, csid=0x7fff260de67c, xid=0 > Apr 1 01:29:16 nebula4 lvm[6759]: process_work_item: remote > Apr 1 01:29:16 nebula4 lvm[6759]: process_remote_command SYNC_NAMES (0x2d) for clientid 0x5000000 XID 44363 on node 40a8e781 > Apr 1 01:29:16 nebula4 lvm[6759]: Syncing device names > Apr 1 01:29:16 nebula4 lvm[6759]: LVM thread waiting for work > Apr 1 01:29:16 nebula4 lvm[6759]: 1084811140 got message from nodeid 1084811140 for 1084811137. len 18 > Apr 1 01:29:23 nebula4 lvm[6759]: Got new connection on fd 5 > Apr 1 01:29:23 nebula4 lvm[6759]: Read on local socket 5, len = 30 > Apr 1 01:29:23 nebula4 lvm[6759]: creating pipe, [12, 13] > Apr 1 01:29:23 nebula4 lvm[6759]: Creating pre&post thread > Apr 1 01:29:23 nebula4 lvm[6759]: Created pre&post thread, state = 0 > Apr 1 01:29:23 nebula4 lvm[6759]: in sub thread: client = 0x218eab0 > Apr 1 01:29:23 nebula4 lvm[6759]: doing PRE command LOCK_VG 'V_vg-one-0' at 1 (client=0x218eab0) > Apr 1 01:29:23 nebula4 lvm[6759]: lock_resource 'V_vg-one-0', flags=0, mode=3 > Apr 1 01:29:23 nebula4 lvm[6759]: lock_resource returning 0, lock_id=1 > Apr 1 01:29:23 nebula4 lvm[6759]: Writing status 0 down pipe 13 > Apr 1 01:29:23 nebula4 lvm[6759]: Waiting to do post command - state = 0 > Apr 1 01:29:23 nebula4 lvm[6759]: read on PIPE 12: 4 bytes: status: 0 > Apr 1 01:29:23 nebula4 lvm[6759]: background routine status was 0, sock_client=0x218eab0 > Apr 1 01:29:23 nebula4 lvm[6759]: distribute command: XID = 43975, flags=0x1 (LOCAL) > Apr 1 01:29:23 nebula4 lvm[6759]: add_to_lvmqueue: cmd=0x218ed00. client=0x218eab0, msg=0x218ebc0, len=30, csid=(nil), xid=43975 > Apr 1 01:29:23 nebula4 lvm[6759]: process_work_item: local > Apr 1 01:29:23 nebula4 lvm[6759]: process_local_command: LOCK_VG (0x33) msg=0x218ed40, msglen =30, client=0x218eab0 > Apr 1 01:29:23 nebula4 lvm[6759]: do_lock_vg: resource 'V_vg-one-0', cmd = 0x1 LCK_VG (READ|VG), flags = 0x0 ( ), critical_section = 0 > Apr 1 01:29:23 nebula4 lvm[6759]: Invalidating cached metadata for VG vg-one-0 > Apr 1 01:29:23 nebula4 lvm[6759]: Reply from node 40a8e784: 0 bytes > Apr 1 01:29:23 nebula4 lvm[6759]: Got 1 replies, expecting: 1 > Apr 1 01:29:23 nebula4 lvm[6759]: LVM thread waiting for work > Apr 1 01:29:23 nebula4 lvm[6759]: Got post command condition... > Apr 1 01:29:23 nebula4 lvm[6759]: Waiting for next pre command > Apr 1 01:29:23 nebula4 lvm[6759]: read on PIPE 12: 4 bytes: status: 0 > Apr 1 01:29:23 nebula4 lvm[6759]: background routine status was 0, sock_client=0x218eab0 > Apr 1 01:29:23 nebula4 lvm[6759]: Send local reply > Apr 1 01:29:23 nebula4 lvm[6759]: Read on local socket 5, len = 31 > Apr 1 01:29:23 nebula4 lvm[6759]: check_all_clvmds_running > Apr 1 01:29:23 nebula4 lvm[6759]: down_callback. node 1084811137, state = 3 > Apr 1 01:29:23 nebula4 lvm[6759]: down_callback. node 1084811139, state = 3 > Apr 1 01:29:23 nebula4 lvm[6759]: down_callback. node 1084811138, state = 3 > Apr 1 01:29:23 nebula4 lvm[6759]: down_callback. node 1084811140, state = 3 > Apr 1 01:29:23 nebula4 lvm[6759]: Got pre command condition... 
> Apr 1 01:29:23 nebula4 lvm[6759]: Writing status 0 down pipe 13 > Apr 1 01:29:23 nebula4 lvm[6759]: Waiting to do post command - state = 0 > Apr 1 01:29:23 nebula4 lvm[6759]: read on PIPE 12: 4 bytes: status: 0 > Apr 1 01:29:23 nebula4 lvm[6759]: background routine status was 0, sock_client=0x218eab0 > Apr 1 01:29:23 nebula4 lvm[6759]: distribute command: XID = 43976, flags=0x0 () > Apr 1 01:29:23 nebula4 lvm[6759]: num_nodes = 4 > Apr 1 01:29:23 nebula4 lvm[6759]: add_to_lvmqueue: cmd=0x218f100. client=0x218eab0, msg=0x218ebc0, len=31, csid=(nil), xid=43976 > Apr 1 01:29:23 nebula4 lvm[6759]: Sending message to all cluster nodes > Apr 1 01:29:23 nebula4 lvm[6759]: process_work_item: local > Apr 1 01:29:23 nebula4 lvm[6759]: process_local_command: SYNC_NAMES (0x2d) msg=0x218ed00, msglen =31, client=0x218eab0 > Apr 1 01:29:23 nebula4 lvm[6759]: Syncing device names > Apr 1 01:29:23 nebula4 lvm[6759]: Reply from node 40a8e784: 0 bytes > Apr 1 01:29:23 nebula4 lvm[6759]: Got 1 replies, expecting: 4 > Apr 1 01:29:23 nebula4 lvm[6759]: LVM thread waiting for work > Apr 1 01:29:26 nebula4 corosync[6411]: [TOTEM ] A processor failed, forming new configuration. > Apr 1 01:29:29 nebula4 corosync[6411]: [TOTEM ] A new membership (192.168.231.129:1204) was formed. Members left: 1084811138 > Apr 1 01:29:29 nebula4 lvm[6759]: confchg callback. 0 joined, 1 left, 3 members > Apr 1 01:29:29 nebula4 crmd[6490]: warning: match_down_event: No match for shutdown action on 1084811138 > Apr 1 01:29:29 nebula4 crmd[6490]: notice: peer_update_callback: Stonith/shutdown of nebula2 not matched > Apr 1 01:29:29 nebula4 crmd[6490]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ] > Apr 1 01:29:29 nebula4 corosync[6411]: [QUORUM] Members[3]: 1084811137 1084811139 1084811140 > Apr 1 01:29:29 nebula4 corosync[6411]: [MAIN ] Completed service synchronization, ready to provide service. > Apr 1 01:29:29 nebula4 crmd[6490]: notice: crm_update_peer_state: pcmk_quorum_notification: Node nebula2[1084811138] - state is now lost (was member) > Apr 1 01:29:29 nebula4 crmd[6490]: warning: match_down_event: No match for shutdown action on 1084811138 > Apr 1 01:29:29 nebula4 pacemakerd[6483]: notice: crm_update_peer_state: pcmk_quorum_notification: Node nebula2[1084811138] - state is now lost (was member) > Apr 1 01:29:29 nebula4 crmd[6490]: notice: peer_update_callback: Stonith/shutdown of nebula2 not matched > Apr 1 01:29:29 nebula4 lvm[6759]: 1084811140 got message from nodeid 1084811140 for 0. len 31 > Apr 1 01:29:29 nebula4 lvm[6759]: 1084811140 got message from nodeid 1084811139 for 1084811137. len 18 > Apr 1 01:29:29 nebula4 lvm[6759]: 1084811140 got message from nodeid 1084811139 for 0. len 31 > Apr 1 01:29:29 nebula4 lvm[6759]: add_to_lvmqueue: cmd=0x218ed00. client=0x6a1d60, msg=0x7fff260dfcac, len=31, csid=0x7fff260de67c, xid=0 > Apr 1 01:29:29 nebula4 lvm[6759]: process_work_item: remote > Apr 1 01:29:29 nebula4 lvm[6759]: process_remote_command SYNC_NAMES (0x2d) for clientid 0x5000000 XID 43802 on node 40a8e783 > Apr 1 01:29:29 nebula4 lvm[6759]: Syncing device names > Apr 1 01:29:29 nebula4 lvm[6759]: LVM thread waiting for work > Apr 1 01:29:29 nebula4 lvm[6759]: 1084811140 got message from nodeid 1084811139 for 1084811140. 
len 18 > Apr 1 01:29:29 nebula4 lvm[6759]: Reply from node 40a8e783: 0 bytes > Apr 1 01:29:29 nebula4 lvm[6759]: Got 2 replies, expecting: 4 > Apr 1 01:29:29 nebula4 lvm[6759]: 1084811140 got message from nodeid 1084811137 for 1084811140. len 18 > Apr 1 01:29:29 nebula4 lvm[6759]: Reply from node 40a8e781: 0 bytes > Apr 1 01:29:29 nebula4 lvm[6759]: Got 3 replies, expecting: 4 > Apr 1 01:29:29 nebula4 lvm[6759]: 1084811140 got message from nodeid 1084811137 for 1084811139. len 18 > Apr 1 01:29:29 nebula4 lvm[6759]: 1084811140 got message from nodeid 1084811140 for 1084811139. len 18 > Apr 1 01:29:29 nebula4 dlm_controld[6737]: 50759 fence status 1084811138 receive 1 from 1084811137 walltime 1427844569 local 50759 > Apr 1 01:29:29 nebula4 kernel: [50799.162381] dlm: closing connection to node 1084811138 > Apr 1 01:29:29 nebula4 dlm_controld[6737]: 50759 fence status 1084811138 receive 1 from 1084811139 walltime 1427844569 local 50759 > Apr 1 01:29:29 nebula4 dlm_controld[6737]: 50759 fence request 1084811138 pid 44527 nodedown time 1427844569 fence_all dlm_stonith > Apr 1 01:29:29 nebula4 dlm_controld[6737]: 50759 fence result 1084811138 pid 44527 result 1 exit status > Apr 1 01:29:29 nebula4 dlm_controld[6737]: 50759 fence status 1084811138 receive 1 from 1084811140 walltime 1427844569 local 50759 > Apr 1 01:29:29 nebula4 dlm_controld[6737]: 50759 fence request 1084811138 no actor > Apr 1 01:29:30 nebula4 pengine[6489]: warning: pe_fence_node: Node nebula2 will be fenced because the node is no longer part of the cluster > Apr 1 01:29:30 nebula4 pengine[6489]: warning: determine_online_status: Node nebula2 is unclean > Apr 1 01:29:30 nebula4 pengine[6489]: warning: unpack_rsc_op: Processing failed op start for stonith-nebula4-IPMILAN on nebula3: unknown error (1) > Apr 1 01:29:30 nebula4 pengine[6489]: warning: unpack_rsc_op: Processing failed op start for stonith-nebula4-IPMILAN on nebula1: unknown error (1) > Apr 1 01:29:30 nebula4 pengine[6489]: warning: unpack_rsc_op: Processing failed op start for stonith-nebula4-IPMILAN on nebula2: unknown error (1) > Apr 1 01:29:30 nebula4 pengine[6489]: warning: common_apply_stickiness: Forcing stonith-nebula4-IPMILAN away from nebula1 after 1000000 failures (max=1000000) > Apr 1 01:29:30 nebula4 pengine[6489]: warning: common_apply_stickiness: Forcing stonith-nebula4-IPMILAN away from nebula2 after 1000000 failures (max=1000000) > Apr 1 01:29:30 nebula4 pengine[6489]: warning: common_apply_stickiness: Forcing stonith-nebula4-IPMILAN away from nebula3 after 1000000 failures (max=1000000) > Apr 1 01:29:30 nebula4 pengine[6489]: warning: custom_action: Action p_dlm:3_stop_0 on nebula2 is unrunnable (offline) > Apr 1 01:29:30 nebula4 pengine[6489]: warning: custom_action: Action p_dlm:3_stop_0 on nebula2 is unrunnable (offline) > Apr 1 01:29:30 nebula4 pengine[6489]: warning: custom_action: Action p_clvm:3_stop_0 on nebula2 is unrunnable (offline) > Apr 1 01:29:30 nebula4 pengine[6489]: warning: custom_action: Action p_clvm:3_stop_0 on nebula2 is unrunnable (offline) > Apr 1 01:29:30 nebula4 pengine[6489]: warning: custom_action: Action p_vg_one:3_stop_0 on nebula2 is unrunnable (offline) > Apr 1 01:29:30 nebula4 pengine[6489]: warning: custom_action: Action p_vg_one:3_stop_0 on nebula2 is unrunnable (offline) > Apr 1 01:29:30 nebula4 pengine[6489]: warning: custom_action: Action p_fs_one-datastores:3_stop_0 on nebula2 is unrunnable (offline) > Apr 1 01:29:30 nebula4 pengine[6489]: warning: custom_action: Action p_fs_one-datastores:3_stop_0 on nebula2 
is unrunnable (offline) > Apr 1 01:29:30 nebula4 pengine[6489]: warning: custom_action: Action stonith-nebula1-IPMILAN_stop_0 on nebula2 is unrunnable (offline) > Apr 1 01:29:30 nebula4 pengine[6489]: warning: stage6: Scheduling Node nebula2 for STONITH > Apr 1 01:29:30 nebula4 pengine[6489]: notice: LogActions: Stop p_dlm:3#011(nebula2) > Apr 1 01:29:30 nebula4 pengine[6489]: notice: LogActions: Stop p_clvm:3#011(nebula2) > Apr 1 01:29:30 nebula4 pengine[6489]: notice: LogActions: Stop p_vg_one:3#011(nebula2) > Apr 1 01:29:30 nebula4 pengine[6489]: notice: LogActions: Stop p_fs_one-datastores:3#011(nebula2) > Apr 1 01:29:30 nebula4 pengine[6489]: notice: LogActions: Move stonith-nebula1-IPMILAN#011(Started nebula2 -> nebula3) > Apr 1 01:29:30 nebula4 pengine[6489]: warning: process_pe_message: Calculated Transition 101: /var/lib/pacemaker/pengine/pe-warn-22.bz2 > Apr 1 01:29:30 nebula4 crmd[6490]: notice: te_fence_node: Executing reboot fencing operation (98) on nebula2 (timeout=30000) > Apr 1 01:29:30 nebula4 stonith-ng[6486]: notice: handle_request: Client crmd.6490.2707e557 wants to fence (reboot) 'nebula2' with device '(any)' > Apr 1 01:29:30 nebula4 stonith-ng[6486]: notice: initiate_remote_stonith_op: Initiating remote operation reboot for nebula2: 39eaf3a2-d7e0-417d-8a01-d2f373973d6b (0) > Apr 1 01:29:30 nebula4 stonith-ng[6486]: notice: can_fence_host_with_device: stonith-nebula1-IPMILAN can not fence nebula2: static-list > Apr 1 01:29:30 nebula4 stonith-ng[6486]: notice: can_fence_host_with_device: stonith-nebula2-IPMILAN can fence nebula2: static-list > Apr 1 01:29:30 nebula4 stonith-ng[6486]: notice: can_fence_host_with_device: stonith-one-frontend can not fence nebula2: static-list > Apr 1 01:29:30 nebula4 stonith-ng[6486]: notice: can_fence_host_with_device: stonith-nebula3-IPMILAN can not fence nebula2: static-list > Apr 1 01:29:32 nebula4 stonith-ng[6486]: notice: remote_op_done: Operation reboot of nebula2 by nebula3 for crmd.6490 at nebula4.39eaf3a2: OK > Apr 1 01:29:32 nebula4 crmd[6490]: notice: tengine_stonith_callback: Stonith operation 2/98:101:0:28913388-04df-49cb-9927-362b21a74014: OK (0) > Apr 1 01:29:32 nebula4 crmd[6490]: notice: tengine_stonith_notify: Peer nebula2 was terminated (reboot) by nebula3 for nebula4: OK (ref=39eaf3a2-d7e0-417d-8a01-d2f373973d6b) by client crmd.6490 > Apr 1 01:29:32 nebula4 crmd[6490]: notice: te_rsc_command: Initiating action 91: start stonith-nebula1-IPMILAN_start_0 on nebula3 > Apr 1 01:29:33 nebula4 crmd[6490]: notice: run_graph: Transition 101 (Complete=13, Pending=0, Fired=0, Skipped=1, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-warn-22.bz2): Stopped > Apr 1 01:29:33 nebula4 pengine[6489]: warning: unpack_rsc_op: Processing failed op start for stonith-nebula4-IPMILAN on nebula3: unknown error (1) > Apr 1 01:29:33 nebula4 pengine[6489]: warning: unpack_rsc_op: Processing failed op start for stonith-nebula4-IPMILAN on nebula1: unknown error (1) > Apr 1 01:29:33 nebula4 pengine[6489]: warning: common_apply_stickiness: Forcing stonith-nebula4-IPMILAN away from nebula1 after 1000000 failures (max=1000000) > Apr 1 01:29:33 nebula4 pengine[6489]: warning: common_apply_stickiness: Forcing stonith-nebula4-IPMILAN away from nebula3 after 1000000 failures (max=1000000) > Apr 1 01:29:33 nebula4 pengine[6489]: notice: process_pe_message: Calculated Transition 102: /var/lib/pacemaker/pengine/pe-input-129.bz2 > Apr 1 01:29:33 nebula4 crmd[6490]: notice: te_rsc_command: Initiating action 88: monitor 
stonith-nebula1-IPMILAN_monitor_3600000 on nebula3 > Apr 1 01:29:34 nebula4 crmd[6490]: notice: run_graph: Transition 102 (Complete=1, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-129.bz2): Complete > Apr 1 01:29:34 nebula4 crmd[6490]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ] > Apr 1 01:30:01 nebula4 CRON[44640]: (root) CMD (if [ -x /etc/munin/plugins/apt_all ]; then munin-run apt_all update 7200 12 >/dev/null; elif [ -x /etc/munin/plugins/apt ]; then munin-run apt update 7200 12 >/dev/null; fi) > Apr 1 01:30:25 nebula4 dlm_controld[6737]: 50815 datastores wait for fencing > Apr 1 01:30:25 nebula4 dlm_controld[6737]: 50815 clvmd wait for fencing > Apr 1 01:30:29 nebula4 lvm[6759]: Request timed-out (send: 1427844563, now: 1427844629) > Apr 1 01:30:29 nebula4 lvm[6759]: Request timed-out. padding > Apr 1 01:30:29 nebula4 lvm[6759]: down_callback. node 1084811137, state = 3 > Apr 1 01:30:29 nebula4 lvm[6759]: Checking for a reply from 40a8e781 > Apr 1 01:30:29 nebula4 lvm[6759]: down_callback. node 1084811139, state = 3 > Apr 1 01:30:29 nebula4 lvm[6759]: Checking for a reply from 40a8e783 > Apr 1 01:30:29 nebula4 lvm[6759]: down_callback. node 1084811138, state = 1 > Apr 1 01:30:29 nebula4 lvm[6759]: down_callback. node 1084811140, state = 3 > Apr 1 01:30:29 nebula4 lvm[6759]: Checking for a reply from 40a8e784 > Apr 1 01:30:29 nebula4 lvm[6759]: Got post command condition... > Apr 1 01:30:29 nebula4 lvm[6759]: Waiting for next pre command > Apr 1 01:30:29 nebula4 lvm[6759]: read on PIPE 12: 4 bytes: status: 0 > Apr 1 01:30:29 nebula4 lvm[6759]: background routine status was 0, sock_client=0x218eab0 > Apr 1 01:30:29 nebula4 lvm[6759]: Send local reply > Apr 1 01:30:29 nebula4 lvm[6759]: Read on local socket 5, len = 30 > Apr 1 01:30:29 nebula4 lvm[6759]: Got pre command condition... 
> Apr 1 01:30:29 nebula4 lvm[6759]: doing PRE command LOCK_VG 'V_vg-one-0' at 6 (client=0x218eab0) > Apr 1 01:30:29 nebula4 lvm[6759]: unlock_resource: V_vg-one-0 lockid: 1 > Apr 1 01:40:01 nebula4 CRON[47640]: (root) CMD (if [ -x /etc/munin/plugins/apt_all ]; then munin-run apt_all update 7200 12 >/dev/null; elif [ -x /etc/munin/plugins/apt ]; then munin-run apt update 7200 12 >/dev/null; fi) > Apr 1 01:44:34 nebula4 crmd[6490]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ] > Apr 1 01:44:34 nebula4 pengine[6489]: warning: unpack_rsc_op: Processing failed op start for stonith-nebula4-IPMILAN on nebula3: unknown error (1) > Apr 1 01:44:34 nebula4 pengine[6489]: warning: unpack_rsc_op: Processing failed op start for stonith-nebula4-IPMILAN on nebula1: unknown error (1) > Apr 1 01:44:34 nebula4 pengine[6489]: warning: common_apply_stickiness: Forcing stonith-nebula4-IPMILAN away from nebula1 after 1000000 failures (max=1000000) > Apr 1 01:44:34 nebula4 pengine[6489]: warning: common_apply_stickiness: Forcing stonith-nebula4-IPMILAN away from nebula3 after 1000000 failures (max=1000000) > Apr 1 01:44:34 nebula4 pengine[6489]: notice: process_pe_message: Calculated Transition 103: /var/lib/pacemaker/pengine/pe-input-130.bz2 > Apr 1 01:44:34 nebula4 crmd[6490]: notice: run_graph: Transition 103 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-130.bz2): Complete > Apr 1 01:44:34 nebula4 crmd[6490]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ] > Apr 1 01:45:01 nebula4 CRON[49089]: (root) CMD (if [ -x /etc/munin/plugins/apt_all ]; then munin-run apt_all update 7200 12 >/dev/null; elif [ -x /etc/munin/plugins/apt ]; then munin-run apt update 7200 12 >/dev/null; fi) > Apr 1 01:46:01 nebula4 CRON[570]: (root) CMD (if test -x /usr/sbin/apticron; then /usr/sbin/apticron --cron; else true; fi) > Apr 1 01:49:20 nebula4 lvm[6759]: Got new connection on fd 17 > Apr 1 01:49:20 nebula4 lvm[6759]: Read on local socket 17, len = 30 > Apr 1 01:49:20 nebula4 lvm[6759]: creating pipe, [18, 19] > Apr 1 01:49:20 nebula4 lvm[6759]: Creating pre&post thread > Apr 1 01:49:20 nebula4 lvm[6759]: Created pre&post thread, state = 0 > Apr 1 01:49:20 nebula4 lvm[6759]: in sub thread: client = 0x218f1f0 > Apr 1 01:49:20 nebula4 lvm[6759]: doing PRE command LOCK_VG 'V_vg-one-0' at 1 (client=0x218f1f0) > Apr 1 01:49:20 nebula4 lvm[6759]: lock_resource 'V_vg-one-0', flags=0, mode=3 > _______________________________________________ > Users mailing list: Users at clusterlabs.org > http://clusterlabs.org/mailman/listinfo/users > > Project Home: http://www.clusterlabs.org > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf > Bugs: http://bugs.clusterlabs.org > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster From thorvald.hallvardsson at gmail.com Thu Apr 23 10:11:36 2015 From: thorvald.hallvardsson at gmail.com (Thorvald Hallvardsson) Date: Thu, 23 Apr 2015 11:11:36 +0100 Subject: [Linux-cluster] GFS2 over NFS4 Message-ID: Hi guys, I need some help and answers related to share GFS2 file system over NFS. I have read the RH documentation but still some things are a bit unclear to me. First of all I need to build a POC for the shared storage cluster which initially will contain 3 nodes in the storage cluster. 
This is all going to run as a VM environment on Hyper-V. Generally the idea is to share virtual VHDX across 3 nodes, put LVM and GFS2 on top of it and then share it via NFS to the clients. I have got the initial cluster built on Centos 7 using pacemaker. I generally followed RH docs to build it so I ended up with the simple GFS2 cluster and pacemaker managing fencing and floating VIP resource. Now I'm wondering about the NFS. RedHat documentation is a bit conflicting or rather unclear in some places and I found quite few manuals on the internet about similar configuration and generally some of them suggest to mount the NFS share on the clients with nolock option RH docs mention local flock and I got confused about what supposed to be where. Of course I don't know if my understanding is correct but the reason to "disable" NFS locking is because GFS2 is already doing it anyway via DLM so there is no need for NFS to do same thing what eventually mean that I will have some sort of double locking mechanism in place. So first question is where I suppose to setup locks or rather no locks and how the export should look like ? Second thing is I was thinking about going a step forward and use NFS4 for the exports. However from what I have read about NFS4 it does locking by default and there is no way to disable them. Does that mean NFS4 is not suitable in this case at all ? That's all for now. I appreciate your help. Thank you. TH -------------- next part -------------- An HTML attachment was scrubbed... URL: From fdinitto at redhat.com Thu Apr 23 11:51:36 2015 From: fdinitto at redhat.com (Fabio M. Di Nitto) Date: Thu, 23 Apr 2015 13:51:36 +0200 Subject: [Linux-cluster] GFS2 over NFS4 In-Reply-To: References: Message-ID: <5538DCC8.4000207@redhat.com> On 04/23/2015 12:11 PM, Thorvald Hallvardsson wrote: > Hi guys, > > I need some help and answers related to share GFS2 file system over NFS. > I have read the RH documentation but still some things are a bit unclear > to me. > > First of all I need to build a POC for the shared storage cluster which > initially will contain 3 nodes in the storage cluster. This is all going > to run as a VM environment on Hyper-V. Generally the idea is to share > virtual VHDX across 3 nodes, put LVM and GFS2 on top of it and then > share it via NFS to the clients. I have got the initial cluster built on > Centos 7 using pacemaker. I generally followed RH docs to build it so I > ended up with the simple GFS2 cluster and pacemaker managing fencing and > floating VIP resource. Interesting, what fencing solution did you use? Fabio > > Now I'm wondering about the NFS. RedHat documentation is a bit > conflicting or rather unclear in some places and I found quite few > manuals on the internet about similar configuration and generally some > of them suggest to mount the NFS share on the clients with nolock option > RH docs mention local flock and I got confused about what supposed to be > where. Of course I don't know if my understanding is correct but the > reason to "disable" NFS locking is because GFS2 is already doing it > anyway via DLM so there is no need for NFS to do same thing what > eventually mean that I will have some sort of double locking mechanism > in place. So first question is where I suppose to setup locks or rather > no locks and how the export should look like ? > > Second thing is I was thinking about going a step forward and use NFS4 > for the exports. 

From mij at irwan.name Thu Apr 23 13:05:57 2015
From: mij at irwan.name (Mohd Irwan Jamaluddin)
Date: Thu, 23 Apr 2015 21:05:57 +0800
Subject: [Linux-cluster] GFS2 over NFS4
In-Reply-To:
References:
Message-ID:

On Thu, Apr 23, 2015 at 6:11 PM, Thorvald Hallvardsson <
thorvald.hallvardsson at gmail.com> wrote:

> Hi guys,
>
> I need some help and answers related to share GFS2 file system over NFS. I
> have read the RH documentation but still some things are a bit unclear to
> me.
>
> First of all I need to build a POC for the shared storage cluster which
> initially will contain 3 nodes in the storage cluster. This is all going to
> run as a VM environment on Hyper-V. Generally the idea is to share virtual
> VHDX across 3 nodes, put LVM and GFS2 on top of it and then share it via
> NFS to the clients. I have got the initial cluster built on Centos 7 using
> pacemaker. I generally followed RH docs to build it so I ended up with the
> simple GFS2 cluster and pacemaker managing fencing and floating VIP
> resource.
>
> Now I'm wondering about the NFS. RedHat documentation is a bit conflicting
> or rather unclear in some places and I found quite few manuals on the
> internet about similar configuration and generally some of them suggest to
> mount the NFS share on the clients with nolock option RH docs mention local
> flock and I got confused about what supposed to be where. Of course I don't
> know if my understanding is correct but the reason to
> "disable" NFS locking
> is because GFS2 is already doing it anyway via DLM so there is no need for
> NFS to do same thing what eventually mean that I will have some sort of
> double locking mechanism in place. So first question is where I suppose to
> setup locks or rather no locks and how the export should look like ?
>
> Second thing is I was thinking about going a step forward and use NFS4 for
> the exports. However from what I have read about NFS4 it does locking by
> default and there is no way to disable them. Does that mean NFS4 is not
> suitable in this case at all ?
>
> That's all for now.
>
> I appreciate your help.
>
>
This is the latest KB regarding combination of GFS + NFS that I know of,

Does Red Hat recommend exporting GFS or GFS2 with NFS or Samba in a RHEL
Resilient Storage cluster, and how should I configure it if I do?
https://access.redhat.com/solutions/20327
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From thorvald.hallvardsson at gmail.com Thu Apr 23 14:16:46 2015
From: thorvald.hallvardsson at gmail.com (Thorvald Hallvardsson)
Date: Thu, 23 Apr 2015 15:16:46 +0100
Subject: [Linux-cluster] GFS2 over NFS4
In-Reply-To:
References:
Message-ID:

Hi guys,

@Fabio I have just realised that I have no fencing device at all, as
STONITH is set to false, although some of my resources are set to fence
on failure :/. There is really no choice for Hyper-V unless I compile my
own version of libvirt :(.

@Mohd that is what I am actually trying to use. I found out that
localflocks needs to be used to mount GFS2 on the exporting nodes, and my
cluster is basically configured to meet all the requirements in that
document.

So, to be honest, the overall idea is a bit complex. I'm going to have
multiple nodes with a shared VHDX mounted on each node in the cluster.
However, each share will be allocated to a separate VIP, and each node
will export different resources. The resources are going to be linked to
the IPs, so all nodes in the cluster will be utilised, and at the same
time each node in the cluster will be able to take over all the
resources.

Maybe someone has different ideas?

Regards,
TH

On 23 April 2015 at 14:05, Mohd Irwan Jamaluddin wrote:

> On Thu, Apr 23, 2015 at 6:11 PM, Thorvald Hallvardsson <
> thorvald.hallvardsson at gmail.com> wrote:
>
>> Hi guys,
>>
>> I need some help and answers related to share GFS2 file system over NFS.
>> I have read the RH documentation but still some things are a bit unclear to
>> me.
>>
>> First of all I need to build a POC for the shared storage cluster which
>> initially will contain 3 nodes in the storage cluster. This is all going to
>> run as a VM environment on Hyper-V. Generally the idea is to share virtual
>> VHDX across 3 nodes, put LVM and GFS2 on top of it and then share it via
>> NFS to the clients. I have got the initial cluster built on Centos 7 using
>> pacemaker. I generally followed RH docs to build it so I ended up with the
>> simple GFS2 cluster and pacemaker managing fencing and floating VIP
>> resource.
>>
>> Now I'm wondering about the NFS. RedHat documentation is a bit
>> conflicting or rather unclear in some places and I found quite few manuals
>> on the internet about similar configuration and generally some of them
>> suggest to mount the NFS share on the clients with nolock option RH docs
>> mention local flock and I got confused about what supposed to be where. Of
>> course I don't know if my understanding is correct but the reason to
>> "disable" NFS locking is because GFS2 is already doing it anyway via DLM so
>> there is no need for NFS to do same thing what eventually mean that I will
>> have some sort of double locking mechanism in place. So first question is
>> where I suppose to setup locks or rather no locks and how the export should
>> look like ?
>>
>> Second thing is I was thinking about going a step forward and use NFS4
>> for the exports. However from what I have read about NFS4 it does locking
>> by default and there is no way to disable them. Does that mean NFS4 is not
>> suitable in this case at all ?
>>
>> That's all for now.
>>
>> I appreciate your help.
>>
>>
> This is the latest KB regarding combination of GFS + NFS that I know of,
>
> Does Red Hat recommend exporting GFS or GFS2 with NFS or Samba in a RHEL
> Resilient Storage cluster, and how should I configure it if I do?
> https://access.redhat.com/solutions/20327
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
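
The layout described above, one export with its own floating IP per node
and every node able to take over all of them, is commonly modelled in
Pacemaker as one group per export. A minimal sketch with pcs follows; the
resource names, addresses, directories and fsid values are illustrative
assumptions, and it presumes the GFS2 mounts and the NFS server daemons
are already managed by the cluster:

    # One floating IP and one export, grouped so they move together
    # (placeholder values throughout):
    pcs resource create share1-vip ocf:heartbeat:IPaddr2 \
        ip=192.168.100.101 cidr_netmask=24
    pcs resource create share1-export ocf:heartbeat:exportfs \
        directory=/export/share1 clientspec=192.168.100.0/24 \
        options=rw,sync fsid=101
    pcs resource group add share1-grp share1-export share1-vip

    # Repeat per share, then spread the groups with location preferences:
    pcs constraint location share1-grp prefers node1=100
    pcs constraint location share2-grp prefers node2=100
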

From jashokda at cisco.com Fri Apr 24 12:12:05 2015
From: jashokda at cisco.com (Jatin Davey)
Date: Fri, 24 Apr 2015 17:42:05 +0530
Subject: [Linux-cluster] Working of a two-node cluster
Message-ID: <553A3315.2050508@cisco.com>

Hi

I am using a two-node cluster on RHEL 6.5. I have a very fundamental
question.

For the two-node cluster to work, is it mandatory that both nodes are
"online" and communicating with each other?

What I can see is that if there is a communication failure between them,
then either both nodes are fenced or the cluster gets into a "stopped"
state (as seen in the output of the clustat command).

Apologies if my questions are naive. I am just starting to work with the
RHEL cluster add-on.

Thanks
Jatin
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
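
For background on the two-node special case: a RHEL 6 cluster with only
two nodes is normally told explicitly that it may operate without a
quorum majority, using the cman two_node setting. The fragment below is a
generic illustration, not Jatin's configuration; the cluster and node
names are placeholders, and working fence devices still have to be
defined for each node so that a node which loses contact with its peer
gets fenced instead of services simply blocking:

    <cluster name="example2node" config_version="1">
      <!-- Let the cluster keep running with only one of the two nodes;
           expected_votes must be 1 when two_node is enabled. -->
      <cman two_node="1" expected_votes="1"/>
      <clusternodes>
        <clusternode name="node1.example.com" nodeid="1"/>
        <clusternode name="node2.example.com" nodeid="2"/>
      </clusternodes>
      <!-- fencedevices and per-node fence methods omitted from this sketch -->
    </cluster>
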

From emi2fast at gmail.com Fri Apr 24 12:31:02 2015
From: emi2fast at gmail.com (emmanuel segura)
Date: Fri, 24 Apr 2015 14:31:02 +0200
Subject: [Linux-cluster] Working of a two-node cluster
In-Reply-To: <553A3315.2050508@cisco.com>
References: <553A3315.2050508@cisco.com>
Message-ID:

Please share your cluster config; that way someone may be able to help
you.

2015-04-24 14:12 GMT+02:00 Jatin Davey :
> Hi
>
> I am using a two node cluster using RHEL 6.5. I have a very fundamental
> question.
>
> For the two node cluster to work , Is it mandatory that both the nodes are
> "online" and communicating with each other ?
>
> What i can see is that if there is communication failure between them then
> either both the nodes are fenced or the cluster gets into a "stopped" state
> (Seen from output of clustat command).
>
> Apologies if my questions are naive. I am just starting to work with RHEL
> cluster add-on.
>
> Thanks
> Jatin
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

--
this is my life and I live it for as long as God wills

From jashokda at cisco.com Fri Apr 24 12:53:16 2015
From: jashokda at cisco.com (Jatin Davey)
Date: Fri, 24 Apr 2015 18:23:16 +0530
Subject: [Linux-cluster] Working of a two-node cluster
In-Reply-To:
References: <553A3315.2050508@cisco.com>
Message-ID: <553A3CBC.2050909@cisco.com>

Here is my cluster.conf file
************************