From jagauthier at gmail.com Tue May 8 11:18:17 2018
From: jagauthier at gmail.com (Jason Gauthier)
Date: Tue, 8 May 2018 07:18:17 -0400
Subject: [Linux-cluster] DLM won't (stay) running
Message-ID: 

Greetings,

I'm working on a setup of a two-node cluster with shared storage.
I've been able to see the storage on both nodes and have appropriate
configuration for fencing the block device.

The next step was getting DLM and GFS2 in a clone group to mount the
FS on both nodes.  This is where I am running into trouble.

As far as the OS goes, it's Debian.  I'm using pacemaker, corosync,
and crm for cluster management.

At the moment, I've removed the gfs2 parts just to try and get dlm working.

My current config looks like this:

node 1084772368: alpha
node 1084772369: beta
primitive p_dlm_controld ocf:pacemaker:controld \
        op monitor interval=60 timeout=60 \
        meta target-role=Started args=-K
primitive p_gfs_controld ocf:pacemaker:controld \
        params daemon=gfs_controld \
        meta target-role=Started
primitive stonith_sbd stonith:external/sbd \
        params pcmk_delay_max=30 sbd_device="/dev/sdb1"
group g_gfs2 p_dlm_controld p_gfs_controld
clone cl_gfs2 g_gfs2 \
        meta interleave=true target-role=Started
property cib-bootstrap-options: \
        have-watchdog=false \
        dc-version=1.1.16-94ff4df \
        cluster-infrastructure=corosync \
        cluster-name=zeta \
        last-lrm-refresh=1525523370 \
        stonith-enabled=true \
        stonith-timeout=20s

When I bring the resources up, I get a quick blip in my logs.
May 8 07:13:58 beta dlm_controld[9425]: 253556 dlm_controld 4.0.7 started
May 8 07:14:00 beta kernel: [253558.641658] dlm: closing connection
to node 1084772369
May 8 07:14:00 beta kernel: [253558.641764] dlm: closing connection
to node 1084772368

This is the same messaging I see when I run dlm manually and then stop
it.  My challenge here is that I cannot find out what dlm is doing.
I've tried adding -K to /etc/default/dlm, but I don't think that file
is being respected.  I would like to figure out how to increase the
verbose output of dlm_controld so I can see why it won't stay running
when it's launched through the cluster.  I haven't been able to
figure out how to pass arguments directly to the daemon in the
primitive config, if it's even possible.  Otherwise, I would try to
pass -K there.

Thanks!

Jason


From teigland at redhat.com Tue May 8 14:50:21 2018
From: teigland at redhat.com (David Teigland)
Date: Tue, 8 May 2018 09:50:21 -0500
Subject: [Linux-cluster] DLM won't (stay) running
In-Reply-To: 
References: 
Message-ID: <20180508145021.GB3799@redhat.com>

On Tue, May 08, 2018 at 07:18:17AM -0400, Jason Gauthier wrote:
> node 1084772368: alpha
> node 1084772369: beta
> primitive p_dlm_controld ocf:pacemaker:controld \
>         op monitor interval=60 timeout=60 \
>         meta target-role=Started args=-K
> primitive p_gfs_controld ocf:pacemaker:controld \
>         params daemon=gfs_controld \
>         meta target-role=Started
> primitive stonith_sbd stonith:external/sbd \
>         params pcmk_delay_max=30 sbd_device="/dev/sdb1"
> group g_gfs2 p_dlm_controld p_gfs_controld
> clone cl_gfs2 g_gfs2 \
>         meta interleave=true target-role=Started
> property cib-bootstrap-options: \
>         have-watchdog=false \
>         dc-version=1.1.16-94ff4df \
>         cluster-infrastructure=corosync \
>         cluster-name=zeta \
>         last-lrm-refresh=1525523370 \
>         stonith-enabled=true \
>         stonith-timeout=20s
>
> When I bring the resources up, I get a quick blip in my logs.
> May 8 07:13:58 beta dlm_controld[9425]: 253556 dlm_controld 4.0.7 started
> May 8 07:14:00 beta kernel: [253558.641658] dlm: closing connection
> to node 1084772369
> May 8 07:14:00 beta kernel: [253558.641764] dlm: closing connection
> to node 1084772368

When you're starting the dlm through pacemaker, be sure that systemd is
not also starting it.  I don't think pacemaker is happy if dlm_controld
is already started.

> This is the same messaging I see when I run dlm manually and then stop
> it.  My challenge here is that I cannot find out what dlm is doing.
> I've tried adding -K to /etc/default/dlm, but I don't think that file
> is being respected.  I would like to figure out how to increase the
> verbose output of dlm_controld so I can see why it won't stay running
> when it's launched through the cluster.  I haven't been able to
> figure out how to pass arguments directly to the daemon in the
> primitive config, if it's even possible.  Otherwise, I would try to
> pass -K there.

In /etc/dlm/dlm.conf put

log_debug=1
debug_logfile=1

then you should see all the debug info in
/var/log/dlm_controld/dlm_controld.log


From jagauthier at gmail.com Tue May 8 17:44:18 2018
From: jagauthier at gmail.com (Jason Gauthier)
Date: Tue, 8 May 2018 13:44:18 -0400
Subject: [Linux-cluster] DLM won't (stay) running
In-Reply-To: <20180508145021.GB3799@redhat.com>
References: <20180508145021.GB3799@redhat.com>
Message-ID: 

On Tue, May 8, 2018 at 10:50 AM, David Teigland wrote:
> On Tue, May 08, 2018 at 07:18:17AM -0400, Jason Gauthier wrote:
>> node 1084772368: alpha
>> node 1084772369: beta
>> primitive p_dlm_controld ocf:pacemaker:controld \
>>         op monitor interval=60 timeout=60 \
>>         meta target-role=Started args=-K
>> primitive p_gfs_controld ocf:pacemaker:controld \
>>         params daemon=gfs_controld \
>>         meta target-role=Started
>> primitive stonith_sbd stonith:external/sbd \
>>         params pcmk_delay_max=30 sbd_device="/dev/sdb1"
>> group g_gfs2 p_dlm_controld p_gfs_controld
>> clone cl_gfs2 g_gfs2 \
>>         meta interleave=true target-role=Started
>> property cib-bootstrap-options: \
>>         have-watchdog=false \
>>         dc-version=1.1.16-94ff4df \
>>         cluster-infrastructure=corosync \
>>         cluster-name=zeta \
>>         last-lrm-refresh=1525523370 \
>>         stonith-enabled=true \
>>         stonith-timeout=20s
>>
>> When I bring the resources up, I get a quick blip in my logs.
>> May 8 07:13:58 beta dlm_controld[9425]: 253556 dlm_controld 4.0.7 started
>> May 8 07:14:00 beta kernel: [253558.641658] dlm: closing connection
>> to node 1084772369
>> May 8 07:14:00 beta kernel: [253558.641764] dlm: closing connection
>> to node 1084772368
>
> When you're starting the dlm through pacemaker, be sure that systemd is
> not also starting it.  I don't think pacemaker is happy if dlm_controld
> is already started.

Thanks David.  dlm is not enabled with systemd at all.

>> This is the same messaging I see when I run dlm manually and then stop
>> it.  My challenge here is that I cannot find out what dlm is doing.
>> I've tried adding -K to /etc/default/dlm, but I don't think that file
>> is being respected.  I would like to figure out how to increase the
>> verbose output of dlm_controld so I can see why it won't stay running
>> when it's launched through the cluster.  I haven't been able to
>> figure out how to pass arguments directly to the daemon in the
>> primitive config, if it's even possible.  Otherwise, I would try to
>> pass -K there.
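[Side note on the argument-passing question quoted above: pacemaker meta
attributes are consumed by the cluster itself and are not handed to the
resource agent, so "meta ... args=-K" never reaches dlm_controld.  If the
controld agent shipped with this pacemaker build exposes an "args"
resource parameter (not verified here, so treat this as a sketch), that
route would look roughly like:

primitive p_dlm_controld ocf:pacemaker:controld \
        params args="-K" \
        op monitor interval=60 timeout=60 \
        meta target-role=Started
]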
> In /etc/dlm/dlm.conf put
>
> log_debug=1
> debug_logfile=1
>
> then you should see all the debug info in
> /var/log/dlm_controld/dlm_controld.log

I made this change to both my nodes.. and tried to start the resource.
I just get the same two lines in messages, and a new log file for
dlm_controld.log does not appear.


From anprice at redhat.com Wed May 9 10:26:13 2018
From: anprice at redhat.com (Andrew Price)
Date: Wed, 9 May 2018 11:26:13 +0100
Subject: [Linux-cluster] DLM won't (stay) running
In-Reply-To: 
References: 
Message-ID: 

[linux-cluster@ isn't really used nowadays; CCing users at clusterlabs]

On 08/05/18 12:18, Jason Gauthier wrote:
> Greetings,
>
> I'm working on a setup of a two-node cluster with shared storage.
> I've been able to see the storage on both nodes and have appropriate
> configuration for fencing the block device.
>
> The next step was getting DLM and GFS2 in a clone group to mount the
> FS on both nodes.  This is where I am running into trouble.
>
> As far as the OS goes, it's Debian.  I'm using pacemaker, corosync,
> and crm for cluster management.

Is it safe to assume that you're using Debian Wheezy? (The need for
gfs_controld disappeared in the 3.3 kernel.) As wheezy goes end-of-life
at the end of the month I would suggest upgrading; you will likely find
the cluster tools more user-friendly and the components more stable.

Andy

> At the moment, I've removed the gfs2 parts just to try and get dlm
> working.
>
> My current config looks like this:
>
> node 1084772368: alpha
> node 1084772369: beta
> primitive p_dlm_controld ocf:pacemaker:controld \
>         op monitor interval=60 timeout=60 \
>         meta target-role=Started args=-K
> primitive p_gfs_controld ocf:pacemaker:controld \
>         params daemon=gfs_controld \
>         meta target-role=Started
> primitive stonith_sbd stonith:external/sbd \
>         params pcmk_delay_max=30 sbd_device="/dev/sdb1"
> group g_gfs2 p_dlm_controld p_gfs_controld
> clone cl_gfs2 g_gfs2 \
>         meta interleave=true target-role=Started
> property cib-bootstrap-options: \
>         have-watchdog=false \
>         dc-version=1.1.16-94ff4df \
>         cluster-infrastructure=corosync \
>         cluster-name=zeta \
>         last-lrm-refresh=1525523370 \
>         stonith-enabled=true \
>         stonith-timeout=20s
>
> When I bring the resources up, I get a quick blip in my logs.
> May 8 07:13:58 beta dlm_controld[9425]: 253556 dlm_controld 4.0.7 started
> May 8 07:14:00 beta kernel: [253558.641658] dlm: closing connection
> to node 1084772369
> May 8 07:14:00 beta kernel: [253558.641764] dlm: closing connection
> to node 1084772368
>
> This is the same messaging I see when I run dlm manually and then stop
> it.  My challenge here is that I cannot find out what dlm is doing.
> I've tried adding -K to /etc/default/dlm, but I don't think that file
> is being respected.  I would like to figure out how to increase the
> verbose output of dlm_controld so I can see why it won't stay running
> when it's launched through the cluster.  I haven't been able to
> figure out how to pass arguments directly to the daemon in the
> primitive config, if it's even possible.  Otherwise, I would try to
> pass -K there.
>
> Thanks!
>
> Jason


From jagauthier at gmail.com Wed May 9 10:51:03 2018
From: jagauthier at gmail.com (Jason Gauthier)
Date: Wed, 9 May 2018 06:51:03 -0400
Subject: [Linux-cluster] DLM won't (stay) running
In-Reply-To: 
References: 
Message-ID: 

On Wed, May 9, 2018 at 6:26 AM, Andrew Price wrote:
> [linux-cluster@ isn't really used nowadays; CCing users at clusterlabs]
>
> On 08/05/18 12:18, Jason Gauthier wrote:
>>
>> Greetings,
>>
>> I'm working on a setup of a two-node cluster with shared storage.
>> I've been able to see the storage on both nodes and have appropriate
>> configuration for fencing the block device.
>>
>> The next step was getting DLM and GFS2 in a clone group to mount the
>> FS on both nodes.  This is where I am running into trouble.
>>
>> As far as the OS goes, it's Debian.  I'm using pacemaker, corosync,
>> and crm for cluster management.
>
> Is it safe to assume that you're using Debian Wheezy? (The need for
> gfs_controld disappeared in the 3.3 kernel.) As wheezy goes end-of-life
> at the end of the month I would suggest upgrading; you will likely find
> the cluster tools more user-friendly and the components more stable.

I am using stretch, which was the challenge at first.  I couldn't find
any information about it.  Even a release as new as Jessie contains
gfs2_controld.  I could not figure out how to make it work.  But, yeah,
that is now removed.. because it works fine without it.

And the good news is: I messed around with this for quite some time
last night and finally got everything to come up reliably on both
nodes.  Even reboots, and simultaneous reboots.  So, I am pleased!
Time for the next part, which is building some VMs.

Thanks for the help!

>> At the moment, I've removed the gfs2 parts just to try and get dlm
>> working.
>>
>> My current config looks like this:
>>
>> node 1084772368: alpha
>> node 1084772369: beta
>> primitive p_dlm_controld ocf:pacemaker:controld \
>>         op monitor interval=60 timeout=60 \
>>         meta target-role=Started args=-K
>> primitive p_gfs_controld ocf:pacemaker:controld \
>>         params daemon=gfs_controld \
>>         meta target-role=Started
>> primitive stonith_sbd stonith:external/sbd \
>>         params pcmk_delay_max=30 sbd_device="/dev/sdb1"
>> group g_gfs2 p_dlm_controld p_gfs_controld
>> clone cl_gfs2 g_gfs2 \
>>         meta interleave=true target-role=Started
>> property cib-bootstrap-options: \
>>         have-watchdog=false \
>>         dc-version=1.1.16-94ff4df \
>>         cluster-infrastructure=corosync \
>>         cluster-name=zeta \
>>         last-lrm-refresh=1525523370 \
>>         stonith-enabled=true \
>>         stonith-timeout=20s
>>
>> When I bring the resources up, I get a quick blip in my logs.
>> May 8 07:13:58 beta dlm_controld[9425]: 253556 dlm_controld 4.0.7 started
>> May 8 07:14:00 beta kernel: [253558.641658] dlm: closing connection
>> to node 1084772369
>> May 8 07:14:00 beta kernel: [253558.641764] dlm: closing connection
>> to node 1084772368
>>
>> This is the same messaging I see when I run dlm manually and then stop
>> it.  My challenge here is that I cannot find out what dlm is doing.
>> I've tried adding -K to /etc/default/dlm, but I don't think that file
>> is being respected.  I would like to figure out how to increase the
>> verbose output of dlm_controld so I can see why it won't stay running
>> when it's launched through the cluster.  I haven't been able to
>> figure out how to pass arguments directly to the daemon in the
>> primitive config, if it's even possible.  Otherwise, I would try to
>> pass -K there.
>>
>> Thanks!
>>
>> Jason
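[Since the final working configuration never made it back into the
thread: below is a minimal sketch of the kind of dlm + GFS2 clone group
the discussion converges on.  The resource names, device path and mount
point are illustrative only, and gfs_controld is omitted because it is
not needed on stretch-era kernels:

# device, directory and resource names below are placeholders
primitive p_dlm_controld ocf:pacemaker:controld \
        op monitor interval=60 timeout=60
primitive p_fs_gfs2 ocf:heartbeat:Filesystem \
        params device="/dev/sdb2" directory="/mnt/gfs2" fstype=gfs2 \
        op monitor interval=20 timeout=40
group g_gfs2 p_dlm_controld p_fs_gfs2
clone cl_gfs2 g_gfs2 \
        meta interleave=true
]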
From anprice at redhat.com Thu May 24 10:53:52 2018
From: anprice at redhat.com (Andrew Price)
Date: Thu, 24 May 2018 11:53:52 +0100
Subject: [Linux-cluster] gfs2-utils 3.2.0 released
Message-ID: <5c40f090-f6f8-2310-d68f-6727cb082a6b@redhat.com>

Hi all,

I am happy to announce the 3.2.0 release of gfs2-utils.  This is an
important release adding support for new on-disk features introduced in
the 4.16 kernel.  In fact it is required when building against 4.16 and
later kernel headers due to poor assumptions made by earlier gfs2-utils
relating to structure size changes.  Building earlier gfs2-utils against
4.16 headers will result in test suite failures.  (Thanks to Valentin
Vidic for reporting these issues.)

This release adds basic support for new gfs2 on-disk features:

* Resource group header CRCs
* "Next resource group" pointers in resource groups
* Journal log header block CRCs
* Journal log header timestamp fields
* Statfs accounting fields in journal log headers

Future releases will build on this work to take advantage of these new
features, particularly for improving checking and performance.

Other notable changes:

* mkfs.gfs2 now scales down the journal size to make better use of small
  devices by default
* Better detection of bad device topology
* fsck.gfs2 no longer accepts conflicting -p, -n and -y options
* Saving of symlinks in gfs2_edit savemeta has been fixed
* Fixes for issues caught by static analysis and new compiler warnings
* New test cases in the testsuite
* Various minor code cleanups and improvements

See below for a complete list of changes.

The source tarball is available from:

https://releases.pagure.org/gfs2-utils/gfs2-utils-3.2.0.tar.gz

Please report bugs against the gfs2-utils component of Fedora rawhide:

https://bugzilla.redhat.com/enter_bug.cgi?product=Fedora&component=gfs2-utils&version=rawhide

Regards,
Andy

Changes since version 3.1.10:

Andrew Price (66):
      gfs2_grow: Disable rgrp alignment when dev topology is unsuitable
      mkfs.gfs2: Free unnecessary cached pages, disable readahead
      mkfs.gfs2: Fix resource group alignment issue
      libgfs2: Issue one write per rgrp when creating them
      libgfs2: Switch gfs2_dinode_out to use a char buffer
      libgfs2: Switch gfs2_log_header_out to use a char buffer
      gfs2-utils: Change instances of "gfs2_fsck" to "fsck.gfs2"
      gfs2_convert: Fix fgets return value warning
      fsck.gfs2: Fix snprintf truncation warning
      fsck.gfs2: Fix unchecked return value warning
      gfs2_grow: Fix unchecked ftruncate return value warning
      gfs2_grow: Remove unnecessary nesting in fix_rindex()
      gfs2_edit savemeta: Fix up saving of dinodes/symlinks
      gfs2_edit savemeta: Use size_t for saved structure lengths
      fsck.gfs2: Make -p, -n and -y conflicting options
      gfs2_edit: Print offsets of indirect pointers
      gfs2-utils configure: Check for rg_skip
      libgfs2: Add rgrp_skip support
      mkfs.gfs2: Pull place_journals() out of place_rgrps()
      mkfs.gfs2: Set the rg_skip field in new rgrps
      gfs2-utils configure: Check for rg_data0, rg_data and rg_bitbytes
      libgfs2: Add support for rg_data0, rg_data and rg_bitbytes
      mkfs.gfs2: Set the rg_data0, rg_data and rg_bitbytes fields
      libgfs2: Add support for rg_crc
      Add basic support for v2 log headers
      mkfs.gfs2: Scale down journal size for smaller devices
      gfs2-utils: Remove make-tarball.sh
      glocktop: Remove a non-existent flag from the usage string
      fsck.gfs2: Don't check lh_crc for older filesystems
      libgfs2: Remove unused lock* fields from gfs2_sbd
      libgfs2: Remove sb_addr from gfs2_sbd
      libgfs2: Plug an alignment hole in gfs2_sbd
      libgfs2: Plug an alignment hole in gfs2_buffer_head
      libgfs2: Plug an alignment hole in gfs2_inode
      libgfs2: Remove gfs2_meta_header_out_bh()
      libgfs2: Don't pass an extlen to block_map where not required
      libgfs2: Don't use a buffer_head in gfs2_meta_header_in
      libgfs2: Don't use buffer_heads in gfs2_sb_in
      libgfs2: Don't use buffer_heads in gfs2_rgrp_in
      libgfs2: Remove gfs2_rgrp_out_bh
      libgfs2: Don't use buffer_heads in gfs2_dinode_in
      libgfs2: Remove gfs2_dinode_out_bh
      libgfs2: Don't use buffer_heads in gfs2_leaf_{in,out}
      libgfs2: Don't use buffer_heads in gfs2_log_header_in
      libgfs2: Remove gfs2_log_header_out_bh
      libgfs2: Don't use buffer_heads in gfs2_log_descriptor_{in,out}
      libgfs2: Don't use buffer_heads in gfs2_quota_change_{in,out}
      libgfs2: Fix two unused variable warnings
      mkfs.gfs2: Silence an integer overflow warning
      libgfs2: Fix a thinko in write_journal()
      gfs2-utils tests: Add a fsck.gfs2 test for rebuilding journals
      gfs2_edit: Fix null pointer deref in dump_journal()
      libgfs2: Remove dead code from gfs2_rgrp_read()
      fsck.gfs2: Avoid int overflow in find_next_rgrp_dist
      gfs2_edit: Avoid potential int overflow in find_journal_block()
      gfs2_edit: Avoid a potential int overflow in dump_journal
      gfs2-utils: Avoid some more potential int overflows
      libgfs2: Fix a memory leak in lgfs2_build_jindex
      glocktop: Fix memory leak in show_inode
      gfs2l: Remove some dead code from the interpreter loop
      glocktop: Fix new -Wformat-overflow warnings
      mkfs.gfs2: Zero blocks in alignment gaps
      libgfs2: Use sizeof for 'reserved' fields in ondisk.c
      tests: Add rg test scripts to EXTRA_DIST
      gfs2-utils: Use cluster-devel as contact address
      gfs2-utils: Update translation template

Valentin Vidic (5):
      gfs2_trace: update for python3
      gfs2_lockcapture: update for python3
      Fix spelling errors spotted by lintian
      mkfs.gfs2: fix tests call to gfs_max_blocks
      gfs2-utils tests: Fix testsuite cleanup
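[A typical build-and-check sequence for the 3.2.0 tarball, assuming the
usual autotools layout of gfs2-utils; run ./autogen.sh first if the
tarball does not ship a generated configure script, and consult the
documentation in the tarball for the authoritative steps:

tar xzf gfs2-utils-3.2.0.tar.gz
cd gfs2-utils-3.2.0
./configure        # add --prefix=... if not installing to the default
make
make check         # should exercise the testsuite mentioned above
]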