From daniel.dehennin at baby-gnu.org Mon Feb 15 09:20:57 2016
From: daniel.dehennin at baby-gnu.org (Daniel Dehennin)
Date: Mon, 15 Feb 2016 10:20:57 +0100
Subject: [Linux-cluster] GFS2 filesystem consistency error
Message-ID: <87oabijpmu.fsf@hati.baby-gnu.org>

Hello,

We have been running into trouble on our GFS2 for several days (log attached):

- the FS ran without trouble for some time (since 2014-11-03)

- the FS was grown from 3 TB to 4 TB about 6 months ago

- it seems to happen only on one node, “nebula3”

- I ran an fsck when just fencing the node was not sufficient (2 crashes
  the same day)

The nodes run Ubuntu Trusty Tahr, up to date.

Do you have any idea?

Regards.
-- 
Daniel Dehennin
Récupérer ma clef GPG: gpg --recv-keys 0xCC1E9E5B7A6FE2DF
Fingerprint: 3E69 014E 5C23 50E8 9ED6 2AAD CC1E 9E5B 7A6F E2DF
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: gfs2.log
URL: 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 342 bytes
Desc: not available
URL: 

From swhiteho at redhat.com Mon Feb 15 09:45:01 2016
From: swhiteho at redhat.com (Steven Whitehouse)
Date: Mon, 15 Feb 2016 09:45:01 +0000
Subject: [Linux-cluster] GFS2 filesystem consistency error
In-Reply-To: <87oabijpmu.fsf@hati.baby-gnu.org>
References: <87oabijpmu.fsf@hati.baby-gnu.org>
Message-ID: <56C19E1D.80007@redhat.com>

Hi,

On 15/02/16 09:20, Daniel Dehennin wrote:
> Hello,
>
> We have been running into trouble on our GFS2 for several days (log attached):
>
> - the FS ran without trouble for some time (since 2014-11-03)
>
> - the FS was grown from 3 TB to 4 TB about 6 months ago
>
> - it seems to happen only on one node, “nebula3”
>
> - I ran an fsck when just fencing the node was not sufficient (2 crashes
>   the same day)
>
> The nodes run Ubuntu Trusty Tahr, up to date.
>
> Do you have any idea?
>
> Regards.
>
>
That looks like it is trying to free a block that is already marked as being free. fsck should fix that. What version of gfs2-utils are you using?

Steve.

From daniel.dehennin at baby-gnu.org Mon Feb 15 10:02:42 2016
From: daniel.dehennin at baby-gnu.org (Daniel Dehennin)
Date: Mon, 15 Feb 2016 11:02:42 +0100
Subject: [Linux-cluster] GFS2 filesystem consistency error
In-Reply-To: <56C19E1D.80007@redhat.com> (Steven Whitehouse's message of "Mon, 15 Feb 2016 09:45:01 +0000")
References: <87oabijpmu.fsf@hati.baby-gnu.org> <56C19E1D.80007@redhat.com>
Message-ID: <87k2m6jnp9.fsf@hati.baby-gnu.org>

Steven Whitehouse writes:

> That looks like it is trying to free a block that is already marked as
> being free. fsck should fix that. What version of gfs2-utils are you
> using?

Hello,

We are using 3.1.6-0ubuntu1. Running an fsck is quite expensive for us: 4 hours with the shared FS unusable.

I forgot to say that the FS stores qcow2 images, so there should not be concurrency on the file system except on some directories to create/access sub directories:

///

Only the should have concurrent write accesses, everything under is accessed only by one node at a time, except for monitoring which is read only.

So “looks like it is trying to free a block that is already marked as being free” seems strange.

Regards.
-- 
Daniel Dehennin
Récupérer ma clef GPG: gpg --recv-keys 0xCC1E9E5B7A6FE2DF
Fingerprint: 3E69 014E 5C23 50E8 9ED6 2AAD CC1E 9E5B 7A6F E2DF
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 342 bytes
Desc: not available
URL: 

From daniel.dehennin at baby-gnu.org Mon Feb 15 13:19:27 2016
From: daniel.dehennin at baby-gnu.org (Daniel Dehennin)
Date: Mon, 15 Feb 2016 14:19:27 +0100
Subject: [Linux-cluster] GFS2 filesystem consistency error
In-Reply-To: <87k2m6jnp9.fsf@hati.baby-gnu.org> (Daniel Dehennin's message of "Mon, 15 Feb 2016 11:02:42 +0100")
References: <87oabijpmu.fsf@hati.baby-gnu.org> <56C19E1D.80007@redhat.com> <87k2m6jnp9.fsf@hati.baby-gnu.org>
Message-ID: <87fuwujelc.fsf@hati.baby-gnu.org>

Daniel Dehennin writes:

[...]

> We are using 3.1.6-0ubuntu1. Running an fsck is quite expensive for us:
> 4 hours with the shared FS unusable.
>
> I forgot to say that the FS stores qcow2 images, so there should not be
> concurrency on the file system except on some directories to
> create/access sub directories:
>
> ///
>
> Only the should have concurrent write accesses, everything under is
> accessed only by one node at a time, except for monitoring which is
> read only.
>
> So “looks like it is trying to free a block that is already marked as
> being free” seems strange.
Now the kernel gave me a warning, if it could help: Feb 15 14:13:07 nebula3 kernel: [16423.261927] ------------[ cut here ]------------ Feb 15 14:13:07 nebula3 kernel: [16423.261943] WARNING: CPU: 8 PID: 4410 at /build/linux-OTIHGI/linux-3.13.0/mm/page_alloc.c:1604 get_page_from_freelist+0x924/0x930() Feb 15 14:13:07 nebula3 kernel: [16423.261945] Modules linked in: vhost_net vhost macvtap macvlan gfs2 dlm sctp configfs ip6table_filter ip6_tables iptable_filter ip_tables x_tables dm_round_robin openvswitch gre vxlan ip_tunnel nfsd auth_rpcgss nfs_acl nfs lockd sunrpc fscache bonding x86_pkg_temp_thermal intel_powerclamp ipmi_devintf gpio_ich coretemp dcdbas kvm_intel kvm crct10dif_pclmul crc32_pclmul aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd dm_multipath joydev scsi_dh mei_me shpchp mei sb_edac ipmi_si edac_core lpc_ich acpi_power_meter mac_hid wmi iTCO_wdt iTCO_vendor_support ses enclosure hid_generic qla2xxx usbhid hid ahci scsi_transport_fc libahci bnx2x tg3 megaraid_sas ptp scsi_tgt pps_core mdio libcrc32c Feb 15 14:13:07 nebula3 kernel: [16423.262017] CPU: 8 PID: 4410 Comm: rm Not tainted 3.13.0-78-generic #122-Ubuntu Feb 15 14:13:07 nebula3 kernel: [16423.262019] Hardware name: Dell Inc. 
PowerEdge M620/0T36VK, BIOS 2.2.7 01/21/2014 Feb 15 14:13:07 nebula3 kernel: [16423.262022] 0000000000000009 ffff882e5f9f7820 ffffffff81725768 0000000000000000 Feb 15 14:13:07 nebula3 kernel: [16423.262028] ffff882e5f9f7858 ffffffff810678bd 0000000000000004 00000000000035de Feb 15 14:13:07 nebula3 kernel: [16423.262033] 0000000000000001 ffff88187fffbf00 0000000000000000 ffff882e5f9f7868 Feb 15 14:13:07 nebula3 kernel: [16423.262037] Call Trace: Feb 15 14:13:07 nebula3 kernel: [16423.262046] [] dump_stack+0x45/0x56 Feb 15 14:13:07 nebula3 kernel: [16423.262052] [] warn_slowpath_common+0x7d/0xa0 Feb 15 14:13:07 nebula3 kernel: [16423.262056] [] warn_slowpath_null+0x1a/0x20 Feb 15 14:13:07 nebula3 kernel: [16423.262060] [] get_page_from_freelist+0x924/0x930 Feb 15 14:13:07 nebula3 kernel: [16423.262091] [] ? __switch_to+0x3fe/0x4d0 Feb 15 14:13:07 nebula3 kernel: [16423.262096] [] __alloc_pages_nodemask+0x184/0xb80 Feb 15 14:13:07 nebula3 kernel: [16423.262102] [] ? find_get_page+0x1e/0xa0 Feb 15 14:13:07 nebula3 kernel: [16423.262111] [] ? find_lock_page+0x30/0x70 Feb 15 14:13:07 nebula3 kernel: [16423.262115] [] ? find_or_create_page+0x34/0x90 Feb 15 14:13:07 nebula3 kernel: [16423.262125] [] ? radix_tree_lookup_slot+0xe/0x10 Feb 15 14:13:07 nebula3 kernel: [16423.262134] [] alloc_pages_current+0xa3/0x160 Feb 15 14:13:07 nebula3 kernel: [16423.262144] [] __get_free_pages+0xe/0x50 Feb 15 14:13:07 nebula3 kernel: [16423.262157] [] kmalloc_order_trace+0x2e/0xa0 Feb 15 14:13:07 nebula3 kernel: [16423.262170] [] ? wake_up_bit+0x25/0x30 Feb 15 14:13:07 nebula3 kernel: [16423.262177] [] __kmalloc+0x211/0x230 Feb 15 14:13:07 nebula3 kernel: [16423.262192] [] gfs2_rlist_alloc+0x26/0x70 [gfs2] Feb 15 14:13:07 nebula3 kernel: [16423.262199] [] recursive_scan+0x29d/0x6a0 [gfs2] Feb 15 14:13:07 nebula3 kernel: [16423.262206] [] recursive_scan+0x46c/0x6a0 [gfs2] Feb 15 14:13:07 nebula3 kernel: [16423.262217] [] ? 
gfs2_quota_hold+0x175/0x1f0 [gfs2] Feb 15 14:13:07 nebula3 kernel: [16423.262224] [] trunc_dealloc+0xfa/0x120 [gfs2] Feb 15 14:13:07 nebula3 kernel: [16423.262232] [] ? gfs2_glock_wait+0x3e/0x80 [gfs2] Feb 15 14:13:07 nebula3 kernel: [16423.262240] [] ? gfs2_glock_nq+0x280/0x430 [gfs2] Feb 15 14:13:07 nebula3 kernel: [16423.262247] [] gfs2_file_dealloc+0x10/0x20 [gfs2] Feb 15 14:13:07 nebula3 kernel: [16423.262257] [] gfs2_evict_inode+0x2b3/0x3e0 [gfs2] Feb 15 14:13:07 nebula3 kernel: [16423.262276] [] ? gfs2_evict_inode+0x113/0x3e0 [gfs2] Feb 15 14:13:07 nebula3 kernel: [16423.262286] [] evict+0xb0/0x1b0 Feb 15 14:13:07 nebula3 kernel: [16423.262290] [] iput+0xf5/0x180 Feb 15 14:13:07 nebula3 kernel: [16423.262296] [] do_unlinkat+0x18e/0x2b0 Feb 15 14:13:07 nebula3 kernel: [16423.262305] [] ? filp_close+0x56/0x70 Feb 15 14:13:07 nebula3 kernel: [16423.262310] [] SyS_unlinkat+0x1b/0x40 Feb 15 14:13:07 nebula3 kernel: [16423.262315] [] system_call_fastpath+0x1a/0x1f Feb 15 14:13:07 nebula3 kernel: [16423.262318] ---[ end trace 346ccba5c58117dc ]--- Regards. -- Daniel Dehennin R?cup?rer ma clef GPG: gpg --recv-keys 0xCC1E9E5B7A6FE2DF Fingerprint: 3E69 014E 5C23 50E8 9ED6 2AAD CC1E 9E5B 7A6F E2DF -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 342 bytes Desc: not available URL: From rpeterso at redhat.com Mon Feb 15 13:35:53 2016 From: rpeterso at redhat.com (Bob Peterson) Date: Mon, 15 Feb 2016 08:35:53 -0500 (EST) Subject: [Linux-cluster] GFS2 filesystem consistency error In-Reply-To: <87fuwujelc.fsf@hati.baby-gnu.org> References: <87oabijpmu.fsf@hati.baby-gnu.org> <56C19E1D.80007@redhat.com> <87k2m6jnp9.fsf@hati.baby-gnu.org> <87fuwujelc.fsf@hati.baby-gnu.org> Message-ID: <1777830627.22898504.1455543353023.JavaMail.zimbra@redhat.com> ----- Original Message ----- > Daniel Dehennin writes: > (snip) > Now the kernel gave me a warning, if it could help: > > Feb 15 14:13:07 nebula3 kernel: [16423.261927] ------------[ cut here > ]------------ > Feb 15 14:13:07 nebula3 kernel: [16423.261943] WARNING: CPU: 8 PID: 4410 at > /build/linux-OTIHGI/linux-3.13.0/mm/page_alloc.c:1604 > get_page_from_freelist+0x924/0x930() > Feb 15 14:13:07 nebula3 kernel: [16423.261945] Modules linked in: vhost_net > vhost macvtap macvlan gfs2 dlm sctp configfs ip6table_filter ip6_tables > iptable_filter ip_tables x_tables dm_round_robin openvswitch gre vxlan > ip_tunnel nfsd auth_rpcgss nfs_acl nfs lockd sunrpc fscache bonding > x86_pkg_temp_thermal intel_powerclamp ipmi_devintf gpio_ich coretemp dcdbas > kvm_intel kvm crct10dif_pclmul crc32_pclmul aesni_intel aes_x86_64 lrw > gf128mul glue_helper ablk_helper cryptd dm_multipath joydev scsi_dh mei_me > shpchp mei sb_edac ipmi_si edac_core lpc_ich acpi_power_meter mac_hid wmi > iTCO_wdt iTCO_vendor_support ses enclosure hid_generic qla2xxx usbhid hid > ahci scsi_transport_fc libahci bnx2x tg3 megaraid_sas ptp scsi_tgt pps_core > mdio libcrc32c > Feb 15 14:13:07 nebula3 kernel: [16423.262017] CPU: 8 PID: 4410 Comm: rm Not > tainted 3.13.0-78-generic #122-Ubuntu > Feb 15 14:13:07 nebula3 kernel: [16423.262019] Hardware name: Dell Inc. 
> PowerEdge M620/0T36VK, BIOS 2.2.7 01/21/2014 > Feb 15 14:13:07 nebula3 kernel: [16423.262022] 0000000000000009 > ffff882e5f9f7820 ffffffff81725768 0000000000000000 > Feb 15 14:13:07 nebula3 kernel: [16423.262028] ffff882e5f9f7858 > ffffffff810678bd 0000000000000004 00000000000035de > Feb 15 14:13:07 nebula3 kernel: [16423.262033] 0000000000000001 > ffff88187fffbf00 0000000000000000 ffff882e5f9f7868 > Feb 15 14:13:07 nebula3 kernel: [16423.262037] Call Trace: > Feb 15 14:13:07 nebula3 kernel: [16423.262046] [] > dump_stack+0x45/0x56 > Feb 15 14:13:07 nebula3 kernel: [16423.262052] [] > warn_slowpath_common+0x7d/0xa0 > Feb 15 14:13:07 nebula3 kernel: [16423.262056] [] > warn_slowpath_null+0x1a/0x20 > Feb 15 14:13:07 nebula3 kernel: [16423.262060] [] > get_page_from_freelist+0x924/0x930 > Feb 15 14:13:07 nebula3 kernel: [16423.262091] [] ? > __switch_to+0x3fe/0x4d0 > Feb 15 14:13:07 nebula3 kernel: [16423.262096] [] > __alloc_pages_nodemask+0x184/0xb80 > Feb 15 14:13:07 nebula3 kernel: [16423.262102] [] ? > find_get_page+0x1e/0xa0 > Feb 15 14:13:07 nebula3 kernel: [16423.262111] [] ? > find_lock_page+0x30/0x70 > Feb 15 14:13:07 nebula3 kernel: [16423.262115] [] ? > find_or_create_page+0x34/0x90 > Feb 15 14:13:07 nebula3 kernel: [16423.262125] [] ? > radix_tree_lookup_slot+0xe/0x10 > Feb 15 14:13:07 nebula3 kernel: [16423.262134] [] > alloc_pages_current+0xa3/0x160 > Feb 15 14:13:07 nebula3 kernel: [16423.262144] [] > __get_free_pages+0xe/0x50 > Feb 15 14:13:07 nebula3 kernel: [16423.262157] [] > kmalloc_order_trace+0x2e/0xa0 > Feb 15 14:13:07 nebula3 kernel: [16423.262170] [] ? 
> wake_up_bit+0x25/0x30 > Feb 15 14:13:07 nebula3 kernel: [16423.262177] [] > __kmalloc+0x211/0x230 > Feb 15 14:13:07 nebula3 kernel: [16423.262192] [] > gfs2_rlist_alloc+0x26/0x70 [gfs2] > Feb 15 14:13:07 nebula3 kernel: [16423.262199] [] > recursive_scan+0x29d/0x6a0 [gfs2] > Feb 15 14:13:07 nebula3 kernel: [16423.262206] [] > recursive_scan+0x46c/0x6a0 [gfs2] > Feb 15 14:13:07 nebula3 kernel: [16423.262217] [] ? > gfs2_quota_hold+0x175/0x1f0 [gfs2] > Feb 15 14:13:07 nebula3 kernel: [16423.262224] [] > trunc_dealloc+0xfa/0x120 [gfs2] > Feb 15 14:13:07 nebula3 kernel: [16423.262232] [] ? > gfs2_glock_wait+0x3e/0x80 [gfs2] > Feb 15 14:13:07 nebula3 kernel: [16423.262240] [] ? > gfs2_glock_nq+0x280/0x430 [gfs2] > Feb 15 14:13:07 nebula3 kernel: [16423.262247] [] > gfs2_file_dealloc+0x10/0x20 [gfs2] > Feb 15 14:13:07 nebula3 kernel: [16423.262257] [] > gfs2_evict_inode+0x2b3/0x3e0 [gfs2] > Feb 15 14:13:07 nebula3 kernel: [16423.262276] [] ? > gfs2_evict_inode+0x113/0x3e0 [gfs2] > Feb 15 14:13:07 nebula3 kernel: [16423.262286] [] > evict+0xb0/0x1b0 > Feb 15 14:13:07 nebula3 kernel: [16423.262290] [] > iput+0xf5/0x180 > Feb 15 14:13:07 nebula3 kernel: [16423.262296] [] > do_unlinkat+0x18e/0x2b0 > Feb 15 14:13:07 nebula3 kernel: [16423.262305] [] ? > filp_close+0x56/0x70 > Feb 15 14:13:07 nebula3 kernel: [16423.262310] [] > SyS_unlinkat+0x1b/0x40 > Feb 15 14:13:07 nebula3 kernel: [16423.262315] [] > system_call_fastpath+0x1a/0x1f > Feb 15 14:13:07 nebula3 kernel: [16423.262318] ---[ end trace > 346ccba5c58117dc ]--- > > Regards. > -- > Daniel Dehennin > R?cup?rer ma clef GPG: gpg --recv-keys 0xCC1E9E5B7A6FE2DF > Fingerprint: 3E69 014E 5C23 50E8 9ED6 2AAD CC1E 9E5B 7A6F E2DF > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster Hi, This call trace may safely be ignored. This is a known problem and is harmless. This is documented as a very old bug record. 
I don't know if it's accessible externally, so you may not be able to read it:

https://bugzilla.redhat.com/show_bug.cgi?id=790188

It does not relate to the previous error you posted.

Regards,

Bob Peterson
Red Hat File Systems

From rpeterso at redhat.com Mon Feb 15 13:42:45 2016
From: rpeterso at redhat.com (Bob Peterson)
Date: Mon, 15 Feb 2016 08:42:45 -0500 (EST)
Subject: [Linux-cluster] GFS2 filesystem consistency error
In-Reply-To: <87oabijpmu.fsf@hati.baby-gnu.org>
References: <87oabijpmu.fsf@hati.baby-gnu.org>
Message-ID: <1915915298.22900944.1455543765744.JavaMail.zimbra@redhat.com>

----- Original Message -----
> Hello,
>
> We have been running into trouble on our GFS2 for several days (log attached):
>
> - the FS ran without trouble for some time (since 2014-11-03)
>
> - the FS was grown from 3 TB to 4 TB about 6 months ago
>
> - it seems to happen only on one node, “nebula3”
>
> - I ran an fsck when just fencing the node was not sufficient (2 crashes
>   the same day)
>
> The nodes run Ubuntu Trusty Tahr, up to date.
>
> Do you have any idea?
>
> Regards.
>
> --
> Daniel Dehennin
> Récupérer ma clef GPG: gpg --recv-keys 0xCC1E9E5B7A6FE2DF
> Fingerprint: 3E69 014E 5C23 50E8 9ED6 2AAD CC1E 9E5B 7A6F E2DF

Hi Daniel,

As Steve mentioned, this means it tried to free a block that was already free. The problem might be due to timing issues with regard to changing blocks from "unlinked" state to "free" state. I have a number of patches required to fix this, but some of them are not even in the upstream kernel yet. And there is no guarantee this is the cause or solution for your problem.

As for fixing the file system: newer versions of fsck.gfs2 should be able to fix it. If it doesn't fix the file system, perhaps I could get a copy of your file system metadata (via "gfs2_edit savemeta") and I can see where it's failing. There are some known problems with fsck.gfs2 not being able to correctly repair file systems that have been grown with gfs2_grow.
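The procedure Bob describes boils down to two gfs2-utils commands. A dry-run sketch, with a hypothetical device path; since the real commands need root and the filesystem deactivated on every cluster node, only the command strings are built and printed here:

```shell
# Hypothetical device path; substitute your own clustered LV.
DEVICE=/dev/mapper/vg_cluster-datastores

# Capture the filesystem metadata for offline analysis (gfs2_edit savemeta,
# as mentioned above), writing it to a plain file that can be compressed
# and uploaded somewhere reachable.
save_cmd="gfs2_edit savemeta $DEVICE /var/tmp/datastores.meta"

# The repair attempt: -p ("preen") applies safe fixes automatically,
# and -f forces a full check even if the fs looks clean.
fsck_cmd="fsck.gfs2 -f -p $DEVICE"

printf '%s\n' "$save_cmd" "$fsck_cmd"
```

Both steps are best done in a maintenance window, with the filesystem unmounted cluster-wide, so the captured metadata is self-consistent.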
I've got fixes for that too, but it is all experimental code and none have gone upstream yet. Your metadata might help me test it. :)

Regards,

Bob Peterson
Red Hat File Systems

From daniel.dehennin at baby-gnu.org Mon Feb 15 14:22:55 2016
From: daniel.dehennin at baby-gnu.org (Daniel Dehennin)
Date: Mon, 15 Feb 2016 15:22:55 +0100
Subject: [Linux-cluster] GFS2 filesystem consistency error
In-Reply-To: <1777830627.22898504.1455543353023.JavaMail.zimbra@redhat.com> (Bob Peterson's message of "Mon, 15 Feb 2016 08:35:53 -0500 (EST)")
References: <87oabijpmu.fsf@hati.baby-gnu.org> <56C19E1D.80007@redhat.com> <87k2m6jnp9.fsf@hati.baby-gnu.org> <87fuwujelc.fsf@hati.baby-gnu.org> <1777830627.22898504.1455543353023.JavaMail.zimbra@redhat.com>
Message-ID: <87bn7ijbnk.fsf@hati.baby-gnu.org>

Bob Peterson writes:

> ----- Original Message -----
>> Daniel Dehennin writes:
>>
> (snip)
>> Now the kernel gave me a warning, if it could help:

[...]

> This call trace may safely be ignored. This is a known problem and is
> harmless. It is documented in a very old bug record. I don't know if
> it's accessible externally, so you may not be able to read it:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=790188
>
> It does not relate to the previous error you posted.

Thanks for reassuring me. I can't read the bug, but I trust you ;-)

Regards.
-- 
Daniel Dehennin
Récupérer ma clef GPG: gpg --recv-keys 0xCC1E9E5B7A6FE2DF
Fingerprint: 3E69 014E 5C23 50E8 9ED6 2AAD CC1E 9E5B 7A6F E2DF
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 342 bytes
Desc: not available
URL: 

From daniel.dehennin at baby-gnu.org Mon Feb 15 14:26:22 2016
From: daniel.dehennin at baby-gnu.org (Daniel Dehennin)
Date: Mon, 15 Feb 2016 15:26:22 +0100
Subject: [Linux-cluster] GFS2 filesystem consistency error
In-Reply-To: <1915915298.22900944.1455543765744.JavaMail.zimbra@redhat.com> (Bob Peterson's message of "Mon, 15 Feb 2016 08:42:45 -0500 (EST)")
References: <87oabijpmu.fsf@hati.baby-gnu.org> <1915915298.22900944.1455543765744.JavaMail.zimbra@redhat.com>
Message-ID: <877fi6jbht.fsf@hati.baby-gnu.org>

Bob Peterson writes:

[...]

> As Steve mentioned, this means it tried to free a block that was already free.
> The problem might be due to timing issues with regard to changing blocks
> from "unlinked" state to "free" state. I have a number of patches required
> to fix this, but some of them are not even in the upstream kernel yet.
> And there is no guarantee this is the cause or solution for your
> problem.

Ok, I understand.

> As for fixing the file system: newer versions of fsck.gfs2 should be able to
> fix it. If it doesn't fix the file system, perhaps I could
> get a copy of your file system metadata (via "gfs2_edit savemeta") and I
> can see where it's failing. There are some known problems with fsck.gfs2 not
> being able to correctly repair file systems that have been grown with gfs2_grow.
> I've got fixes for that too, but it is all experimental code and none have gone
> upstream yet. Your metadata might help me test it. :)

It's running, but it looks like it will take a long time and produce a huge file.

The start of the “gfs2_edit savemeta” output looks strange to me, though:

There are 1073479680 blocks of 4096 bytes in the destination device.
Reading resource groups...Done. File system size: 1023.734M

Is it saying my FS is 1 TB instead of the real 4 TB?

Regards.
-- 
Daniel Dehennin
Récupérer ma clef GPG: gpg --recv-keys 0xCC1E9E5B7A6FE2DF
Fingerprint: 3E69 014E 5C23 50E8 9ED6 2AAD CC1E 9E5B 7A6F E2DF
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 342 bytes
Desc: not available
URL: 

From rpeterso at redhat.com Mon Feb 15 14:39:32 2016
From: rpeterso at redhat.com (Bob Peterson)
Date: Mon, 15 Feb 2016 09:39:32 -0500 (EST)
Subject: [Linux-cluster] GFS2 filesystem consistency error
In-Reply-To: <877fi6jbht.fsf@hati.baby-gnu.org>
References: <87oabijpmu.fsf@hati.baby-gnu.org> <1915915298.22900944.1455543765744.JavaMail.zimbra@redhat.com> <877fi6jbht.fsf@hati.baby-gnu.org>
Message-ID: <847023142.23036741.1455547172124.JavaMail.zimbra@redhat.com>

----- Original Message -----
> It's running, but it looks like it will take a long time and produce a
> huge file.
>
> The start of the “gfs2_edit savemeta” output looks strange to me:
>
> There are 1073479680 blocks of 4096 bytes in the destination device.
> Reading resource groups...Done. File system size: 1023.734M
>
> Is it saying my FS is 1 TB instead of the real 4 TB?

Hi Daniel,

It sounds like the resulting metadata file will be too big to email, so you'll need to find a suitable server to put it on, so I can grab it.

No, a 4 TB device looks about right: 1073479680 blocks * 4096 bytes.

Regards,

Bob Peterson
Red Hat File Systems

From anprice at redhat.com Mon Feb 15 14:43:31 2016
From: anprice at redhat.com (Andrew Price)
Date: Mon, 15 Feb 2016 14:43:31 +0000
Subject: [Linux-cluster] GFS2 filesystem consistency error
In-Reply-To: <877fi6jbht.fsf@hati.baby-gnu.org>
References: <87oabijpmu.fsf@hati.baby-gnu.org> <1915915298.22900944.1455543765744.JavaMail.zimbra@redhat.com> <877fi6jbht.fsf@hati.baby-gnu.org>
Message-ID: <56C1E413.20503@redhat.com>

On 15/02/16 14:26, Daniel Dehennin wrote:
> But the start of the “gfs2_edit savemeta”
command looks strange to me:
>
> There are 1073479680 blocks of 4096 bytes in the destination device.
> Reading resource groups...Done. File system size: 1023.734M
>
> Is it saying my FS is 1 TB instead of the real 4 TB?

I suspect you're using an older version of gfs2-utils that doesn't contain this patch:

https://git.fedorahosted.org/cgit/gfs2-utils.git/commit/?id=45b761f6

It's only a cosmetic "multiply by the fs block size before printing the value" step that's missing, so nothing to worry about. The patch was included in gfs2-utils 3.1.7.

Cheers,
Andy

From daniel.dehennin at baby-gnu.org Mon Feb 15 14:56:37 2016
From: daniel.dehennin at baby-gnu.org (Daniel Dehennin)
Date: Mon, 15 Feb 2016 15:56:37 +0100
Subject: [Linux-cluster] GFS2 filesystem consistency error
In-Reply-To: <56C1E413.20503@redhat.com> (Andrew Price's message of "Mon, 15 Feb 2016 14:43:31 +0000")
References: <87oabijpmu.fsf@hati.baby-gnu.org> <1915915298.22900944.1455543765744.JavaMail.zimbra@redhat.com> <877fi6jbht.fsf@hati.baby-gnu.org> <56C1E413.20503@redhat.com>
Message-ID: <87y4amhviy.fsf@hati.baby-gnu.org>

Andrew Price writes:

[...]

> I suspect you're using an older version of gfs2-utils that doesn't
> contain this patch:
>
> https://git.fedorahosted.org/cgit/gfs2-utils.git/commit/?id=45b761f6
>
> It's only a cosmetic "multiply by the fs block size before printing
> the value" step that's missing, so nothing to worry about.
>
> The patch was included in gfs2-utils 3.1.7.

Thanks, I'm using 3.1.6.

Tonight I'll build version 3.1.8 from Git[1] and run “fsck.gfs2 -p” on the fs.

Regards.

Footnotes:
[1] https://git.fedorahosted.org/cgit/gfs2-utils.git/tag/?h=3.1.8

-- 
Daniel Dehennin
Récupérer ma clef GPG: gpg --recv-keys 0xCC1E9E5B7A6FE2DF
Fingerprint: 3E69 014E 5C23 50E8 9ED6 2AAD CC1E 9E5B 7A6F E2DF
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 342 bytes
Desc: not available
URL: 

From daniel.dehennin at baby-gnu.org Mon Feb 15 14:59:52 2016
From: daniel.dehennin at baby-gnu.org (Daniel Dehennin)
Date: Mon, 15 Feb 2016 15:59:52 +0100
Subject: [Linux-cluster] GFS2 filesystem consistency error
In-Reply-To: <847023142.23036741.1455547172124.JavaMail.zimbra@redhat.com> (Bob Peterson's message of "Mon, 15 Feb 2016 09:39:32 -0500 (EST)")
References: <87oabijpmu.fsf@hati.baby-gnu.org> <1915915298.22900944.1455543765744.JavaMail.zimbra@redhat.com> <877fi6jbht.fsf@hati.baby-gnu.org> <847023142.23036741.1455547172124.JavaMail.zimbra@redhat.com>
Message-ID: <87twlahvdj.fsf@hati.baby-gnu.org>

Bob Peterson writes:

[...]

> It sounds like the resulting metadata file will be too big to email, so
> you'll need to find a suitable server to put it on, so I can grab it.

Sure. For now I'm at “3604476 inodes processed, 272407 blocks saved (0%) processed” with a meta file of 127 MB. My fs has 3 TB used; I hope the meta file will not fill my /home (2.5 GB) ;-)

Regards.
-- 
Daniel Dehennin
Récupérer ma clef GPG: gpg --recv-keys 0xCC1E9E5B7A6FE2DF
Fingerprint: 3E69 014E 5C23 50E8 9ED6 2AAD CC1E 9E5B 7A6F E2DF
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 342 bytes
Desc: not available
URL: 

From daniel.dehennin at baby-gnu.org Fri Feb 19 09:51:31 2016
From: daniel.dehennin at baby-gnu.org (Daniel Dehennin)
Date: Fri, 19 Feb 2016 10:51:31 +0100
Subject: [Linux-cluster] GFS2 filesystem consistency error
In-Reply-To: <87y4amhviy.fsf@hati.baby-gnu.org> (Daniel Dehennin's message of "Mon, 15 Feb 2016 15:56:37 +0100")
References: <87oabijpmu.fsf@hati.baby-gnu.org> <1915915298.22900944.1455543765744.JavaMail.zimbra@redhat.com> <877fi6jbht.fsf@hati.baby-gnu.org> <56C1E413.20503@redhat.com> <87y4amhviy.fsf@hati.baby-gnu.org>
Message-ID: <87lh6hhvto.fsf@hati.baby-gnu.org>

Daniel Dehennin writes:

> Thanks, I'm using 3.1.6.
>
> Tonight I'll build version 3.1.8 from Git[1] and run “fsck.gfs2 -p” on the fs.

Hello,

I preferred to run the fsck on the filesystem, twice[1], instead of the “gfs2_edit savemeta”:

1. “fsck.gfs2 -p ” was quick
2. “fsck.gfs2 -f -p ” took 4 hours

The cluster was brought back up afterwards and everything was working fine until yesterday:

Feb 18 19:13:22 nebula3 kernel: [293848.682606] GFS2: buf_blk = 0x2089 old_state=0, new_state=0
Feb 18 19:13:22 nebula3 kernel: [293848.682612] GFS2: rgrp=0xc0c5667 bi_start=0x0
Feb 18 19:13:22 nebula3 kernel: [293848.682614] GFS2: bi_offset=0x80 bi_len=0xf80
Feb 18 19:13:22 nebula3 kernel: [293848.682619] CPU: 6 PID: 7057 Comm: kworker/6:8 Tainted: G W 3.13.0-78-generic #122-Ubuntu
Feb 18 19:13:22 nebula3 kernel: [293848.682621] Hardware name: Dell Inc.
PowerEdge M620/0T36VK, BIOS 2.2.7 01/21/2014 Feb 18 19:13:22 nebula3 kernel: [293848.682637] Workqueue: delete_workqueue delete_work_func [gfs2] Feb 18 19:13:22 nebula3 kernel: [293848.682640] 000000000c0c7705 ffff8811256c59d8 ffffffff81725768 000000000c0c76f6 Feb 18 19:13:22 nebula3 kernel: [293848.682648] ffff8811256c5a30 ffffffffa05bebbf ffff880f5ffe9200 00000000a05c5977 Feb 18 19:13:22 nebula3 kernel: [293848.682653] ffff880f1ee574c8 0000000000002089 ffff882e8c622000 0000000000000010 Feb 18 19:13:22 nebula3 kernel: [293848.682658] Call Trace: Feb 18 19:13:22 nebula3 kernel: [293848.682668] [] dump_stack+0x45/0x56 Feb 18 19:13:22 nebula3 kernel: [293848.682681] [] rgblk_free+0x1ff/0x230 [gfs2] Feb 18 19:13:22 nebula3 kernel: [293848.682693] [] __gfs2_free_blocks+0x34/0x120 [gfs2] Feb 18 19:13:22 nebula3 kernel: [293848.682700] [] recursive_scan+0x5b6/0x6a0 [gfs2] Feb 18 19:13:22 nebula3 kernel: [293848.682707] [] recursive_scan+0x46c/0x6a0 [gfs2] Feb 18 19:13:22 nebula3 kernel: [293848.682714] [] ? submit_bio+0x71/0x150 Feb 18 19:13:22 nebula3 kernel: [293848.682720] [] ? bio_alloc_bioset+0x196/0x2a0 Feb 18 19:13:22 nebula3 kernel: [293848.682727] [] ? _submit_bh+0x150/0x200 Feb 18 19:13:22 nebula3 kernel: [293848.682734] [] recursive_scan+0x46c/0x6a0 [gfs2] Feb 18 19:13:22 nebula3 kernel: [293848.682744] [] ? gfs2_quota_hold+0x175/0x1f0 [gfs2] Feb 18 19:13:22 nebula3 kernel: [293848.682752] [] trunc_dealloc+0xfa/0x120 [gfs2] Feb 18 19:13:22 nebula3 kernel: [293848.682760] [] ? gfs2_glock_wait+0x3e/0x80 [gfs2] Feb 18 19:13:22 nebula3 kernel: [293848.682769] [] ? gfs2_glock_nq+0x280/0x430 [gfs2] Feb 18 19:13:22 nebula3 kernel: [293848.682777] [] gfs2_file_dealloc+0x10/0x20 [gfs2] Feb 18 19:13:22 nebula3 kernel: [293848.682787] [] gfs2_evict_inode+0x2b3/0x3e0 [gfs2] Feb 18 19:13:22 nebula3 kernel: [293848.682796] [] ? 
gfs2_evict_inode+0x113/0x3e0 [gfs2] Feb 18 19:13:22 nebula3 kernel: [293848.682802] [] evict+0xb0/0x1b0 Feb 18 19:13:22 nebula3 kernel: [293848.682807] [] iput+0xf5/0x180 Feb 18 19:13:22 nebula3 kernel: [293848.682815] [] delete_work_func+0x5c/0x90 [gfs2] Feb 18 19:13:22 nebula3 kernel: [293848.682822] [] process_one_work+0x182/0x450 Feb 18 19:13:22 nebula3 kernel: [293848.682827] [] worker_thread+0x121/0x410 Feb 18 19:13:22 nebula3 kernel: [293848.682832] [] ? rescuer_thread+0x430/0x430 Feb 18 19:13:22 nebula3 kernel: [293848.682837] [] kthread+0xd2/0xf0 Feb 18 19:13:22 nebula3 kernel: [293848.682841] [] ? kthread_create_on_node+0x1c0/0x1c0 Feb 18 19:13:22 nebula3 kernel: [293848.682846] [] ret_from_fork+0x58/0x90 Feb 18 19:13:22 nebula3 kernel: [293848.682850] [] ? kthread_create_on_node+0x1c0/0x1c0 Feb 18 19:13:22 nebula3 kernel: [293848.682855] GFS2: fsid=yggdrasil:datastores.1: fatal: filesystem consistency error Feb 18 19:13:22 nebula3 kernel: [293848.682855] GFS2: fsid=yggdrasil:datastores.1: RG = 202135143 Feb 18 19:13:22 nebula3 kernel: [293848.682855] GFS2: fsid=yggdrasil:datastores.1: function = gfs2_setbit, file = /build/linux-OTIHGI/linux-3.13.0/fs/gfs2/rgrp.c, line = 103 Feb 18 19:13:22 nebula3 kernel: [293848.682859] GFS2: fsid=yggdrasil:datastores.1: about to withdraw this file system Feb 18 19:13:22 nebula3 kernel: [293848.699050] GFS2: fsid=yggdrasil:datastores.1: dirty_inode: glock -5 Feb 18 19:13:22 nebula3 kernel: [293848.705401] GFS2: fsid=yggdrasil:datastores.1: dirty_inode: glock -5 Now, the ?always faulty node? is down and I'm doing the ?gfs2_edit savemeta? from the other node. I'm wondering if I should not upgrade the kernels to a much newer version than 3.13.0. My Ubuntu Trusty has some proposed kernel up to 4.2.0. Regards. 
Footnotes:
[1] The logs are attached to this email

-- 
Daniel Dehennin
Récupérer ma clef GPG: gpg --recv-keys 0xCC1E9E5B7A6FE2DF
Fingerprint: 3E69 014E 5C23 50E8 9ED6 2AAD CC1E 9E5B 7A6F E2DF

From daniel.dehennin at baby-gnu.org Sat Feb 20 15:59:49 2016
From: daniel.dehennin at baby-gnu.org (Daniel Dehennin)
Date: Sat, 20 Feb 2016 16:59:49 +0100
Subject: [Linux-cluster] GFS2 filesystem consistency error
In-Reply-To: <87lh6hhvto.fsf@hati.baby-gnu.org> (Daniel Dehennin's message of "Fri, 19 Feb 2016 10:51:31 +0100")
References: <87oabijpmu.fsf@hati.baby-gnu.org> <1915915298.22900944.1455543765744.JavaMail.zimbra@redhat.com> <877fi6jbht.fsf@hati.baby-gnu.org> <56C1E413.20503@redhat.com> <87y4amhviy.fsf@hati.baby-gnu.org> <87lh6hhvto.fsf@hati.baby-gnu.org>
Message-ID: <878u2fid8q.fsf@hati.baby-gnu.org>

Daniel Dehennin writes:

[...]

> I preferred to run the fsck on the filesystem, twice[1], instead of
> the “gfs2_edit savemeta”:
>
> 1. “fsck.gfs2 -p ” was quick
> 2. “fsck.gfs2 -f -p ” took 4 hours

[...]

> Footnotes:
> [1] The logs are attached to this email

I forgot the files, but attachments do not pass on the list. I pushed everything to an HTTP server[1]:

- gfs2-fsck.log is the output of “fsck.gfs2 -p ”

- gfs2-fsck-forced.log is the output of “fsck.gfs2 -f -p ”

- gfs2.meta.gz is the “gfs2_edit savemeta” file, made with version 3.1.8 of gfs2-utils.

Regards.

Footnotes:
[1] http://eole.ac-dijon.fr/pub/.gfs2/

-- 
Daniel Dehennin
Récupérer ma clef GPG: gpg --recv-keys 0xCC1E9E5B7A6FE2DF
Fingerprint: 3E69 014E 5C23 50E8 9ED6 2AAD CC1E 9E5B 7A6F E2DF
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 342 bytes
Desc: not available
URL: 

From daniel.dehennin at baby-gnu.org Sun Feb 21 10:51:52 2016
From: daniel.dehennin at baby-gnu.org (Daniel Dehennin)
Date: Sun, 21 Feb 2016 11:51:52 +0100
Subject: [Linux-cluster] GFS2 filesystem consistency error
In-Reply-To: <878u2fid8q.fsf@hati.baby-gnu.org> (Daniel Dehennin's message of "Sat, 20 Feb 2016 16:59:49 +0100")
References: <87oabijpmu.fsf@hati.baby-gnu.org> <1915915298.22900944.1455543765744.JavaMail.zimbra@redhat.com> <877fi6jbht.fsf@hati.baby-gnu.org> <56C1E413.20503@redhat.com> <87y4amhviy.fsf@hati.baby-gnu.org> <87lh6hhvto.fsf@hati.baby-gnu.org> <878u2fid8q.fsf@hati.baby-gnu.org>
Message-ID: <874md2ibef.fsf@hati.baby-gnu.org>

Daniel Dehennin writes:

[...]

> I pushed everything to an HTTP server[1]:
>
> - gfs2-fsck.log is the output of “fsck.gfs2 -p ”
>
> - gfs2-fsck-forced.log is the output of “fsck.gfs2 -f -p ”
>
> - gfs2.meta.gz is the “gfs2_edit savemeta” file, made with version 3.1.8
>   of gfs2-utils.

I just fixed the perms on the .gz file, sorry.
-- 
Daniel Dehennin
Récupérer ma clef GPG: gpg --recv-keys 0xCC1E9E5B7A6FE2DF
Fingerprint: 3E69 014E 5C23 50E8 9ED6 2AAD CC1E 9E5B 7A6F E2DF
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 342 bytes
Desc: not available
URL:

From rpeterso at redhat.com Mon Feb 22 13:33:20 2016
From: rpeterso at redhat.com (Bob Peterson)
Date: Mon, 22 Feb 2016 08:33:20 -0500 (EST)
Subject: [Linux-cluster] GFS2 filesystem consistency error
In-Reply-To: <874md2ibef.fsf@hati.baby-gnu.org>
References: <87oabijpmu.fsf@hati.baby-gnu.org> <1915915298.22900944.1455543765744.JavaMail.zimbra@redhat.com> <877fi6jbht.fsf@hati.baby-gnu.org> <56C1E413.20503@redhat.com> <87y4amhviy.fsf@hati.baby-gnu.org> <87lh6hhvto.fsf@hati.baby-gnu.org> <878u2fid8q.fsf@hati.baby-gnu.org> <874md2ibef.fsf@hati.baby-gnu.org>
Message-ID: <947384755.27351195.1456148000417.JavaMail.zimbra@redhat.com>

----- Original Message -----
> Daniel Dehennin writes:
>
> [...]
>
> > I pushed everything to an HTTP server[1]:
> >
> > - gfs2-fsck.log is the output of "fsck.gfs2 -p"
> >
> > - gfs2-fsck-forced.log is the output of "fsck.gfs2 -f -p"
> >
> > - gfs2.meta.gz is the "gfs2_edit savemeta" file, with version 3.1.8 of
> >   gfs2-utils.
>
> I just fixed the perms on the .gz file, sorry.
> --
> Daniel Dehennin
> Récupérer ma clef GPG: gpg --recv-keys 0xCC1E9E5B7A6FE2DF
> Fingerprint: 3E69 014E 5C23 50E8 9ED6 2AAD CC1E 9E5B 7A6F E2DF
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

Hi Daniel,

I'm downloading the metadata now. I'll let you know what I find.
It may take a while because my storage is a bit in flux at the moment.
Regards,

Bob Peterson
Red Hat File Systems

From daniel.dehennin at baby-gnu.org Mon Feb 22 15:59:11 2016
From: daniel.dehennin at baby-gnu.org (Daniel Dehennin)
Date: Mon, 22 Feb 2016 16:59:11 +0100
Subject: [Linux-cluster] GFS2 filesystem consistency error
In-Reply-To: <947384755.27351195.1456148000417.JavaMail.zimbra@redhat.com> (Bob Peterson's message of "Mon, 22 Feb 2016 08:33:20 -0500 (EST)")
References: <87oabijpmu.fsf@hati.baby-gnu.org> <1915915298.22900944.1455543765744.JavaMail.zimbra@redhat.com> <877fi6jbht.fsf@hati.baby-gnu.org> <56C1E413.20503@redhat.com> <87y4amhviy.fsf@hati.baby-gnu.org> <87lh6hhvto.fsf@hati.baby-gnu.org> <878u2fid8q.fsf@hati.baby-gnu.org> <874md2ibef.fsf@hati.baby-gnu.org> <947384755.27351195.1456148000417.JavaMail.zimbra@redhat.com>
Message-ID: <87vb5ghh2o.fsf@hati.baby-gnu.org>

Bob Peterson writes:

[...]

> Hi Daniel,
>
> I'm downloading the metadata now. I'll let you know what I find.
> It may take a while because my storage is a bit in flux at the moment.

Ok, thanks a lot for looking at our problems.

Regards.

--
Daniel Dehennin
Récupérer ma clef GPG: gpg --recv-keys 0xCC1E9E5B7A6FE2DF
Fingerprint: 3E69 014E 5C23 50E8 9ED6 2AAD CC1E 9E5B 7A6F E2DF

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 342 bytes
Desc: not available
URL:

From rpeterso at redhat.com Tue Feb 23 18:00:31 2016
From: rpeterso at redhat.com (Bob Peterson)
Date: Tue, 23 Feb 2016 13:00:31 -0500 (EST)
Subject: [Linux-cluster] GFS2 filesystem consistency error
In-Reply-To: <87vb5ghh2o.fsf@hati.baby-gnu.org>
References: <87oabijpmu.fsf@hati.baby-gnu.org> <56C1E413.20503@redhat.com> <87y4amhviy.fsf@hati.baby-gnu.org> <87lh6hhvto.fsf@hati.baby-gnu.org> <878u2fid8q.fsf@hati.baby-gnu.org> <874md2ibef.fsf@hati.baby-gnu.org> <947384755.27351195.1456148000417.JavaMail.zimbra@redhat.com> <87vb5ghh2o.fsf@hati.baby-gnu.org>
Message-ID: <1936956765.28647617.1456250431522.JavaMail.zimbra@redhat.com>

----- Original Message -----
> Bob Peterson writes:
>
> [...]
>
> > Hi Daniel,
> >
> > I'm downloading the metadata now. I'll let you know what I find.
> > It may take a while because my storage is a bit in flux at the moment.
>
> Ok, thanks a lot for looking at our problems.
>
> Regards.
> --
> Daniel Dehennin
> Récupérer ma clef GPG: gpg --recv-keys 0xCC1E9E5B7A6FE2DF
> Fingerprint: 3E69 014E 5C23 50E8 9ED6 2AAD CC1E 9E5B 7A6F E2DF

Hi Daniel,

I took a look at that metadata you sent me, but I didn't find any evidence
relating to the problem you posted. Either the corruption happened a long
time prior to your saving of the metadata, or else the metadata was saved
after an fsck.gfs2 fixed (or attempted to fix) the problem?

One thing's for sure: I don't see any evidence of wild file system corruption;
certainly nothing that can account for those errors.

You said the problem seemed to revolve around a gfs2_grow operation, right?
Can you make sure the lvm2 volume group has the clustered bit set?
Please do the "vgs" command and see if that volume has "c" listed in its
flags. If not, it could have caused problems for the gfs2_grow.

I've seen problems like this very rarely.
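The "c" flag Bob mentions is the last character of the `vgs` attribute string (`vg_attr`). A minimal sketch of the check; the volume group name `vg0` and the sample attribute strings are illustrative assumptions, not values taken from this thread:

```shell
# is_clustered ATTR: succeed when a 6-character LVM vg_attr string
# (e.g. "wz--nc") has the clustered bit 'c' in the last position.
is_clustered() {
    case $1 in
        ?????c) return 0 ;;   # 6th attribute character is 'c': clustered VG
        *)      return 1 ;;
    esac
}

# On a live node one would feed it real vgs output (vg0 is a hypothetical name):
#   attr=$(vgs --noheadings -o vg_attr vg0 | tr -d ' ')
#   is_clustered "$attr" && echo "clustered"
is_clustered "wz--nc" && echo "wz--nc: clustered"
is_clustered "wz--n-" || echo "wz--n-: not clustered"
```

If the bit turns out to be missing on a VG shared by a cluster, `vgchange -cy <vgname>` is usually the way to set it.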
Once was a legitimate bug in GFS2 that we fixed in RHEL5, but I assume your
kernel is newer than that. The other problem we weren't able to solve because
there was no evidence of what went wrong.

My only working theory is this:

This might be related to the transition between "unlinked" dinodes and
"free". After a file is deleted, it goes to "unlinked" and has to be
transitioned to "free". This sometimes goes wrong because of the way
it needs to check what other nodes in the cluster are doing.

Maybe: If you have three nodes, and a file was unlinked on node 1, then
maybe the internode communication got confused and nodes 2 and 3 both
tried to transition it from Unlinked to Free. That is only a theory, and
there is absolutely no proof. However, I have a set of patches that are
experimental, and not even in the upstream kernel yet (hopefully soon!)
that try to tighten up and fix problems like this. It's much more common
for multiple nodes to try to transition from Unlinked to Free, and they
all fail, leaving the file in an "Unlinked" state.

Regards,

Bob Peterson
Red Hat File Systems

From shreekant.jena at gmail.com Wed Feb 24 08:08:21 2016
From: shreekant.jena at gmail.com (Shreekant Jena)
Date: Wed, 24 Feb 2016 13:38:21 +0530
Subject: [Linux-cluster] GFS2 filesystem consistency error
In-Reply-To: <1936956765.28647617.1456250431522.JavaMail.zimbra@redhat.com>
References: <87oabijpmu.fsf@hati.baby-gnu.org> <56C1E413.20503@redhat.com> <87y4amhviy.fsf@hati.baby-gnu.org> <87lh6hhvto.fsf@hati.baby-gnu.org> <878u2fid8q.fsf@hati.baby-gnu.org> <874md2ibef.fsf@hati.baby-gnu.org> <947384755.27351195.1456148000417.JavaMail.zimbra@redhat.com> <87vb5ghh2o.fsf@hati.baby-gnu.org> <1936956765.28647617.1456250431522.JavaMail.zimbra@redhat.com>
Message-ID:

Hi, I'm having a problem with a two-node cluster. The secondary node is showing offline after reboot, and CMAN is not starting.
Below are the logs from the offline node:

[root at EI51SPM1 cluster]# clustat
msg_open: Invalid argument
Member Status: Inquorate
Resource Group Manager not running; no service information available.
Membership information not available
[root at EI51SPM1 cluster]# tail -10 /var/log/messages
Feb 24 13:36:23 EI51SPM1 ccsd[25487]: Error while processing connect: Connection refused
Feb 24 13:36:23 EI51SPM1 kernel: CMAN: sending membership request
Feb 24 13:36:27 EI51SPM1 ccsd[25487]: Cluster is not quorate. Refusing connection.
Feb 24 13:36:27 EI51SPM1 ccsd[25487]: Error while processing connect: Connection refused
Feb 24 13:36:28 EI51SPM1 kernel: CMAN: sending membership request
Feb 24 13:36:32 EI51SPM1 ccsd[25487]: Cluster is not quorate. Refusing connection.
Feb 24 13:36:32 EI51SPM1 ccsd[25487]: Error while processing connect: Connection refused
Feb 24 13:36:32 EI51SPM1 ccsd[25487]: Cluster is not quorate. Refusing connection.
Feb 24 13:36:32 EI51SPM1 ccsd[25487]: Error while processing connect: Connection refused
Feb 24 13:36:33 EI51SPM1 kernel: CMAN: sending membership request
[root at EI51SPM1 cluster]#
[root at EI51SPM1 cluster]# cman_tool status
Protocol version: 5.0.1
Config version: 166
Cluster name: IVRS_DB
Cluster ID: 9982
Cluster Member: No
Membership state: Joining
[root at EI51SPM1 cluster]# cman_tool nodes
Node Votes Exp Sts Name
[root at EI51SPM1 cluster]#
[root at EI51SPM1 cluster]#

Thanks & Regards,
Shreekanta Jena

On Tue, Feb 23, 2016 at 11:30 PM, Bob Peterson wrote:
> ----- Original Message -----
> > Bob Peterson writes:
> >
> > [...]
> >
> > > Hi Daniel,
> > >
> > > I'm downloading the metadata now. I'll let you know what I find.
> > > It may take a while because my storage is a bit in flux at the moment.
> >
> > Ok, thanks a lot for looking at our problems.
> >
> > Regards.
> > --
> > Daniel Dehennin
> > Récupérer ma clef GPG: gpg --recv-keys 0xCC1E9E5B7A6FE2DF
> > Fingerprint: 3E69 014E 5C23 50E8 9ED6 2AAD CC1E 9E5B 7A6F E2DF
>
> Hi Daniel,
>
> I took a look at that metadata you sent me, but I didn't find any evidence
> relating to the problem you posted. Either the corruption happened a long
> time prior to your saving of the metadata, or else the metadata was saved
> after an fsck.gfs2 fixed (or attempted to fix) the problem?
>
> One thing's for sure: I don't see any evidence of wild file system corruption;
> certainly nothing that can account for those errors.
>
> You said the problem seemed to revolve around a gfs2_grow operation, right?
> Can you make sure the lvm2 volume group has the clustered bit set?
> Please do the "vgs" command and see if that volume has "c" listed in its
> flags. If not, it could have caused problems for the gfs2_grow.
>
> I've seen problems like this very rarely. Once was a legitimate bug in
> GFS2 that we fixed in RHEL5, but I assume your kernel is newer than that.
> The other problem we weren't able to solve because there was no evidence
> of what went wrong.
>
> My only working theory is this:
>
> This might be related to the transition between "unlinked" dinodes and
> "free". After a file is deleted, it goes to "unlinked" and has to be
> transitioned to "free". This sometimes goes wrong because of the way
> it needs to check what other nodes in the cluster are doing.
>
> Maybe: If you have three nodes, and a file was unlinked on node 1, then
> maybe the internode communication got confused and nodes 2 and 3 both
> tried to transition it from Unlinked to Free. That is only a theory, and
> there is absolutely no proof. However, I have a set of patches that are
> experimental, and not even in the upstream kernel yet (hopefully soon!)
> that try to tighten up and fix problems like this.
> It's much more common
> for multiple nodes to try to transition from Unlinked to Free, and they
> all fail, leaving the file in an "Unlinked" state.
>
> Regards,
>
> Bob Peterson
> Red Hat File Systems
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From daniel.dehennin at baby-gnu.org Wed Feb 24 09:05:06 2016
From: daniel.dehennin at baby-gnu.org (Daniel Dehennin)
Date: Wed, 24 Feb 2016 10:05:06 +0100
Subject: [Linux-cluster] GFS2 filesystem consistency error
In-Reply-To: <1936956765.28647617.1456250431522.JavaMail.zimbra@redhat.com> (Bob Peterson's message of "Tue, 23 Feb 2016 13:00:31 -0500 (EST)")
References: <87oabijpmu.fsf@hati.baby-gnu.org> <56C1E413.20503@redhat.com> <87y4amhviy.fsf@hati.baby-gnu.org> <87lh6hhvto.fsf@hati.baby-gnu.org> <878u2fid8q.fsf@hati.baby-gnu.org> <874md2ibef.fsf@hati.baby-gnu.org> <947384755.27351195.1456148000417.JavaMail.zimbra@redhat.com> <87vb5ghh2o.fsf@hati.baby-gnu.org> <1936956765.28647617.1456250431522.JavaMail.zimbra@redhat.com>
Message-ID: <87fuwih41p.fsf@hati.baby-gnu.org>

Bob Peterson writes:

> Hi Daniel,

Hello,

> I took a look at that metadata you sent me, but I didn't find any evidence
> relating to the problem you posted. Either the corruption happened a long
> time prior to your saving of the metadata, or else the metadata was saved
> after an fsck.gfs2 fixed (or attempted to fix) the problem?

- when I first encountered the problem, I ran an fsck on the filesystem with version 3.1.6 from Ubuntu.

- several days later, the same "dirty_inode: glock -5" messages started showing on the same node as the first time.

- I ran an fsck with version 3.1.8 built from git.

- a few days later, the same node had the "dirty_inode" messages; I shut down that node and then ran "gfs2_edit savemeta".

All nodes have the same hardware and OS/kernel/pacemaker versions.
> One thing's for sure: I don't see any evidence of wild file system corruption;
> certainly nothing that can account for those errors.
>
> You said the problem seemed to revolve around a gfs2_grow operation,
> right?

Not exactly: I live-grew the FS 6 months ago and encountered some troubles. I did an fsck at that time and the FS ran fine for months. Then we had the "dirty_inode" troubles starting on Feb 9.

> Can you make sure the lvm2 volume group has the clustered bit set?
> Please do the "vgs" command and see if that volume has "c" listed in its
> flags. If not, it could have caused problems for the gfs2_grow.

Yes, it has the cluster flag.

> I've seen problems like this very rarely. Once was a legitimate bug in
> GFS2 that we fixed in RHEL5, but I assume your kernel is newer than
> that.

We have 3.13.0-78-generic from Ubuntu.

[...]

> My only working theory is this:
>
> This might be related to the transition between "unlinked" dinodes and
> "free". After a file is deleted, it goes to "unlinked" and has to be
> transitioned to "free". This sometimes goes wrong because of the way
> it needs to check what other nodes in the cluster are doing.
>
> Maybe: If you have three nodes, and a file was unlinked on node 1, then
> maybe the internode communication got confused and nodes 2 and 3 both
> tried to transition it from Unlinked to Free. That is only a theory, and
> there is absolutely no proof. However, I have a set of patches that are
> experimental, and not even in the upstream kernel yet (hopefully soon!)
> that try to tighten up and fix problems like this. It's much more common
> for multiple nodes to try to transition from Unlinked to Free, and they
> all fail, leaving the file in an "Unlinked" state.

Thanks for the explanations; I will try to re-add the down node to the cluster and see what happens.

Regards.
--
Daniel Dehennin
Récupérer ma clef GPG: gpg --recv-keys 0xCC1E9E5B7A6FE2DF
Fingerprint: 3E69 014E 5C23 50E8 9ED6 2AAD CC1E 9E5B 7A6F E2DF

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 342 bytes
Desc: not available
URL:

From shreekant.jena at gmail.com Wed Feb 24 10:11:16 2016
From: shreekant.jena at gmail.com (Shreekant Jena)
Date: Wed, 24 Feb 2016 15:41:16 +0530
Subject: [Linux-cluster] Linux Cluster secondary node not coming up after reboot
Message-ID:

Hi, I'm having a problem with a two-node cluster. The secondary node is showing offline after reboot, and CMAN is not starting.

Below are the logs from the offline node:

[root at EI51SPM1 cluster]# clustat
msg_open: Invalid argument
Member Status: Inquorate
Resource Group Manager not running; no service information available.
Membership information not available
[root at EI51SPM1 cluster]# tail -10 /var/log/messages
Feb 24 13:36:23 EI51SPM1 ccsd[25487]: Error while processing connect: Connection refused
Feb 24 13:36:23 EI51SPM1 kernel: CMAN: sending membership request
Feb 24 13:36:27 EI51SPM1 ccsd[25487]: Cluster is not quorate. Refusing connection.
Feb 24 13:36:27 EI51SPM1 ccsd[25487]: Error while processing connect: Connection refused
Feb 24 13:36:28 EI51SPM1 kernel: CMAN: sending membership request
Feb 24 13:36:32 EI51SPM1 ccsd[25487]: Cluster is not quorate. Refusing connection.
Feb 24 13:36:32 EI51SPM1 ccsd[25487]: Error while processing connect: Connection refused
Feb 24 13:36:32 EI51SPM1 ccsd[25487]: Cluster is not quorate. Refusing connection.
Feb 24 13:36:32 EI51SPM1 ccsd[25487]: Error while processing connect: Connection refused
Feb 24 13:36:33 EI51SPM1 kernel: CMAN: sending membership request
[root at EI51SPM1 cluster]#
[root at EI51SPM1 cluster]# cman_tool status
Protocol version: 5.0.1
Config version: 166
Cluster name: IVRS_DB
Cluster ID: 9982
Cluster Member: No
Membership state: Joining
[root at EI51SPM1 cluster]# cman_tool nodes
Node Votes Exp Sts Name
[root at EI51SPM1 cluster]#
[root at EI51SPM1 cluster]#

Thanks & Regards,
Shreekanta Jena

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From eivind at aminor.no Fri Feb 26 10:27:58 2016
From: eivind at aminor.no (Eivind Olsen)
Date: Fri, 26 Feb 2016 11:27:58 +0100
Subject: [Linux-cluster] Why will only 2 of 3 Oracle instances start?
Message-ID: <1456482478.3796.17.camel@aminor.no>

Hello.

I currently have an issue with a cluster. It's running RHEL 6 (I believe it's 6.7), with the normal Ricci, Luci, etc.

The cluster has been working fine for over a year, until it suddenly didn't work fine anymore. Basically, it has three Oracle databases running, from the same Oracle-home, but with different data-directories. Both the Oracle software and datafiles are on shared storage (SAN), using clvm (so it's an active/passive setup, only one of the two nodes can have the disk, filesystem etc.).

There are three databases, and two of them will run fine on either node. The problem is the third instance, which will only run on one of the nodes. The short story is: I get told "ORA-01081: cannot start already-running ORACLE - shut it down first" on the non-working database.
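ORA-01081 at startup usually means System V shared-memory segments from a previous run of the instance are still present (emmanuel suggests exactly this later in the thread). A hedged sketch of how to look for leftovers; the owner name "oracle" is an assumption about the account the instance runs as:

```shell
# List System V shared-memory segments. A stale segment left behind by a
# crashed or half-stopped instance can make the next startup fail with
# ORA-01081.
ipcs -m

# Show only segments owned by the "oracle" user (owner is column 3):
ipcs -m | awk '$3 == "oracle"'

# After confirming no Oracle process still attaches to it (nattch column),
# a stale segment can be removed by id:
#   ipcrm -m <shmid>
```

Removing a segment that a live instance still uses will crash it, so the `nattch` check matters before any `ipcrm`.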
I'm kind of hoping someone will have seen this or something similar before and be able to give me a friendly nudge in the right direction :)

Here's the extract from cluster.conf:

From resources:

From service block:

In /var/log/cluster/rgmanager.log I see the following:

Feb 25 01:06:47 rgmanager [oralistener] Validating configuration for Listener
Feb 25 01:06:49 rgmanager [orainstance] Validating configuration for EDB
Feb 25 01:07:07 rgmanager [orainstance] Validating configuration for BB
Feb 25 01:07:15 rgmanager [orainstance] Validating configuration for MYDB
Feb 25 01:07:16 rgmanager start on orainstance "MYDB" returned 1 (generic error)
Feb 25 01:07:16 rgmanager #68: Failed to start service:my-cluster-01-db; return value: 1
Feb 25 01:07:16 rgmanager Stopping service service:my-cluster-01-db

I can also see Oracle processes running, for the listener and the two other databases, so that part is OK.

I've added extra syslog debugging, and this is what I see in /var/log/messages:

Feb 25 01:07:15 dbserv02 rgmanager[50083]: [orainstance] Validating configuration for MYDB
Feb 25 01:07:15 dbserv02 logger[50127]: Validating configuration for MYDB
Feb 25 01:07:15 dbserv02 logger[50135]: Validation checks for MYDB succeeded
Feb 25 01:07:15 dbserv02 logger[50136]: Starting service MYDB
Feb 25 01:07:15 dbserv02 logger[50137]: Starting Oracle DB MYDB
Feb 25 01:07:16 dbserv02 logger[50167]: [MYDB] [0] sent set heading off;\nstartup;\nquit;\n
Feb 25 01:07:16 dbserv02 logger[50168]: [MYDB] [0] got ORA-01081: cannot start already-running ORACLE - shut it down first
Feb 25 01:07:16 dbserv02 logger[50172]: Starting Oracle DB MYDB failed, found errors in stdout
Feb 25 01:07:16 dbserv02 logger[50173]: Starting service MYDB failed
Feb 25 01:07:16 dbserv02 rgmanager[7467]: start on orainstance "MYDB" returned 1 (generic error)
Feb 25 01:07:16 dbserv02 rgmanager[7467]: #68: Failed to start service:my-cluster-01-db; return value: 1
Feb 25 01:07:16 dbserv02 rgmanager[7467]: Stopping
service service:my-cluster-01-db

I know the non-working database instance has previously been running fine on the node we now see the problem with. I guess something must have changed, but I'm currently not sure where I should look.

Oh, one problem we did see: initially none of the databases would run on this node because someone had decided to remove the user "oracle" from the "dba" group.

Regards
Eivind Olsen

From emi2fast at gmail.com Sat Feb 27 00:10:11 2016
From: emi2fast at gmail.com (emmanuel segura)
Date: Sat, 27 Feb 2016 01:10:11 +0100
Subject: [Linux-cluster] Why will only 2 of 3 Oracle instances start?
In-Reply-To: <1456482478.3796.17.camel@aminor.no>
References: <1456482478.3796.17.camel@aminor.no>
Message-ID:

First of all, I'm not a DBA, but that error means that Oracle did not remove its shared memory during the shutdown process, so remove the stale shared memory segment.

2016-02-26 11:27 GMT+01:00 Eivind Olsen :
> Hello.
>
> I currently have an issue with a cluster. It's running RHEL 6 (I
> believe it's 6.7), with the normal Ricci, Luci, etc.
>
> The cluster has been working fine for over a year, until it suddenly
> didn't work fine anymore. Basically, it has three Oracle databases
> running, from the same Oracle-home, but with different data-directories.
> Both the Oracle software and datafiles are on shared
> storage (SAN), using clvm (so it's an active/passive setup, only one of
> the two nodes can have the disk, filesystem etc.).
>
> There are three databases, and two of them will run fine on either
> node. The problem is the third instance, which will only run on one of
> the nodes. The short story is: I get told "ORA-01081: cannot start
> already-running ORACLE - shut it down first" on the non-working
> database.
> I'm kind of hoping someone will have seen this or something
> similar before and be able to give me a friendly nudge in the right
> direction :)
>
> Here's the extract from cluster.conf:
>
> From resources:
> user="oracle"/>
> user="oracle"/>
> user="oracle"/>
> user="oracle"/>
>
> From service block:
>
> In /var/log/cluster/rgmanager.log I see the following:
> Feb 25 01:06:47 rgmanager [oralistener] Validating configuration for Listener
> Feb 25 01:06:49 rgmanager [orainstance] Validating configuration for EDB
> Feb 25 01:07:07 rgmanager [orainstance] Validating configuration for BB
> Feb 25 01:07:15 rgmanager [orainstance] Validating configuration for MYDB
> Feb 25 01:07:16 rgmanager start on orainstance "MYDB" returned 1 (generic error)
> Feb 25 01:07:16 rgmanager #68: Failed to start service:my-cluster-01-db; return value: 1
> Feb 25 01:07:16 rgmanager Stopping service service:my-cluster-01-db
>
> I can also see Oracle processes running, for the listener and the two
> other databases, so that part is OK.
>
> I've added extra syslog debugging, and this is what I see in
> /var/log/messages:
>
> Feb 25 01:07:15 dbserv02 rgmanager[50083]: [orainstance] Validating configuration for MYDB
> Feb 25 01:07:15 dbserv02 logger[50127]: Validating configuration for MYDB
> Feb 25 01:07:15 dbserv02 logger[50135]: Validation checks for MYDB succeeded
> Feb 25 01:07:15 dbserv02 logger[50136]: Starting service MYDB
> Feb 25 01:07:15 dbserv02 logger[50137]: Starting Oracle DB MYDB
> Feb 25 01:07:16 dbserv02 logger[50167]: [MYDB] [0] sent set heading off;\nstartup;\nquit;\n
> Feb 25 01:07:16 dbserv02 logger[50168]: [MYDB] [0] got ORA-01081: cannot start already-running ORACLE - shut it down first
> Feb 25 01:07:16 dbserv02 logger[50172]: Starting Oracle DB MYDB failed, found errors in stdout
> Feb 25 01:07:16 dbserv02 logger[50173]: Starting service MYDB failed
> Feb 25 01:07:16 dbserv02 rgmanager[7467]: start on orainstance "MYDB" returned 1 (generic error)
> Feb 25 01:07:16 dbserv02 rgmanager[7467]: #68: Failed to start service:my-cluster-01-db; return value: 1
> Feb 25 01:07:16 dbserv02 rgmanager[7467]: Stopping service service:my-cluster-01-db
>
> I know the non-working database instance has previously been running
> fine on the node we now see the problem with. I guess something must
> have changed, but I'm currently not sure where I should look.
> Oh, one problem we did see: initially none of the databases would
> run on this node because someone had decided to remove the user
> "oracle" from the "dba" group.
>
> Regards
> Eivind Olsen
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

--
.~.
/V\
// \\
/( )\
^`~'^

From richter at richtercloud.de Sun Feb 28 23:35:29 2016
From: richter at richtercloud.de (Karl-Philipp Richter)
Date: Mon, 29 Feb 2016 00:35:29 +0100
Subject: [Linux-cluster] make fails "config/libs/libccsconfdb/libccs.so: undefined reference to `confdb_key_iter_typed2'"
Message-ID: <56D38441.5040603@richtercloud.de>

`./configure && make` fails with

cc -o ccs_tool ccs_tool.o editconf.o -L/mnt/main/sources/cluster/config/libs/libccsconfdb -lccs `xml2-config --libs` -L/usr/lib
/mnt/main/sources/cluster/config/libs/libccsconfdb/libccs.so: undefined reference to `confdb_key_iter_typed2'
/mnt/main/sources/cluster/config/libs/libccsconfdb/libccs.so: undefined reference to `confdb_key_get_typed2'
collect2: error: ld returned 1 exit status
Makefile:29: recipe for target 'ccs_tool' failed
make[3]: *** [ccs_tool] Error 1

See https://travis-ci.org/krichter722/cluster/builds/112480859 for details. Also experienced on Ubuntu 15.10, with cluster-3.2.0-25-g720cbde.

From richter at richtercloud.de Sun Feb 28 23:39:20 2016
From: richter at richtercloud.de (Karl-Philipp Richter)
Date: Mon, 29 Feb 2016 00:39:20 +0100
Subject: [Linux-cluster] [PATCH] added initial .travis.yml
Message-ID: <56D38528.9020308@richtercloud.de>

Support for the https://travis-ci.org CI service.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-added-initial-.travis.yml.patch
Type: text/x-patch
Size: 780 bytes
Desc: not available
URL:
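Undefined references to `confdb_key_iter_typed2` and `confdb_key_get_typed2` usually suggest that the installed corosync libraries are older than what this cluster tree expects at link time. A hedged way to check whether the libconfdb on a system actually exports those symbols; the library path in the comment is an assumption and would need adjusting:

```shell
# has_dynsym LIB SYMBOL: succeed when shared library LIB exports SYMBOL
# in its dynamic symbol table.
has_dynsym() {
    nm -D "$1" 2>/dev/null | grep -qw "$2"
}

# For this build failure one would check something like (path assumed):
#   has_dynsym /usr/lib/libconfdb.so confdb_key_iter_typed2 \
#       || echo "installed corosync lacks confdb_key_iter_typed2"
```

If the symbol is missing, building against a newer corosync (or pointing the build at the matching development headers and libraries) is the likely fix.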