From hunters1094 at gmail.com Wed Sep 2 11:56:42 2015 From: hunters1094 at gmail.com (=?UTF-8?B?Tmd1eeG7hW4gVHLGsOG7nW5nIFPGoW4=?=) Date: Wed, 2 Sep 2015 18:56:42 +0700 Subject: [Linux-cluster] [Linux cluster] DLM not start Message-ID: Dear all I have 2 nodes deployed cluster with gfs2, my storage is FC with multipath. I run like tutorial in http://clusterlabs.org/doc/Cluster_from_Scratch.pdf # pcs status Cluster name: clustered Last updated: Wed Sep 2 18:40:28 2015 Last change: Wed Sep 2 18:40:02 2015 Stack: corosync Current DC: node02 (2) - partition with quorum Version: 1.1.12-a14efad 2 Nodes configured 0 Resources configured Online: [ node01 node02 ] Full list of resources: PCSD Status: node01: Online node02: Online Daemon Status: corosync: active/enabled pacemaker: active/enabled pcsd: active/enabled When i create resource dlm: # pcs cluster cib dlm_cfg # pcs -f dlm_cfg resource create dlm ocf:pacemaker:controld op monitor interval=60s # pcs -f dlm_cfg resource clone dlm clone-max=2 clone-node-max=1 # pcs -f dlm_cfg resource show # pcs cluster cib-push dlm_cfg # pcs status (get error in the resources section) Full list of resources: Clone Set: dlm-clone [dlm] Stopped: [ node01 node02 ] Failed actions: dlm_start_0 on node01 'not configured' (6): call=69, status=complete, exit-reason='none', last-rc-change='Wed Sep 2 18:47:13 2015', queued=1ms, exec=50ms dlm_start_0 on node02 'not configured' (6): call=65, status=complete, exit-reason='none', last-rc-change='Wed Sep 2 18:47:13 2015', queued=0ms, exec=50ms And in the /var/log/pacemaker.log get error controld(dlm)[24304]: 2015/09/02_18:47:13 ERROR: The cluster property stonith-enabled may not be deactivated to use the DLM Sep 02 18:47:13 [4204] node01 lrmd: info: log_finished: finished - rsc:dlm action:start call_id:65 pid:24304 exit-code:6 exec-time:50ms queue-time:0ms Sep 02 18:47:14 [4207] node01 crmd: info: action_synced_wait: Managed controld_meta-data_0 process 24329 exited with rc=0 Sep 02 18:47:14 [4207] node01 crmd: notice: process_lrm_event: Operation dlm_start_0: not configured (node=node01, call=65, rc=6, cib-update=75, confirmed=true) Sep 02 18:47:14 [4202] node01 cib: info: cib_process_request: Forwarding cib_modify operation for section status to master (origin=local/crmd/75) Sep 02 18:47:14 [4202] node01 cib: info: cib_perform_op: Diff: --- 0.54.17 2 Sep 02 18:47:14 [4202] node01 cib: info: cib_perform_op: Diff: +++ 0.54.18 (null) Sep 02 18:47:14 [4202] node01 cib: info: cib_perform_op: + /cib: @num_updates=18 Sep 02 18:47:14 [4202] node01 cib: info: cib_perform_op: + /cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources/lrm_resource[@id='dlm']/lrm_rsc_op[@id='dlm_last_0']: @operation_key=dlm_start_0, @operation=start, @transition-key=7:159:0:5d440f4a-656a-4bb0-8c9b-0ed09d22c7f5, @transition-magic=0:6;7:159:0:5d440f4a-656a-4bb0-8c9b-0ed09d22c7f5, @call-id=69, @rc-code=6, @exec-time=50, @queue-time=1 Sep 02 18:47:14 [4202] node01 cib: info: cib_perform_op: ++ /cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources/lrm_resource[@id='dlm']: INFINITY from node02 Sep 02 18:47:14 [4205] node01 attrd: info: attrd_peer_update: Setting last-failure-dlm[node02]: (null) -> 1441194434 from node02 Sep 02 18:47:14 [4202] node01 cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=node01/crmd/75, version=0.54.19) Thank you very much. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From bubble at hoster-ok.com Wed Sep 2 12:14:27 2015 From: bubble at hoster-ok.com (Vladislav Bogdanov) Date: Wed, 2 Sep 2015 15:14:27 +0300 Subject: [Linux-cluster] GFS2 'No space left on device' while df reports 1gb is free Message-ID: <55E6E823.6060302@hoster-ok.com> Hi, I've got weird state on GFS2 (activated only on one node from the very beginning, but with dlm locking), when I'm unable to write with 'No space left on device' error, but df -m reports: /dev/mapper/vg_shared-lv_shared 570875 569622 1254 100% /storage/staging Umount/mount doesn't help, umount/fsck/rmmod/mount also does nothing. That is centos6 with 2.6.32-504.30.3.el6.x86_64 kernel. What could be the reason for such desync? Is there a way to fix that? Any help is appreciated, thank you, Vladislav From rpeterso at redhat.com Wed Sep 2 12:36:49 2015 From: rpeterso at redhat.com (Bob Peterson) Date: Wed, 2 Sep 2015 08:36:49 -0400 (EDT) Subject: [Linux-cluster] GFS2 'No space left on device' while df reports 1gb is free In-Reply-To: <55E6E823.6060302@hoster-ok.com> References: <55E6E823.6060302@hoster-ok.com> Message-ID: <1100302577.21379582.1441197409207.JavaMail.zimbra@redhat.com> ----- Original Message ----- > Hi, > > I've got weird state on GFS2 (activated only on one node from the very > beginning, but with dlm locking), when I'm unable to write with 'No > space left on device' error, but df -m reports: > /dev/mapper/vg_shared-lv_shared 570875 569622 1254 100% /storage/staging > > Umount/mount doesn't help, umount/fsck/rmmod/mount also does nothing. > > That is centos6 with 2.6.32-504.30.3.el6.x86_64 kernel. > > What could be the reason for such desync? > Is there a way to fix that? > > Any help is appreciated, > > thank you, > Vladislav Hi Vladislav, It sounds like maybe your system statfs file has gotten out of sync with the actual free space. We've seen this before, and have bugzilla records open to fix it. https://bugzilla.redhat.com/show_bug.cgi?id=1191219 Ordinarily that should not prevent blocks from being allocated, because unlinked dinodes should be automatically reclaimed as needed. It could be a fragmentation issue: Maybe there's enough free space, but the free space is too fragmented to allow for a required block allocation. So it is difficult to say what exactly is going on. If you want to send me your file system metadata, I'd be happy to examine it and let you know what I find. This can be saved with: gfs2_edit savemeta The resulting files are often too big to email, so you may need to put it on an ftp server or something instead. Also, bear in mind that GFS2 has a severe performance penalty when your file system is nearly full. The less free space available, the more time it takes to find free space. So you'll probably get much better performance if you make the file system bigger (lvextend the volume then gfs2_grow). 
Regards, Bob Peterson Red Hat File Systems From bubble at hoster-ok.com Wed Sep 2 13:17:40 2015 From: bubble at hoster-ok.com (Vladislav Bogdanov) Date: Wed, 2 Sep 2015 16:17:40 +0300 Subject: [Linux-cluster] GFS2 'No space left on device' while df reports 1gb is free In-Reply-To: <1100302577.21379582.1441197409207.JavaMail.zimbra@redhat.com> References: <55E6E823.6060302@hoster-ok.com> <1100302577.21379582.1441197409207.JavaMail.zimbra@redhat.com> Message-ID: <55E6F6F4.5010506@hoster-ok.com> 02.09.2015 15:36, Bob Peterson wrote: > ----- Original Message ----- >> Hi, >> >> I've got weird state on GFS2 (activated only on one node from the very >> beginning, but with dlm locking), when I'm unable to write with 'No >> space left on device' error, but df -m reports: >> /dev/mapper/vg_shared-lv_shared 570875 569622 1254 100% /storage/staging >> >> Umount/mount doesn't help, umount/fsck/rmmod/mount also does nothing. >> >> That is centos6 with 2.6.32-504.30.3.el6.x86_64 kernel. >> >> What could be the reason for such desync? >> Is there a way to fix that? >> >> Any help is appreciated, >> >> thank you, >> Vladislav > > Hi Vladislav, Thank you Bob for a rapid answer! > > It sounds like maybe your system statfs file has gotten out of sync with > the actual free space. We've seen this before, and have bugzilla records > open to fix it. https://bugzilla.redhat.com/show_bug.cgi?id=1191219 That is not public unfortunately. Anyways, I'm willing to help to fix that as the project I'm currently work on is relying on GFS2 to be fast, stable and free from such defects. > > Ordinarily that should not prevent blocks from being allocated, because > unlinked dinodes should be automatically reclaimed as needed. > It could be a fragmentation issue: Maybe there's enough free space, but > the free space is too fragmented to allow for a required block allocation. > > So it is difficult to say what exactly is going on. If you want to send > me your file system metadata, I'd be happy to examine it and let you > know what I find. This can be saved with: gfs2_edit savemeta > The resulting files are often too big to email, so you may need to put > it on an ftp server or something instead. Yes, that is a filesystem used only for tests so I can provide its metadata. I'll sent a link to you when it is ready. Should fs be unmounted btw? > > Also, bear in mind that GFS2 has a severe performance penalty when your > file system is nearly full. The less free space available, the more time > it takes to find free space. So you'll probably get much better performance > if you make the file system bigger (lvextend the volume then gfs2_grow). That is actually a desired behavior, more, I use controlled IO throttling when it becomes full. That is temporary tier-0 area, all files go to another storage asynchronously. Thank you, Vladislav From rpeterso at redhat.com Wed Sep 2 13:34:33 2015 From: rpeterso at redhat.com (Bob Peterson) Date: Wed, 2 Sep 2015 09:34:33 -0400 (EDT) Subject: [Linux-cluster] GFS2 'No space left on device' while df reports 1gb is free In-Reply-To: <55E6F6F4.5010506@hoster-ok.com> References: <55E6E823.6060302@hoster-ok.com> <1100302577.21379582.1441197409207.JavaMail.zimbra@redhat.com> <55E6F6F4.5010506@hoster-ok.com> Message-ID: <2133866406.21458107.1441200873479.JavaMail.zimbra@redhat.com> ----- Original Message ----- (snip) > Yes, that is a filesystem used only for tests so I can provide its > metadata. I'll sent a link to you when it is ready. > Should fs be unmounted btw? 
Yes, the file system should be unmounted from all nodes before doing gfs2_edit savemeta. Is your work going to be available on RHEL7/Centos7 or is it just RHEL6/Centos6? We've done many performance improvements in RHEL7 that have not been back-ported to RHEL6, so if you're after better performance and reliability, you may consider RHEL7/Centos7. Regards, Bob Peterson Red Hat File Systems From emi2fast at gmail.com Wed Sep 2 13:39:49 2015 From: emi2fast at gmail.com (emmanuel segura) Date: Wed, 2 Sep 2015 15:39:49 +0200 Subject: [Linux-cluster] [Linux cluster] DLM not start In-Reply-To: References: Message-ID: please, use fencing. 2015-09-02 13:56 GMT+02:00 Nguy?n Tr??ng S?n : > Dear all > > I have 2 nodes deployed cluster with gfs2, my storage is FC with multipath. > > I run like tutorial in http://clusterlabs.org/doc/Cluster_from_Scratch.pdf > > # pcs status > > Cluster name: clustered > Last updated: Wed Sep 2 18:40:28 2015 > Last change: Wed Sep 2 18:40:02 2015 > Stack: corosync > Current DC: node02 (2) - partition with quorum > Version: 1.1.12-a14efad > 2 Nodes configured > 0 Resources configured > > > Online: [ node01 node02 ] > > Full list of resources: > > > PCSD Status: > node01: Online > node02: Online > > Daemon Status: > corosync: active/enabled > pacemaker: active/enabled > pcsd: active/enabled > > When i create resource dlm: > > # pcs cluster cib dlm_cfg > # pcs -f dlm_cfg resource create dlm ocf:pacemaker:controld op monitor > interval=60s > # pcs -f dlm_cfg resource clone dlm clone-max=2 clone-node-max=1 > # pcs -f dlm_cfg resource show > # pcs cluster cib-push dlm_cfg > > # pcs status (get error in the resources section) > > Full list of resources: > > Clone Set: dlm-clone [dlm] > Stopped: [ node01 node02 ] > > Failed actions: > dlm_start_0 on node01 'not configured' (6): call=69, status=complete, > exit-reason='none', last-rc-change='Wed Sep 2 18:47:13 2015', queued=1ms, > exec=50ms > dlm_start_0 on node02 'not configured' (6): call=65, status=complete, > exit-reason='none', last-rc-change='Wed Sep 2 18:47:13 2015', queued=0ms, > exec=50ms > > And in the /var/log/pacemaker.log get error > > controld(dlm)[24304]: 2015/09/02_18:47:13 ERROR: The cluster property > stonith-enabled may not be deactivated to use the DLM > Sep 02 18:47:13 [4204] node01 lrmd: info: log_finished: > finished - rsc:dlm action:start call_id:65 pid:24304 exit-code:6 > exec-time:50ms queue-time:0ms > Sep 02 18:47:14 [4207] node01 crmd: info: action_synced_wait: > Managed controld_meta-data_0 process 24329 exited with rc=0 > Sep 02 18:47:14 [4207] node01 crmd: notice: process_lrm_event: > Operation dlm_start_0: not configured (node=node01, call=65, rc=6, > cib-update=75, confirmed=true) > Sep 02 18:47:14 [4202] node01 cib: info: cib_process_request: > Forwarding cib_modify operation for section status to master > (origin=local/crmd/75) > Sep 02 18:47:14 [4202] node01 cib: info: cib_perform_op: > Diff: --- 0.54.17 2 > Sep 02 18:47:14 [4202] node01 cib: info: cib_perform_op: > Diff: +++ 0.54.18 (null) > Sep 02 18:47:14 [4202] node01 cib: info: cib_perform_op: + > /cib: @num_updates=18 > Sep 02 18:47:14 [4202] node01 cib: info: cib_perform_op: + > /cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources/lrm_resource[@id='dlm']/lrm_rsc_op[@id='dlm_last_0']: > @operation_key=dlm_start_0, @operation=start, > @transition-key=7:159:0:5d440f4a-656a-4bb0-8c9b-0ed09d22c7f5, > @transition-magic=0:6;7:159:0:5d440f4a-656a-4bb0-8c9b-0ed09d22c7f5, > @call-id=69, @rc-code=6, @exec-time=50, @queue-time=1 > Sep 02 
18:47:14 [4202] node01 cib: info: cib_perform_op: ++ > /cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources/lrm_resource[@id='dlm']: > operation="start" crm-debug-origin="do_update_resource" > crm_feature_set="3.0.9" > transition-key="7:159:0:5d440f4a-656a-4bb0-8c9b-0ed09d22c7f5" > transition-magic="0:6;7:159:0:5d440f4a-656a-4bb0-8c9b-0ed09d22c7f5" > call-id="69" rc-code="6" op-status="0" interval="0" last > Sep 02 18:47:14 [4202] node01 cib: info: cib_process_request: > Completed cib_modify operation for section status: OK (rc=0, > origin=node02/crmd/555, version=0.54.18) > Sep 02 18:47:14 [4202] node01 cib: info: cib_perform_op: > Diff: --- 0.54.18 2 > Sep 02 18:47:14 [4202] node01 cib: info: cib_perform_op: > Diff: +++ 0.54.19 (null) > Sep 02 18:47:14 [4202] node01 cib: info: cib_perform_op: + > /cib: @num_updates=19 > Sep 02 18:47:14 [4202] node01 cib: info: cib_perform_op: + > /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='dlm']/lrm_rsc_op[@id='dlm_last_0']: > @operation_key=dlm_start_0, @operation=start, > @transition-key=9:159:0:5d440f4a-656a-4bb0-8c9b-0ed09d22c7f5, > @transition-magic=0:6;9:159:0:5d440f4a-656a-4bb0-8c9b-0ed09d22c7f5, > @call-id=65, @rc-code=6, @exec-time=50 > Sep 02 18:47:14 [4202] node01 cib: info: cib_perform_op: ++ > /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='dlm']: > operation="start" crm-debug-origin="do_update_resource" > crm_feature_set="3.0.9" > transition-key="9:159:0:5d440f4a-656a-4bb0-8c9b-0ed09d22c7f5" > transition-magic="0:6;9:159:0:5d440f4a-656a-4bb0-8c9b-0ed09d22c7f5" > call-id="65" rc-code="6" op-status="0" interval="0" last > Sep 02 18:47:14 [4205] node01 attrd: info: attrd_peer_update: > Setting fail-count-dlm[node02]: (null) -> INFINITY from node02 > Sep 02 18:47:14 [4205] node01 attrd: info: attrd_peer_update: > Setting last-failure-dlm[node02]: (null) -> 1441194434 from node02 > Sep 02 18:47:14 [4202] node01 cib: info: cib_process_request: > Completed cib_modify operation for section status: OK (rc=0, > origin=node01/crmd/75, version=0.54.19) > > > Thank you very much. > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -- .~. /V\ // \\ /( )\ ^`~'^ From bubble at hoster-ok.com Wed Sep 2 13:52:17 2015 From: bubble at hoster-ok.com (Vladislav Bogdanov) Date: Wed, 2 Sep 2015 16:52:17 +0300 Subject: [Linux-cluster] GFS2 'No space left on device' while df reports 1gb is free In-Reply-To: <2133866406.21458107.1441200873479.JavaMail.zimbra@redhat.com> References: <55E6E823.6060302@hoster-ok.com> <1100302577.21379582.1441197409207.JavaMail.zimbra@redhat.com> <55E6F6F4.5010506@hoster-ok.com> <2133866406.21458107.1441200873479.JavaMail.zimbra@redhat.com> Message-ID: <55E6FF11.5090002@hoster-ok.com> 02.09.2015 16:34, Bob Peterson wrote: > ----- Original Message ----- > (snip) >> Yes, that is a filesystem used only for tests so I can provide its >> metadata. I'll sent a link to you when it is ready. >> Should fs be unmounted btw? > > Yes, the file system should be unmounted from all nodes before doing > gfs2_edit savemeta. > > Is your work going to be available on RHEL7/Centos7 or is it just RHEL6/Centos6? > We've done many performance improvements in RHEL7 that have not been > back-ported to RHEL6, so if you're after better performance and reliability, > you may consider RHEL7/Centos7. 
That is centos6 with backported cluster stack (latest corosync, pacemaker, dlm, clvmd rebuilt for corosync2, experimental rewrite of gfs_controld for corosync2). Thank you for hint about performance, unfortunately my customer is not willing to upgrade to centos7 yet (that is really huge task), but that could be a valid reason for future releases if we see insufficient performance. Current performance tests (at least pure throughput) with DRBD over IPoIB and dedicated corosync/dlm link are satisfying (10Gbps link is fully saturated by NFS/CIFS clients). And we do not have many metadata operations on GFS2 level. Do you have any plans to backport that improvements btw? Thank you, Vladislav From hunters1094 at gmail.com Wed Sep 2 13:58:06 2015 From: hunters1094 at gmail.com (=?UTF-8?B?Tmd1eeG7hW4gVHLGsOG7nW5nIFPGoW4=?=) Date: Wed, 2 Sep 2015 20:58:06 +0700 Subject: [Linux-cluster] [Linux cluster] DLM not start In-Reply-To: References: Message-ID: How can i use fencing? Do you mean "pcs -f dlm_cfg resource create dlm ocf:pacemaker:controld op monitor interval=60s on-fail=fence" It is still error. I have Centos 7.0, with pacemaker-1.1.12-22.el7_1.2.x86_64 2015-09-02 20:39 GMT+07:00 emmanuel segura : > please, use fencing. > > 2015-09-02 13:56 GMT+02:00 Nguy?n Tr??ng S?n : > > Dear all > > > > I have 2 nodes deployed cluster with gfs2, my storage is FC with > multipath. > > > > I run like tutorial in > http://clusterlabs.org/doc/Cluster_from_Scratch.pdf > > > > # pcs status > > > > Cluster name: clustered > > Last updated: Wed Sep 2 18:40:28 2015 > > Last change: Wed Sep 2 18:40:02 2015 > > Stack: corosync > > Current DC: node02 (2) - partition with quorum > > Version: 1.1.12-a14efad > > 2 Nodes configured > > 0 Resources configured > > > > > > Online: [ node01 node02 ] > > > > Full list of resources: > > > > > > PCSD Status: > > node01: Online > > node02: Online > > > > Daemon Status: > > corosync: active/enabled > > pacemaker: active/enabled > > pcsd: active/enabled > > > > When i create resource dlm: > > > > # pcs cluster cib dlm_cfg > > # pcs -f dlm_cfg resource create dlm ocf:pacemaker:controld op monitor > > interval=60s > > # pcs -f dlm_cfg resource clone dlm clone-max=2 clone-node-max=1 > > # pcs -f dlm_cfg resource show > > # pcs cluster cib-push dlm_cfg > > > > # pcs status (get error in the resources section) > > > > Full list of resources: > > > > Clone Set: dlm-clone [dlm] > > Stopped: [ node01 node02 ] > > > > Failed actions: > > dlm_start_0 on node01 'not configured' (6): call=69, status=complete, > > exit-reason='none', last-rc-change='Wed Sep 2 18:47:13 2015', > queued=1ms, > > exec=50ms > > dlm_start_0 on node02 'not configured' (6): call=65, status=complete, > > exit-reason='none', last-rc-change='Wed Sep 2 18:47:13 2015', > queued=0ms, > > exec=50ms > > > > And in the /var/log/pacemaker.log get error > > > > controld(dlm)[24304]: 2015/09/02_18:47:13 ERROR: The cluster property > > stonith-enabled may not be deactivated to use the DLM > > Sep 02 18:47:13 [4204] node01 lrmd: info: log_finished: > > finished - rsc:dlm action:start call_id:65 pid:24304 exit-code:6 > > exec-time:50ms queue-time:0ms > > Sep 02 18:47:14 [4207] node01 crmd: info: action_synced_wait: > > Managed controld_meta-data_0 process 24329 exited with rc=0 > > Sep 02 18:47:14 [4207] node01 crmd: notice: process_lrm_event: > > Operation dlm_start_0: not configured (node=node01, call=65, rc=6, > > cib-update=75, confirmed=true) > > Sep 02 18:47:14 [4202] node01 cib: info: cib_process_request: > > Forwarding 
cib_modify operation for section status to master > > (origin=local/crmd/75) > > Sep 02 18:47:14 [4202] node01 cib: info: cib_perform_op: > > Diff: --- 0.54.17 2 > > Sep 02 18:47:14 [4202] node01 cib: info: cib_perform_op: > > Diff: +++ 0.54.18 (null) > > Sep 02 18:47:14 [4202] node01 cib: info: cib_perform_op: + > > /cib: @num_updates=18 > > Sep 02 18:47:14 [4202] node01 cib: info: cib_perform_op: + > > > /cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources/lrm_resource[@id='dlm']/lrm_rsc_op[@id='dlm_last_0']: > > @operation_key=dlm_start_0, @operation=start, > > @transition-key=7:159:0:5d440f4a-656a-4bb0-8c9b-0ed09d22c7f5, > > @transition-magic=0:6;7:159:0:5d440f4a-656a-4bb0-8c9b-0ed09d22c7f5, > > @call-id=69, @rc-code=6, @exec-time=50, @queue-time=1 > > Sep 02 18:47:14 [4202] node01 cib: info: cib_perform_op: > ++ > > > /cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources/lrm_resource[@id='dlm']: > > > operation="start" crm-debug-origin="do_update_resource" > > crm_feature_set="3.0.9" > > transition-key="7:159:0:5d440f4a-656a-4bb0-8c9b-0ed09d22c7f5" > > transition-magic="0:6;7:159:0:5d440f4a-656a-4bb0-8c9b-0ed09d22c7f5" > > call-id="69" rc-code="6" op-status="0" interval="0" last > > Sep 02 18:47:14 [4202] node01 cib: info: cib_process_request: > > Completed cib_modify operation for section status: OK (rc=0, > > origin=node02/crmd/555, version=0.54.18) > > Sep 02 18:47:14 [4202] node01 cib: info: cib_perform_op: > > Diff: --- 0.54.18 2 > > Sep 02 18:47:14 [4202] node01 cib: info: cib_perform_op: > > Diff: +++ 0.54.19 (null) > > Sep 02 18:47:14 [4202] node01 cib: info: cib_perform_op: + > > /cib: @num_updates=19 > > Sep 02 18:47:14 [4202] node01 cib: info: cib_perform_op: + > > > /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='dlm']/lrm_rsc_op[@id='dlm_last_0']: > > @operation_key=dlm_start_0, @operation=start, > > @transition-key=9:159:0:5d440f4a-656a-4bb0-8c9b-0ed09d22c7f5, > > @transition-magic=0:6;9:159:0:5d440f4a-656a-4bb0-8c9b-0ed09d22c7f5, > > @call-id=65, @rc-code=6, @exec-time=50 > > Sep 02 18:47:14 [4202] node01 cib: info: cib_perform_op: > ++ > > > /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='dlm']: > > > operation="start" crm-debug-origin="do_update_resource" > > crm_feature_set="3.0.9" > > transition-key="9:159:0:5d440f4a-656a-4bb0-8c9b-0ed09d22c7f5" > > transition-magic="0:6;9:159:0:5d440f4a-656a-4bb0-8c9b-0ed09d22c7f5" > > call-id="65" rc-code="6" op-status="0" interval="0" last > > Sep 02 18:47:14 [4205] node01 attrd: info: attrd_peer_update: > > Setting fail-count-dlm[node02]: (null) -> INFINITY from node02 > > Sep 02 18:47:14 [4205] node01 attrd: info: attrd_peer_update: > > Setting last-failure-dlm[node02]: (null) -> 1441194434 from node02 > > Sep 02 18:47:14 [4202] node01 cib: info: cib_process_request: > > Completed cib_modify operation for section status: OK (rc=0, > > origin=node01/crmd/75, version=0.54.19) > > > > > > Thank you very much. > > > > > > -- > > Linux-cluster mailing list > > Linux-cluster at redhat.com > > https://www.redhat.com/mailman/listinfo/linux-cluster > > > > -- > .~. > /V\ > // \\ > /( )\ > ^`~'^ > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster -- ************************************** Nguy?n Tr??ng S?n Tin3K50 - H? th?ng th?ng tin K50 ?HBK H? N?i Mobile: 0904010635 Y!M: hunters_1094 -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From lists at alteeve.ca Wed Sep 2 16:28:58 2015 From: lists at alteeve.ca (Digimer) Date: Wed, 2 Sep 2015 12:28:58 -0400 Subject: [Linux-cluster] [Linux cluster] DLM not start In-Reply-To: References: Message-ID: <55E723CA.4040306@alteeve.ca> On 02/09/15 09:58 AM, Nguy?n Tr??ng S?n wrote: > How can i use fencing? > > Do you mean "pcs -f dlm_cfg resource create dlm ocf:pacemaker:controld > op monitor interval=60s on-fail=fence" > > > It is still error. > > I have Centos 7.0, with pacemaker-1.1.12-22.el7_1.2.x86_64 Fencing is a process where a lost node is removed from the cluster, usually by rebooting it with IPMI, cutting power using a switched PDU, etc. How you exactly do fencing depends on your environment and potential fence devices you have. DLM requires working fencing. -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? From bubble at hoster-ok.com Wed Sep 9 14:24:20 2015 From: bubble at hoster-ok.com (Vladislav Bogdanov) Date: Wed, 9 Sep 2015 17:24:20 +0300 Subject: [Linux-cluster] GFS2 'No space left on device' while df reports 1gb is free In-Reply-To: <1100302577.21379582.1441197409207.JavaMail.zimbra@redhat.com> References: <55E6E823.6060302@hoster-ok.com> <1100302577.21379582.1441197409207.JavaMail.zimbra@redhat.com> Message-ID: <55F04114.7040309@hoster-ok.com> Hi Bob, 02.09.2015 15:36, Bob Peterson wrote: > ----- Original Message ----- >> Hi, >> >> I've got weird state on GFS2 (activated only on one node from the very >> beginning, but with dlm locking), when I'm unable to write with 'No >> space left on device' error, but df -m reports: >> /dev/mapper/vg_shared-lv_shared 570875 569622 1254 100% /storage/staging >> >> Umount/mount doesn't help, umount/fsck/rmmod/mount also does nothing. >> >> That is centos6 with 2.6.32-504.30.3.el6.x86_64 kernel. >> >> What could be the reason for such desync? >> Is there a way to fix that? >> >> Any help is appreciated, >> >> thank you, >> Vladislav > > Hi Vladislav, > > It sounds like maybe your system statfs file has gotten out of sync with > the actual free space. We've seen this before, and have bugzilla records > open to fix it. https://bugzilla.redhat.com/show_bug.cgi?id=1191219 > Is there something I can do to help solving this issue? Best, Vladislav > Ordinarily that should not prevent blocks from being allocated, because > unlinked dinodes should be automatically reclaimed as needed. > It could be a fragmentation issue: Maybe there's enough free space, but > the free space is too fragmented to allow for a required block allocation. > > So it is difficult to say what exactly is going on. If you want to send > me your file system metadata, I'd be happy to examine it and let you > know what I find. This can be saved with: gfs2_edit savemeta > The resulting files are often too big to email, so you may need to put > it on an ftp server or something instead. > > Also, bear in mind that GFS2 has a severe performance penalty when your > file system is nearly full. The less free space available, the more time > it takes to find free space. So you'll probably get much better performance > if you make the file system bigger (lvextend the volume then gfs2_grow). 
> > Regards, > > Bob Peterson > Red Hat File Systems > From francisconp at gmail.com Wed Sep 9 18:12:28 2015 From: francisconp at gmail.com (Franciscon) Date: Wed, 09 Sep 2015 18:12:28 +0000 Subject: [Linux-cluster] GFS2 'No space left on device' while df reports 1gb is free In-Reply-To: <55F04114.7040309@hoster-ok.com> References: <55E6E823.6060302@hoster-ok.com> <1100302577.21379582.1441197409207.JavaMail.zimbra@redhat.com> <55F04114.7040309@hoster-ok.com> Message-ID: Try to check the number of inodes using "df -i", that can be 100%. If it's true, you need to change the max number of inodes, or remove some files. On Wed, Sep 9, 2015 at 11:29 AM Vladislav Bogdanov wrote: > Hi Bob, > > 02.09.2015 15:36, Bob Peterson wrote: > > ----- Original Message ----- > >> Hi, > >> > >> I've got weird state on GFS2 (activated only on one node from the very > >> beginning, but with dlm locking), when I'm unable to write with 'No > >> space left on device' error, but df -m reports: > >> /dev/mapper/vg_shared-lv_shared 570875 569622 1254 100% /storage/staging > >> > >> Umount/mount doesn't help, umount/fsck/rmmod/mount also does nothing. > >> > >> That is centos6 with 2.6.32-504.30.3.el6.x86_64 kernel. > >> > >> What could be the reason for such desync? > >> Is there a way to fix that? > >> > >> Any help is appreciated, > >> > >> thank you, > >> Vladislav > > > > Hi Vladislav, > > > > It sounds like maybe your system statfs file has gotten out of sync with > > the actual free space. We've seen this before, and have bugzilla records > > open to fix it. https://bugzilla.redhat.com/show_bug.cgi?id=1191219 > > > > Is there something I can do to help solving this issue? > > Best, > Vladislav > > > Ordinarily that should not prevent blocks from being allocated, because > > unlinked dinodes should be automatically reclaimed as needed. > > It could be a fragmentation issue: Maybe there's enough free space, but > > the free space is too fragmented to allow for a required block > allocation. > > > > So it is difficult to say what exactly is going on. If you want to send > > me your file system metadata, I'd be happy to examine it and let you > > know what I find. This can be saved with: gfs2_edit savemeta > > > The resulting files are often too big to email, so you may need to put > > it on an ftp server or something instead. > > > > Also, bear in mind that GFS2 has a severe performance penalty when your > > file system is nearly full. The less free space available, the more time > > it takes to find free space. So you'll probably get much better > performance > > if you make the file system bigger (lvextend the volume then gfs2_grow). > > > > > > > Regards, > > > > Bob Peterson > > Red Hat File Systems > > > > -- > Linux-cluster mailing list > Linux-cluster at redhat.com > https://www.redhat.com/mailman/listinfo/linux-cluster > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From rpeterso at redhat.com Wed Sep 9 18:29:56 2015 From: rpeterso at redhat.com (Bob Peterson) Date: Wed, 9 Sep 2015 14:29:56 -0400 (EDT) Subject: [Linux-cluster] GFS2 'No space left on device' while df reports 1gb is free In-Reply-To: <55F04114.7040309@hoster-ok.com> References: <55E6E823.6060302@hoster-ok.com> <1100302577.21379582.1441197409207.JavaMail.zimbra@redhat.com> <55F04114.7040309@hoster-ok.com> Message-ID: <662601044.25900031.1441823396908.JavaMail.zimbra@redhat.com> ----- Original Message ----- > Hi Bob, > > 02.09.2015 15:36, Bob Peterson wrote: > > ----- Original Message ----- > >> Hi, > >> > >> I've got weird state on GFS2 (activated only on one node from the very > >> beginning, but with dlm locking), when I'm unable to write with 'No > >> space left on device' error, but df -m reports: > >> /dev/mapper/vg_shared-lv_shared 570875 569622 1254 100% /storage/staging > >> > >> Umount/mount doesn't help, umount/fsck/rmmod/mount also does nothing. > >> > >> That is centos6 with 2.6.32-504.30.3.el6.x86_64 kernel. > >> > >> What could be the reason for such desync? > >> Is there a way to fix that? > >> > >> Any help is appreciated, > >> > >> thank you, > >> Vladislav > > > > Hi Vladislav, > > > > It sounds like maybe your system statfs file has gotten out of sync with > > the actual free space. We've seen this before, and have bugzilla records > > open to fix it. https://bugzilla.redhat.com/show_bug.cgi?id=1191219 > > > > Is there something I can do to help solving this issue? > > Best, > Vladislav Hi Vladislav, I'm sorry. I've been delayed here by urgent customer issues. I hope to have more information shortly. Regards, Bob Peterson Red Hat File Systems From rpeterso at redhat.com Wed Sep 9 20:22:28 2015 From: rpeterso at redhat.com (Bob Peterson) Date: Wed, 9 Sep 2015 16:22:28 -0400 (EDT) Subject: [Linux-cluster] GFS2 'No space left on device' while df reports 1gb is free In-Reply-To: <55F04114.7040309@hoster-ok.com> References: <55E6E823.6060302@hoster-ok.com> <1100302577.21379582.1441197409207.JavaMail.zimbra@redhat.com> <55F04114.7040309@hoster-ok.com> Message-ID: <336411051.25982805.1441830148096.JavaMail.zimbra@redhat.com> ----- Original Message ----- > Hi Bob, > > Is there something I can do to help solving this issue? > > Best, > Vladislav Hi Vladislav, I took a look at your gfs2 file system metadata. There is nothing corrupt or in error on your file system. The system statfs file is totally correct. The reason you cannot create any files is because there isn't a single resource group in your file system that can satisfy a block allocation request. The reason is: gfs2 needs to allocate multiple blocks at a time for "worst case scenario" and none of your resource groups contain enough blocks for the "worst case". A big part of the problem is that your file system uses the absolute minimum resource group size of 32MB (-r32 was used on mkfs.gfs2), and so there are 17847 of them, with minimal sized bitmaps. GFS2 cannot allocate the very last several blocks of a resource group because of the calculations used for worst case. Because your resource groups are so small, you're basically compounding the problem: it can't allocate blocks from a LOT of resource groups. Normally, your file system should have bigger resource groups, and fewer of them. If you used a normal resource group size, like 128MB, or 256MB, or even 2048MB, a much higher percent of the file system would be usable because there were be fewer resource groups to cover the same area. Does that make sense? 
If you do mkfs.gfs2 and specify -r512, you will be able to use much more of the file system, and it won't get into this problem until much later. In the past, I've actually looked into whether we can revise the calculations used by gfs2 for worst-case block allocations. I've still got some patches on my system for it. But even if we do it, it won't improve a lot, and it will take a long time to trickle out to customers. Regards, Bob Peterson Red Hat File Systems From bubble at hoster-ok.com Thu Sep 10 12:12:10 2015 From: bubble at hoster-ok.com (Vladislav Bogdanov) Date: Thu, 10 Sep 2015 15:12:10 +0300 Subject: [Linux-cluster] GFS2 'No space left on device' while df reports 1gb is free In-Reply-To: <336411051.25982805.1441830148096.JavaMail.zimbra@redhat.com> References: <55E6E823.6060302@hoster-ok.com> <1100302577.21379582.1441197409207.JavaMail.zimbra@redhat.com> <55F04114.7040309@hoster-ok.com> <336411051.25982805.1441830148096.JavaMail.zimbra@redhat.com> Message-ID: <55F1739A.6070005@hoster-ok.com> Thank you Bob for your analyses! 09.09.2015 23:22, Bob Peterson wrote: > ----- Original Message ----- >> Hi Bob, >> >> Is there something I can do to help solving this issue? >> >> Best, >> Vladislav > > Hi Vladislav, > > I took a look at your gfs2 file system metadata. > There is nothing corrupt or in error on your file system. The system > statfs file is totally correct. > > The reason you cannot create any files is because there isn't a single > resource group in your file system that can satisfy a block allocation > request. The reason is: gfs2 needs to allocate multiple blocks at a time > for "worst case scenario" and none of your resource groups contain > enough blocks for the "worst case". Is there a paper which describes that "worst case"? I did not know about that allocation subtleties. > > A big part of the problem is that your file system uses the absolute > minimum resource group size of 32MB (-r32 was used on mkfs.gfs2), and so > there are 17847 of them, with minimal sized bitmaps. GFS2 cannot allocate > the very last several blocks of a resource group because of the calculations > used for worst case. Because your resource groups are so small, you're > basically compounding the problem: it can't allocate blocks from a LOT > of resource groups. Heh, another person blindly copied parameters I usually use for very small filesystems. And we used that for tests in virtualized environments with limited space. As part of testing we tried to grow GFS2 and found that with quite big resource groups on small enough block devices we loose significant amount of space because remaining space is insufficient to fit one more rg. For example, with two 8MB journals, grow ~256+2*8 fs to ~300MB. If we had 128MB groups that was failing. with 32MB ones it succeeded. Well, that just means that one size does not fit all. > > Normally, your file system should have bigger resource groups, and fewer > of them. If you used a normal resource group size, like 128MB, or 256MB, > or even 2048MB, a much higher percent of the file system would be usable > because there were be fewer resource groups to cover the same area. > Does that make sense? Sure. Is it safe enough to just drop that '-r' parameter from mkfs command line for production filesystems? I suspect there will be attempts to migrate to a much bigger block devices (f.e. 1TB -> 20TB), but I'd do not concentrate on them now... 
> > If you do mkfs.gfs2 and specify -r512, you will be able to use much more > of the file system, and it won't get into this problem until much later. What could be the rule of thumb for prediction of such errors? I mean at which point (in MB or %) we should start to care that we may get such error, depending on a rg size? Is there a point until which we definitely won't get them? Thank you very much, Vladislav > > In the past, I've actually looked into whether we can revise the > calculations used by gfs2 for worst-case block allocations. I've still > got some patches on my system for it. But even if we do it, it won't > improve a lot, and it will take a long time to trickle out to customers. > > Regards, > > Bob Peterson > Red Hat File Systems > From rpeterso at redhat.com Thu Sep 10 13:37:25 2015 From: rpeterso at redhat.com (Bob Peterson) Date: Thu, 10 Sep 2015 09:37:25 -0400 (EDT) Subject: [Linux-cluster] GFS2 'No space left on device' while df reports 1gb is free In-Reply-To: <55F1739A.6070005@hoster-ok.com> References: <55E6E823.6060302@hoster-ok.com> <1100302577.21379582.1441197409207.JavaMail.zimbra@redhat.com> <55F04114.7040309@hoster-ok.com> <336411051.25982805.1441830148096.JavaMail.zimbra@redhat.com> <55F1739A.6070005@hoster-ok.com> Message-ID: <790690044.26295572.1441892245359.JavaMail.zimbra@redhat.com> ----- Original Message ----- > Is there a paper which describes that "worst case"? I did not know about > that allocation subtleties. Unfortunately, no. (snip) > Heh, another person blindly copied parameters I usually use for very > small filesystems. And we used that for tests in virtualized > environments with limited space. As part of testing we tried to grow > GFS2 and found that with quite big resource groups on small enough block > devices we loose significant amount of space because remaining space is > insufficient to fit one more rg. For example, with two 8MB journals, > grow ~256+2*8 fs to ~300MB. If we had 128MB groups that was failing. > with 32MB ones it succeeded. > > Well, that just means that one size does not fit all. Indeed. > Sure. > Is it safe enough to just drop that '-r' parameter from mkfs command > line for production filesystems? Yes, as a rule using the default parameter for -r is best. You can also get a performance problem if your resource group size is too big. I discovered this and documented it in Bugzilla bug #1154782 (which may be private). https://bugzilla.redhat.com/show_bug.cgi?id=1154782 Basically, if the rgrp size is max size (2GB), you will have 33 blocks of bitmaps per rgrp. That corresponds to a LOT of page cache lookups for every bitmap operation, which kills performance. I've written a patch that fixes that, but it's not available yet, and when it is, will only be for RHEL7.2 and above. So for now you have to find a happy medium between too many rgrps with a loss of usable space, and rgrps that are too big, with loss of performance. Using the default for -r is generally best. > I suspect there will be attempts to migrate to a much bigger block > devices (f.e. 1TB -> 20TB), but I'd do not concentrate on them now... > > > > > If you do mkfs.gfs2 and specify -r512, you will be able to use much more > > of the file system, and it won't get into this problem until much later. > > What could be the rule of thumb for prediction of such errors? > I mean at which point (in MB or %) we should start to care that we may > get such error, depending on a rg size? Is there a point until which we > definitely won't get them? 
Well, that's a sliding scale, and the calculations are messy. (Hence the need to clean them up). We always recommend implementing GFS2 in a test environment first before putting it into production, so you can try these things to see what works best for your use case. Regards, Bob Peterson Red Hat File Systems From tuckerd at lyle.smu.edu Thu Sep 10 13:49:47 2015 From: tuckerd at lyle.smu.edu (Tucker, Doug) Date: Thu, 10 Sep 2015 13:49:47 +0000 Subject: [Linux-cluster] cannot delete directories Message-ID: <0E68E4F81C2E784CA8367421AE2D6CEF37140966@SXMB2PG.SYSTEMS.SMU.EDU> Recently in an attempt to rm -Rf on a client that nfs mounts our RH6 resilient storage with gfs2 file system to remove old users directories I have run into the problem where it deletes all content down the tree, leaving all directories empty but it cannot delete the directory because "the directory is not empty". Even though it is. Any ideas how to correct this? -- Sincerely, Doug Tucker From rpeterso at redhat.com Thu Sep 10 14:10:34 2015 From: rpeterso at redhat.com (Bob Peterson) Date: Thu, 10 Sep 2015 10:10:34 -0400 (EDT) Subject: [Linux-cluster] cannot delete directories In-Reply-To: <0E68E4F81C2E784CA8367421AE2D6CEF37140966@SXMB2PG.SYSTEMS.SMU.EDU> References: <0E68E4F81C2E784CA8367421AE2D6CEF37140966@SXMB2PG.SYSTEMS.SMU.EDU> Message-ID: <1624700939.26322665.1441894234057.JavaMail.zimbra@redhat.com> ----- Original Message ----- > Recently in an attempt to rm -Rf on a client that nfs mounts our RH6 > resilient storage with gfs2 file system to remove old users directories > I have run into the problem where it deletes all content down the tree, > leaving all directories empty but it cannot delete the directory because > "the directory is not empty". Even though it is. Any ideas how to > correct this? > -- > Sincerely, > > Doug Tucker Hi Doug, Hm. That's odd. Are you trying to rmdir through nfs? Or through the gfs2 server? Sounds like you were using nfs, which ought to work. 1. Make sure lsof doesn't show any open files for that directory. 2. Make sure the directory itself isn't being exported via nfs. 3. Make sure there aren't hidden files via ls -a 4. Make sure there aren't any kernel errors on the GFS2 server in dmesg 5. Try doing rmdir on the gfs2 server to see if it works. Regards, Bob Peterson Red Hat File Systems From daniel.dehennin at baby-gnu.org Fri Sep 11 14:02:37 2015 From: daniel.dehennin at baby-gnu.org (Daniel Dehennin) Date: Fri, 11 Sep 2015 16:02:37 +0200 Subject: [Linux-cluster] cLVM: LVM commands take severl minutes to complete Message-ID: <87twr1giki.fsf@hati.baby-gnu.org> Hello, On a two node cluster Ubuntu Trusty: - Linux nebula3 3.13.0-63-generic #103-Ubuntu SMP Fri Aug 14 21:42:59 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux - corosync 2.3.3-1ubuntu1 - pacemaker 1.1.10+git20130802-1ubuntu2.3 - dlm 4.0.1-0ubuntu1 - clvm 2.02.98-6ubuntu2 - gfs2-utils 3.1.6-0ubuntu1 The LVM commands take minutes to complete: root at nebula3:~# time vgs Error locking on node 40a8e784: Command timed out Error locking on node 40a8e784: Command timed out Error locking on node 40a8e784: Command timed out VG #PV #LV #SN Attr VSize VFree nebula3-vg 1 4 0 wz--n- 133,52g 0 one-fs 1 1 0 wz--nc 2,00t 0 one-production 1 0 0 wz--nc 1023,50g 1023,50g real 5m40.233s user 0m0.005s sys 0m0.018s Do you know where I can look to find what's going on? 
Here are some informations: root at nebula3:~# corosync-quorumtool Quorum information ------------------ Date: Fri Sep 11 15:57:17 2015 Quorum provider: corosync_votequorum Nodes: 2 Node ID: 1084811139 Ring ID: 1460 Quorate: Yes Votequorum information ---------------------- Expected votes: 2 Highest expected: 2 Total votes: 2 Quorum: 1 Flags: 2Node Quorate WaitForAll LastManStanding Membership information ---------------------- Nodeid Votes Name 1084811139 1 192.168.231.131 (local) 1084811140 1 192.168.231.132 root at nebula3:~# dlm_tool ls dlm lockspaces name datastores id 0x1b61ba6a flags 0x00000000 change member 2 joined 1 remove 0 failed 0 seq 1,1 members 1084811139 1084811140 name clvmd id 0x4104eefa flags 0x00000000 change member 2 joined 1 remove 0 failed 0 seq 1,1 members 1084811139 1084811140 root at nebula3:~# dlm_tool status cluster nodeid 1084811139 quorate 1 ring seq 1460 1460 daemon now 11026 fence_pid 0 node 1084811139 M add 455 rem 0 fail 0 fence 0 at 0 0 node 1084811140 M add 455 rem 0 fail 0 fence 0 at 0 0 Regards. -- Daniel Dehennin R?cup?rer ma clef GPG: gpg --recv-keys 0xCC1E9E5B7A6FE2DF Fingerprint: 3E69 014E 5C23 50E8 9ED6 2AAD CC1E 9E5B 7A6F E2DF -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 342 bytes Desc: not available URL: From bubble at hoster-ok.com Fri Sep 11 15:28:22 2015 From: bubble at hoster-ok.com (Vladislav Bogdanov) Date: Fri, 11 Sep 2015 18:28:22 +0300 Subject: [Linux-cluster] cLVM: LVM commands take severl minutes to complete In-Reply-To: <87twr1giki.fsf@hati.baby-gnu.org> References: <87twr1giki.fsf@hati.baby-gnu.org> Message-ID: <55F2F316.3040709@hoster-ok.com> 11.09.2015 17:02, Daniel Dehennin wrote: > Hello, > > On a two node cluster Ubuntu Trusty: > > - Linux nebula3 3.13.0-63-generic #103-Ubuntu SMP Fri Aug 14 21:42:59 > UTC 2015 x86_64 x86_64 x86_64 GNU/Linux > > - corosync 2.3.3-1ubuntu1 > > - pacemaker 1.1.10+git20130802-1ubuntu2.3 > > - dlm 4.0.1-0ubuntu1 > > - clvm 2.02.98-6ubuntu2 You need newer version of this^ 2.02.102 is known to include commit 431eda6 without which cluster is unusable in degraded state (and even if one node is put to standby state). You see timeouts with two nodes online, so that is the different issue, but that above will not hurt. > > - gfs2-utils 3.1.6-0ubuntu1 > > > The LVM commands take minutes to complete: > > root at nebula3:~# time vgs > Error locking on node 40a8e784: Command timed out > Error locking on node 40a8e784: Command timed out > Error locking on node 40a8e784: Command timed out > VG #PV #LV #SN Attr VSize VFree > nebula3-vg 1 4 0 wz--n- 133,52g 0 > one-fs 1 1 0 wz--nc 2,00t 0 > one-production 1 0 0 wz--nc 1023,50g 1023,50g > > real 5m40.233s > user 0m0.005s > sys 0m0.018s > > Do you know where I can look to find what's going on? > > Here are some informations: > > root at nebula3:~# corosync-quorumtool > Quorum information > ------------------ > Date: Fri Sep 11 15:57:17 2015 > Quorum provider: corosync_votequorum > Nodes: 2 > Node ID: 1084811139 > Ring ID: 1460 > Quorate: Yes > > Votequorum information > ---------------------- > Expected votes: 2 > Highest expected: 2 > Total votes: 2 > Quorum: 1 > Flags: 2Node Quorate WaitForAll LastManStanding Better use two_node: 1 in votequorum section. That implies wait_for_all and supersedes last_man_standing for two-node clusters. I'd also recommend to set clear_node_high_bit in totem section, do you use it? 
But even better is to add nodelist section to corosync.conf with manually specified nodeid's. Everything else looks fine... > > Membership information > ---------------------- > Nodeid Votes Name > 1084811139 1 192.168.231.131 (local) > 1084811140 1 192.168.231.132 > > > root at nebula3:~# dlm_tool ls > dlm lockspaces > name datastores > id 0x1b61ba6a > flags 0x00000000 > change member 2 joined 1 remove 0 failed 0 seq 1,1 > members 1084811139 1084811140 > > name clvmd > id 0x4104eefa > flags 0x00000000 > change member 2 joined 1 remove 0 failed 0 seq 1,1 > members 1084811139 1084811140 > > > root at nebula3:~# dlm_tool status > cluster nodeid 1084811139 quorate 1 ring seq 1460 1460 > daemon now 11026 fence_pid 0 > node 1084811139 M add 455 rem 0 fail 0 fence 0 at 0 0 > node 1084811140 M add 455 rem 0 fail 0 fence 0 at 0 0 > > > Regards. > > > From daniel.dehennin at baby-gnu.org Fri Sep 11 17:10:50 2015 From: daniel.dehennin at baby-gnu.org (Daniel Dehennin) Date: Fri, 11 Sep 2015 19:10:50 +0200 Subject: [Linux-cluster] cLVM: LVM commands take severl minutes to complete In-Reply-To: <55F2F316.3040709@hoster-ok.com> (Vladislav Bogdanov's message of "Fri, 11 Sep 2015 18:28:22 +0300") References: <87twr1giki.fsf@hati.baby-gnu.org> <55F2F316.3040709@hoster-ok.com> Message-ID: <87mvwshof9.fsf@hati.baby-gnu.org> Vladislav Bogdanov writes: > You need newer version of this^ > > 2.02.102 is known to include commit 431eda6 without which cluster is > unusable in degraded state (and even if one node is put to standby > state). > > You see timeouts with two nodes online, so that is the different > issue, but that above will not hurt. Thanks for suggestion, I'll try to see what I can do. > Better use two_node: 1 in votequorum section. > That implies wait_for_all and supersedes last_man_standing for > two-node clusters. Already done: #+begin_src conf quorum { # Quorum for the Pacemaker Cluster Resource Manager provider: corosync_votequorum # Number of bare metal hosts, VM are managed by pacemaker and # ?expected_votes? will increase when they get started expected_votes: 2 # Two node mode two_node: 1 # Pacemaker resources (so VMs) will not be started until # number of nodes is equal to ?expected_votes? wait_for_all: 1 last_man_standing: 1 } #+end_src > I'd also recommend to set clear_node_high_bit in totem section, do you > use it? Yes. > But even better is to add nodelist section to corosync.conf with > manually specified nodeid's. Already done, but without ids: #+begin_src conf nodelist { node { ring0_addr: 192.168.231.131 name: nebula3 } node { ring0_addr: 192.168.231.132 name: nebula4 } } #+end_src > Everything else looks fine... Thanks. I wonder how to see where it fails before succeeding. Regards. -- Daniel Dehennin R?cup?rer ma clef GPG: gpg --recv-keys 0xCC1E9E5B7A6FE2DF Fingerprint: 3E69 014E 5C23 50E8 9ED6 2AAD CC1E 9E5B 7A6F E2DF -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 342 bytes Desc: not available URL: From bubble at hoster-ok.com Fri Sep 11 20:57:17 2015 From: bubble at hoster-ok.com (Vladislav Bogdanov) Date: Fri, 11 Sep 2015 23:57:17 +0300 Subject: [Linux-cluster] cLVM: LVM commands take severl minutes to complete In-Reply-To: <87mvwshof9.fsf@hati.baby-gnu.org> References: <87twr1giki.fsf@hati.baby-gnu.org> <55F2F316.3040709@hoster-ok.com> <87mvwshof9.fsf@hati.baby-gnu.org> Message-ID: <4AB908B5-DBC1-410D-95C8-F43D54DF9717@hoster-ok.com> 11 ???????? 2015 ?. 
20:10:50 GMT+03:00, Daniel Dehennin ?????: >Vladislav Bogdanov writes: > >> You need newer version of this^ >> >> 2.02.102 is known to include commit 431eda6 without which cluster is >> unusable in degraded state (and even if one node is put to standby >> state). >> >> You see timeouts with two nodes online, so that is the different >> issue, but that above will not hurt. > >Thanks for suggestion, I'll try to see what I can do. > >> Better use two_node: 1 in votequorum section. >> That implies wait_for_all and supersedes last_man_standing for >> two-node clusters. > >Already done: > >#+begin_src conf >quorum { > # Quorum for the Pacemaker Cluster Resource Manager > provider: corosync_votequorum > # Number of bare metal hosts, VM are managed by pacemaker and > # ?expected_votes? will increase when they get started > expected_votes: 2 > > # Two node mode > two_node: 1 > > # Pacemaker resources (so VMs) will not be started until > # number of nodes is equal to ?expected_votes? > wait_for_all: 1 > last_man_standing: 1 >} >#+end_src > >> I'd also recommend to set clear_node_high_bit in totem section, do >you >> use it? > >Yes. > > >> But even better is to add nodelist section to corosync.conf with >> manually specified nodeid's. > >Already done, but without ids: > >#+begin_src conf >nodelist { > node { > ring0_addr: 192.168.231.131 > name: nebula3 > } > node { > ring0_addr: 192.168.231.132 > name: nebula4 > } >} >#+end_src > > >> Everything else looks fine... > >Thanks. > >I wonder how to see where it fails before succeeding. > >Regards. expected_votes is by default inherited from nodelist so you don't need it. last_man_standing is better to remove, it's not needed as well. You can try to run clvmd off-cluster with debug to console and run lvm tools also with debug to get a picture. Please ping me after holydays if you need help on how to do that. From daniel.dehennin at baby-gnu.org Wed Sep 16 14:50:32 2015 From: daniel.dehennin at baby-gnu.org (Daniel Dehennin) Date: Wed, 16 Sep 2015 16:50:32 +0200 Subject: [Linux-cluster] cLVM: LVM commands take severl minutes to complete In-Reply-To: <4AB908B5-DBC1-410D-95C8-F43D54DF9717@hoster-ok.com> (Vladislav Bogdanov's message of "Fri, 11 Sep 2015 23:57:17 +0300") References: <87twr1giki.fsf@hati.baby-gnu.org> <55F2F316.3040709@hoster-ok.com> <87mvwshof9.fsf@hati.baby-gnu.org> <4AB908B5-DBC1-410D-95C8-F43D54DF9717@hoster-ok.com> Message-ID: <87pp1ifmfb.fsf@hati.baby-gnu.org> Vladislav Bogdanov writes: > expected_votes is by default inherited from nodelist so you don't need > it. last_man_standing is better to remove, it's not needed as well. > > You can try to run clvmd off-cluster with debug to console and run lvm > tools also with debug to get a picture. Please ping me after holydays > if you need help on how to do that. Thanks. The cluster is in production, so I can do many things. I ran ?clvmd -S? and it makes it work again. 
Then I could extend my VG, my LV and grow my GFS2, but several minutes later, I had a kernel panic: Sep 16 15:46:28 nebula3 kernel: [442791.286867] BUG: unable to handle kernel NULL pointer dereference at 0000000000000028 Sep 16 15:46:28 nebula3 kernel: [442791.293096] IP: [] gfs2_rbm_find+0xac/0x530 [gfs2] Sep 16 15:46:28 nebula3 kernel: [442791.296507] PGD 0 Sep 16 15:46:28 nebula3 kernel: [442791.299815] Oops: 0000 [#1] SMP Sep 16 15:46:28 nebula3 kernel: [442791.303098] Modules linked in: vhost_net vhost macvtap macvlan gfs2 dlm sctp configfs ip6table_filter ip6_tables iptable_filter ip_tables x_tables openvswitch gre vxlan ip_tunnel nfsd auth_rpcgss nfs_acl nfs lockd sunrpc fscache bonding scsi_dh_emc dm_round_robin ipmi_devintf gpio_ich dcdbas x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper joydev dm_multipath ablk_helper scsi_dh cryptd sb_edac edac_core shpchp mei_me mei lpc_ich mac_hid ipmi_si acpi_power_meter wmi iTCO_wdt iTCO_vendor_support hid_generic usbhid hid ses enclosure qla2xxx ahci libahci scsi_transport_fc bnx2x tg3 scsi_tgt ptp pps_core megaraid_sas mdio libcrc32c Sep 16 15:46:28 nebula3 kernel: [442791.343434] CPU: 14 PID: 27504 Comm: qemu-system-x86 Tainted: G W 3.13.0-63-generic #103-Ubuntu Sep 16 15:46:28 nebula3 kernel: [442791.352841] Hardware name: Dell Inc. PowerEdge M620/0T36VK, BIOS 2.2.7 01/21/2014 Sep 16 15:46:28 nebula3 kernel: [442791.362631] task: ffff880035ddc800 ti: ffff8801303ce000 task.ti: ffff8801303ce000 Sep 16 15:46:28 nebula3 kernel: [442791.372920] RIP: 0010:[] [] gfs2_rbm_find+0xac/0x530 [gfs2] Sep 16 15:46:28 nebula3 kernel: [442791.383625] RSP: 0018:ffff8801303cfae8 EFLAGS: 00010246 Sep 16 15:46:28 nebula3 kernel: [442791.389101] RAX: 0000000000000080 RBX: ffff8801303cfbd0 RCX: ffff880bf4def6e8 Sep 16 15:46:28 nebula3 kernel: [442791.400133] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000005 Sep 16 15:46:28 nebula3 kernel: [442791.411734] RBP: ffff8801303cfb60 R08: 0000000000000001 R09: ffff8800b8d5b850 Sep 16 15:46:28 nebula3 kernel: [442791.423628] R10: 0000000000020328 R11: 000000000005e0c7 R12: 0000000000000000 Sep 16 15:46:28 nebula3 kernel: [442791.435582] R13: ffffffffffffffff R14: ffff8800b8d5b850 R15: ffff880bdf594000 Sep 16 15:46:28 nebula3 kernel: [442791.447748] FS: 00007f22157fa700(0000) GS:ffff880c0fae0000(0000) knlGS:0000000000000000 Sep 16 15:46:28 nebula3 kernel: [442791.460205] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Sep 16 15:46:28 nebula3 kernel: [442791.466482] CR2: 0000000000000028 CR3: 00000001698f4000 CR4: 00000000001427e0 Sep 16 15:46:28 nebula3 kernel: [442791.478799] Stack: Sep 16 15:46:28 nebula3 kernel: [442791.484741] ffff88017966dbf8 ffff8801303cfb10 ffffffffa05c098e 000000002934a8b0 Sep 16 15:46:28 nebula3 kernel: [442791.496710] ffff88112934a860 0000000000000005 0000000000012190 ffff880bed95b000 Sep 16 15:46:28 nebula3 kernel: [442791.508660] 0000000000000000 ffff88112934a890 0000000023dd91c3 0000000000000000 Sep 16 15:46:28 nebula3 kernel: [442791.520688] Call Trace: Sep 16 15:46:28 nebula3 kernel: [442791.526551] [] ? 
gfs2_glock_wait+0x3e/0x80 [gfs2] Sep 16 15:46:28 nebula3 kernel: [442791.532464] [] gfs2_inplace_reserve+0x459/0x9e0 [gfs2] Sep 16 15:46:28 nebula3 kernel: [442791.538327] [] gfs2_write_begin+0x20c/0x470 [gfs2] Sep 16 15:46:28 nebula3 kernel: [442791.544070] [] generic_file_buffered_write+0xf8/0x250 Sep 16 15:46:28 nebula3 kernel: [442791.549757] [] __generic_file_aio_write+0x1c1/0x3d0 Sep 16 15:46:28 nebula3 kernel: [442791.555366] [] generic_file_aio_write+0x58/0xa0 Sep 16 15:46:28 nebula3 kernel: [442791.560909] [] gfs2_file_aio_write+0xb9/0x150 [gfs2] Sep 16 15:46:28 nebula3 kernel: [442791.566473] [] ? hrtimer_get_res+0x50/0x50 Sep 16 15:46:28 nebula3 kernel: [442791.571918] [] do_sync_write+0x5a/0x90 Sep 16 15:46:28 nebula3 kernel: [442791.577235] [] vfs_write+0xb4/0x1f0 Sep 16 15:46:28 nebula3 kernel: [442791.582459] [] SyS_pwrite64+0x72/0xb0 Sep 16 15:46:28 nebula3 kernel: [442791.587547] [] system_call_fastpath+0x1a/0x1f Sep 16 15:46:28 nebula3 kernel: [442791.592576] Code: 34 0f 8d 1d 03 00 00 48 8b 0b 8b 53 0c 48 63 c2 48 8d 34 80 48 8b 41 58 4c 8d 3c f0 49 8b 47 10 a8 02 75 ab 49 8b 17 41 8b 47 18 <48> 03 42 28 48 8b 12 83 e2 01 0f 84 50 04 00 00 80 7c 24 33 02 Sep 16 15:46:28 nebula3 kernel: [442791.607762] RIP [] gfs2_rbm_find+0xac/0x530 [gfs2] Sep 16 15:46:28 nebula3 kernel: [442791.612680] RSP Sep 16 15:46:28 nebula3 kernel: [442791.617436] CR2: 0000000000000028 Sep 16 15:46:28 nebula3 kernel: [442791.628730] ---[ end trace 0f6f4a48b58f5fb0 ]--- After a reboot of the hardware and starting the pacemaker stack, it's running. I just loose some VMs in transient states. Regards. -- Daniel Dehennin R?cup?rer ma clef GPG: gpg --recv-keys 0xCC1E9E5B7A6FE2DF Fingerprint: 3E69 014E 5C23 50E8 9ED6 2AAD CC1E 9E5B 7A6F E2DF -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 342 bytes Desc: not available URL: