From neale at sinenomine.net  Wed Aug 20 04:45:12 2014
From: neale at sinenomine.net (Neale Ferguson)
Date: Wed, 20 Aug 2014 04:45:12 +0000
Subject: [Linux-cluster] clvmd not terminating
Message-ID: <6A0A9B9D-E287-468C-B784-3F0A8DAB4E23@sinenomine.net>

We have a sporadic situation where we are attempting to shutdown/restart both nodes of a two node cluster. One shutdowns completely but one sometimes hangs with:

[root at aude2mq036nabzi ~]# service cman stop
Stopping cluster:
   Leaving fence domain... found dlm lockspace /sys/kernel/dlm/clvmd
fence_tool: cannot leave due to active systems
[FAILED]

When the other node is brought back up it has problems with clvmd:

># pvscan
  connect() failed on local socket: Connection refused
  Internal cluster locking initialisation failed.
  WARNING: Falling back to local file-based locking.
  Volume Groups with the clustered attribute will be inaccessible.

Sometimes it works fine but very occasionally we get the above situation. I've encountered the fence message before, usually when the fence devices were incorrectly configured but it would always fail because of this. Before I get too far into investigation mode I wondered if the above symptoms ring any bells for anyone.

Neale



From ricks at alldigital.com  Wed Aug 20 05:04:56 2014
From: ricks at alldigital.com (ricks)
Date: Tue, 19 Aug 2014 22:04:56 -0700
Subject: [Linux-cluster] clvmd not terminating
Message-ID: <r7ytrr41hwxiir22jyobpqmr.1408511096413@email.android.com>

Just issued. Should take 10-20 minutes to go through.?


Sent from my Verizon Wireless 4G LTE smartphone

<div>-------- Original message --------</div><div>From: Neale Ferguson <neale at sinenomine.net> </div><div>Date:08/19/2014  9:45 PM  (GMT-07:00) </div><div>To: linux clustering <linux-cluster at redhat.com> </div><div>Subject: [Linux-cluster] clvmd not terminating </div><div>
</div>We have a sporadic situation where we are attempting to shutdown/restart both nodes of a two node cluster. One shutdowns completely but one sometimes hangs with:

[root at aude2mq036nabzi ~]# service cman stop
Stopping cluster:
   Leaving fence domain... found dlm lockspace /sys/kernel/dlm/clvmd
fence_tool: cannot leave due to active systems
[FAILED]

When the other node is brought back up it has problems with clvmd:

># pvscan
  connect() failed on local socket: Connection refused
  Internal cluster locking initialisation failed.
  WARNING: Falling back to local file-based locking.
  Volume Groups with the clustered attribute will be inaccessible.

Sometimes it works fine but very occasionally we get the above situation. I've encountered the fence message before, usually when the fence devices were incorrectly configured but it would always fail because of this. Before I get too far into investigation mode I wondered if the above symptoms ring any bells for anyone.

Neale

-- 
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20140819/782e3ecb/attachment.htm>

From ricks at alldigital.com  Wed Aug 20 05:07:16 2014
From: ricks at alldigital.com (ricks)
Date: Tue, 19 Aug 2014 22:07:16 -0700
Subject: [Linux-cluster] clvmd not terminating
Message-ID: <guopjf57t8r0ygaycyb82dbr.1408511236324@email.android.com>

Please ignore my last post. Ruddy phone slid a new message in.


Sent from my Verizon Wireless 4G LTE smartphone

<div>-------- Original message --------</div><div>From: Neale Ferguson <neale at sinenomine.net> </div><div>Date:08/19/2014  9:45 PM  (GMT-07:00) </div><div>To: linux clustering <linux-cluster at redhat.com> </div><div>Subject: [Linux-cluster] clvmd not terminating </div><div>
</div>We have a sporadic situation where we are attempting to shutdown/restart both nodes of a two node cluster. One shutdowns completely but one sometimes hangs with:

[root at aude2mq036nabzi ~]# service cman stop
Stopping cluster:
   Leaving fence domain... found dlm lockspace /sys/kernel/dlm/clvmd
fence_tool: cannot leave due to active systems
[FAILED]

When the other node is brought back up it has problems with clvmd:

># pvscan
  connect() failed on local socket: Connection refused
  Internal cluster locking initialisation failed.
  WARNING: Falling back to local file-based locking.
  Volume Groups with the clustered attribute will be inaccessible.

Sometimes it works fine but very occasionally we get the above situation. I've encountered the fence message before, usually when the fence devices were incorrectly configured but it would always fail because of this. Before I get too far into investigation mode I wondered if the above symptoms ring any bells for anyone.

Neale

-- 
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20140819/6e77aec7/attachment.htm>

From wferi at niif.hu  Fri Aug 22 00:37:44 2014
From: wferi at niif.hu (Ferenc Wagner)
Date: Fri, 22 Aug 2014 02:37:44 +0200
Subject: [Linux-cluster] on exiting maintenance mode
Message-ID: <8761hlickn.fsf@lant.ki.iif.hu>

Hi,

While my Pacemaker cluster was in maintenance mode, resources were moved
(by hand) between the nodes as I rebooted each node in turn.  In the end
the crm status output became perfectly empty, as the reboot of a given
node removed from the output the resources which were located on the
rebooted node at the time of entering maintenance mode.  I expected full
resource discovery on exiting maintenance mode, but it probably did not
happen, as the cluster started up resources already running on other
nodes, which is generally forbidden.  Given that all resources were
running (though possibly migrated during the maintenance), what would
have been the correct way of bringing the cluster out of maintenance
mode?  This should have required no resource actions at all.  Would
cleanup of all resources have helped?  Or is there a better way?
-- 
Thanks,
Feri.



From vasil.val at gmail.com  Tue Aug 26 06:56:29 2014
From: vasil.val at gmail.com (Vasil Valchev)
Date: Tue, 26 Aug 2014 09:56:29 +0300
Subject: [Linux-cluster] totem token & post_fail_delay question
Message-ID: <CAFZxf=L12UCn6nEnd_LtRWL6_P=wOALfArR6DxhL9RU5iA2Tnw@mail.gmail.com>

Hello,

I have a cluster that sometimes has intermittent network issues on the
heartbeat network.
Unfortunately improving the network is not an option, so I am looking for a
way to tolerate longer interruptions.

Previously it seemed to me the post_fail_delay option is suitable, but
after some research it might not be what I am looking for.

If I am correct, when a member leaves (due to token timeout) the cluster
will wait the post_fail_delay before fencing. If the member rejoins before
that, it will still be fenced, because it has previous state?
>From a recent fencing on this cluster there is a strange message:

Aug 24 06:20:45 node2 openais[29048]: [MAIN ] Not killing node node1cl
despite it rejoining the cluster with existing state, it has a lower node ID

What does this mean?

And lastly is increasing the totem token timeout the way to go?


Thanks,
Vasil Valchev
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20140826/4163f619/attachment.htm>

From andrew at beekhof.net  Tue Aug 26 07:40:50 2014
From: andrew at beekhof.net (Andrew Beekhof)
Date: Tue, 26 Aug 2014 17:40:50 +1000
Subject: [Linux-cluster] on exiting maintenance mode
In-Reply-To: <8761hlickn.fsf@lant.ki.iif.hu>
References: <8761hlickn.fsf@lant.ki.iif.hu>
Message-ID: <67506B71-8594-4C16-82C1-F94779F59826@beekhof.net>


On 22 Aug 2014, at 10:37 am, Ferenc Wagner <wferi at niif.hu> wrote:

> Hi,
> 
> While my Pacemaker cluster was in maintenance mode, resources were moved
> (by hand) between the nodes as I rebooted each node in turn.  In the end
> the crm status output became perfectly empty, as the reboot of a given
> node removed from the output the resources which were located on the
> rebooted node at the time of entering maintenance mode.  I expected full
> resource discovery on exiting maintenance mode,

Version and logs?

The discovery usually happens at the point the cluster is started on a node.
Maintenance mode just prevents the cluster from doing anything about it.

> but it probably did not
> happen, as the cluster started up resources already running on other
> nodes, which is generally forbidden.  Given that all resources were
> running (though possibly migrated during the maintenance), what would
> have been the correct way of bringing the cluster out of maintenance
> mode?  This should have required no resource actions at all.  Would
> cleanup of all resources have helped?  Or is there a better way?
> -- 
> Thanks,
> Feri.
> 
> -- 
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 841 bytes
Desc: Message signed with OpenPGP using GPGMail
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20140826/dd66ebe8/attachment.sig>

From emi2fast at gmail.com  Tue Aug 26 08:11:39 2014
From: emi2fast at gmail.com (emmanuel segura)
Date: Tue, 26 Aug 2014 10:11:39 +0200
Subject: [Linux-cluster] totem token & post_fail_delay question
In-Reply-To: <CAFZxf=L12UCn6nEnd_LtRWL6_P=wOALfArR6DxhL9RU5iA2Tnw@mail.gmail.com>
References: <CAFZxf=L12UCn6nEnd_LtRWL6_P=wOALfArR6DxhL9RU5iA2Tnw@mail.gmail.com>
Message-ID: <CAE7pJ3BsQtgeO9pHquqtA_STd_HCTh=RXTrNEMFNQzRSM2A0Tg@mail.gmail.com>

from man fenced

Post-fail delay is the number of seconds the daemon will wait before
fencing any victims after a domain member fails.

It's used for delay the fence action.

2014-08-26 8:56 GMT+02:00 Vasil Valchev <vasil.val at gmail.com>:
> Hello,
>
> I have a cluster that sometimes has intermittent network issues on the
> heartbeat network.
> Unfortunately improving the network is not an option, so I am looking for a
> way to tolerate longer interruptions.
>
> Previously it seemed to me the post_fail_delay option is suitable, but after
> some research it might not be what I am looking for.
>
> If I am correct, when a member leaves (due to token timeout) the cluster
> will wait the post_fail_delay before fencing. If the member rejoins before
> that, it will still be fenced, because it has previous state?
> From a recent fencing on this cluster there is a strange message:
>
> Aug 24 06:20:45 node2 openais[29048]: [MAIN ] Not killing node node1cl
> despite it rejoining the cluster with existing state, it has a lower node ID
>
> What does this mean?
>
> And lastly is increasing the totem token timeout the way to go?
>
>
> Thanks,
> Vasil Valchev
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



-- 
esta es mi vida e me la vivo hasta que dios quiera



From ccaulfie at redhat.com  Tue Aug 26 08:23:14 2014
From: ccaulfie at redhat.com (Christine Caulfield)
Date: Tue, 26 Aug 2014 09:23:14 +0100
Subject: [Linux-cluster] totem token & post_fail_delay question
In-Reply-To: <CAFZxf=L12UCn6nEnd_LtRWL6_P=wOALfArR6DxhL9RU5iA2Tnw@mail.gmail.com>
References: <CAFZxf=L12UCn6nEnd_LtRWL6_P=wOALfArR6DxhL9RU5iA2Tnw@mail.gmail.com>
Message-ID: <53FC43F2.2010003@redhat.com>

On 26/08/14 07:56, Vasil Valchev wrote:
> Hello,
>
> I have a cluster that sometimes has intermittent network issues on the
> heartbeat network.
> Unfortunately improving the network is not an option, so I am looking
> for a way to tolerate longer interruptions.
>
> Previously it seemed to me the post_fail_delay option is suitable, but
> after some research it might not be what I am looking for.
>
> If I am correct, when a member leaves (due to token timeout) the cluster
> will wait the post_fail_delay before fencing. If the member rejoins
> before that, it will still be fenced, because it has previous state?
>  From a recent fencing on this cluster there is a strange message:
>
> Aug 24 06:20:45 node2 openais[29048]: [MAIN ] Not killing node node1cl
> despite it rejoining the cluster with existing state, it has a lower node ID
>
> What does this mean?
>

It's an attempt by cman to sort out which node to kill in the situation 
where a node rejoins too quickly. If both nodes try to send a 'kill' 
message then then both nodes would leave the cluster leaving you with no 
active nodes. So cman (and fencing) prioritise the node with the lowest 
nodeID in an attempt at a tie-break. you should see a corresponding 
message on the other node:
"Killing node %s because it has rejoined the cluster with existing state 
and has higher node ID"


> And lastly is increasing the totem token timeout the way to go?
>

if there is no option for improving the network situation then, yes, 
increasing token timeout is probably your best option.

Chrissie



From emi2fast at gmail.com  Tue Aug 26 10:08:48 2014
From: emi2fast at gmail.com (emmanuel segura)
Date: Tue, 26 Aug 2014 12:08:48 +0200
Subject: [Linux-cluster] totem token & post_fail_delay question
In-Reply-To: <53FC43F2.2010003@redhat.com>
References: <CAFZxf=L12UCn6nEnd_LtRWL6_P=wOALfArR6DxhL9RU5iA2Tnw@mail.gmail.com>
	<53FC43F2.2010003@redhat.com>
Message-ID: <CAE7pJ3Ae29bUz4CRo41Lv+8CSUD9cQC35zr2mP0A__Bvu40eZQ@mail.gmail.com>

i think, you are talking about:

 Post-join delay is the number of seconds the daemon will wait before
fencing any victims after a node joins the domain.



2014-08-26 10:23 GMT+02:00 Christine Caulfield <ccaulfie at redhat.com>:
> On 26/08/14 07:56, Vasil Valchev wrote:
>>
>> Hello,
>>
>> I have a cluster that sometimes has intermittent network issues on the
>> heartbeat network.
>> Unfortunately improving the network is not an option, so I am looking
>> for a way to tolerate longer interruptions.
>>
>> Previously it seemed to me the post_fail_delay option is suitable, but
>> after some research it might not be what I am looking for.
>>
>> If I am correct, when a member leaves (due to token timeout) the cluster
>> will wait the post_fail_delay before fencing. If the member rejoins
>> before that, it will still be fenced, because it has previous state?
>>  From a recent fencing on this cluster there is a strange message:
>>
>> Aug 24 06:20:45 node2 openais[29048]: [MAIN ] Not killing node node1cl
>> despite it rejoining the cluster with existing state, it has a lower node
>> ID
>>
>> What does this mean?
>>
>
> It's an attempt by cman to sort out which node to kill in the situation
> where a node rejoins too quickly. If both nodes try to send a 'kill' message
> then then both nodes would leave the cluster leaving you with no active
> nodes. So cman (and fencing) prioritise the node with the lowest nodeID in
> an attempt at a tie-break. you should see a corresponding message on the
> other node:
> "Killing node %s because it has rejoined the cluster with existing state and
> has higher node ID"
>
>
>
>> And lastly is increasing the totem token timeout the way to go?
>>
>
> if there is no option for improving the network situation then, yes,
> increasing token timeout is probably your best option.
>
> Chrissie
>
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



-- 
esta es mi vida e me la vivo hasta que dios quiera



From wferi at niif.hu  Tue Aug 26 17:40:19 2014
From: wferi at niif.hu (Ferenc Wagner)
Date: Tue, 26 Aug 2014 19:40:19 +0200
Subject: [Linux-cluster] on exiting maintenance mode
In-Reply-To: <67506B71-8594-4C16-82C1-F94779F59826@beekhof.net> (Andrew
	Beekhof's message of "Tue, 26 Aug 2014 17:40:50 +1000")
References: <8761hlickn.fsf@lant.ki.iif.hu>
	<67506B71-8594-4C16-82C1-F94779F59826@beekhof.net>
Message-ID: <87iolf9mkc.fsf@lant.ki.iif.hu>

Andrew Beekhof <andrew at beekhof.net> writes:

> On 22 Aug 2014, at 10:37 am, Ferenc Wagner <wferi at niif.hu> wrote:
>
>> While my Pacemaker cluster was in maintenance mode, resources were moved
>> (by hand) between the nodes as I rebooted each node in turn.  In the end
>> the crm status output became perfectly empty, as the reboot of a given
>> node removed from the output the resources which were located on the
>> rebooted node at the time of entering maintenance mode.  I expected full
>> resource discovery on exiting maintenance mode,
>
> Version and logs?

(The more interesting part comes later, please skip to the theoretical
part if you're short on time. :)

I left those out, as I don't expect the actual behavior to be a bug.
But I experienced this with Pacemaker version 1.1.7.  I know it's old
and it suffers from crmd segfault on entering maintenance mode (cf.
http://thread.gmane.org/gmane.linux.highavailability.user/39121), but
works well generally so I did not get to upgrade it yet.  Now that I
mentioned the crmd segfault: I noted that it died on the DC when I
entered maintenance mode:

crmd: [7452]: info: te_rsc_command: Initiating action 64: cancel vm-tmvp_monitor_60000 on n01 (local)
crmd: [7452]: ERROR: lrm_get_rsc(666): failed to send a getrsc message to lrmd via ch_cmd channel.
crmd: [7452]: ERROR: get_lrm_resource: Could not add resource vm-tmvp to LRM
crmd: [7452]: ERROR: do_lrm_invoke: Invalid resource definition
crmd: [7452]: WARN: do_lrm_invoke: bad input <create_request_adv origin="te_rsc_command" t="crmd" version="3.0.6" subt="request" reference="lrm_invoke-tengine-1408517719-30820" crm_task="lrm_invoke" crm_sys_to="lrmd" crm_sys_from="tengine" crm_host_to="n01" >
crmd: [7452]: WARN: do_lrm_invoke: bad input   <crm_xml >
crmd: [7452]: WARN: do_lrm_invoke: bad input     <rsc_op id="64" operation="cancel" operation_key="vm-tmvp_monitor_60000" on_node="n01" on_node_uuid="n01" transition-key="64:20579:0:1b0a6e79-af5a-41e4-8ced-299371e7922c" >
crmd: [7452]: WARN: do_lrm_invoke: bad input       <primitive id="vm-tmvp" long-id="vm-tmvp" class="ocf" provider="niif" type="TransientDomain" />
crmd: [7452]: info: te_rsc_command: Initiating action 86: cancel vm-wfweb_monitor_60000 on n01 (local)
crmd: [7452]: ERROR: lrm_add_rsc(870): failed to send a addrsc message to lrmd via ch_cmd channel.
crmd: [7452]: ERROR: lrm_get_rsc(666): failed to send a getrsc message to lrmd via ch_cmd channel.
corosync[6966]:   [pcmk  ] info: pcmk_ipc_exit: Client crmd (conn=0x1dc6ea0, async-conn=0x1dc6ea0) left
pacemakerd: [7443]: WARN: Managed crmd process 7452 killed by signal 11 [SIGSEGV - Segmentation violation].
pacemakerd: [7443]: notice: pcmk_child_exit: Child process crmd terminated with signal 11 (pid=7452, rc=0)

However, it got restarted seamlessly, without the node being fenced, so
I did not even notice this until now.  Should this have resulted in the
node being fenced?

But back to the issue at hand.  The Pacemaker shutdown seemed normal,
apart from the bunch of messages like:

crmd: [13794]: ERROR: verify_stopped: Resource vm-web5 was active at shutdown.  You may ignore this error if it is unmanaged.

appearing twice and warnings like:

cib: [7447]: WARN: send_ipc_message: IPC Channel to 13794 is not connected
cib: [7447]: WARN: send_via_callback_channel: Delivery of reply to client 13794/bf6f43a2-70db-40ac-a902-eabc3c12e20d failed
cib: [7447]: WARN: do_local_notify: A-Sync reply to crmd failed: reply failed
corosync[6966]:   [pcmk  ] WARN: route_ais_message: Sending message to local.crmd failed: ipc delivery failed (rc=-2)

On reboot, corosync complained until the some Pacemaker components
started:

corosync[8461]:   [pcmk  ] WARN: route_ais_message: Sending message to local.cib failed: ipc delivery failed (rc=-2)
corosync[8461]:   [pcmk  ] WARN: route_ais_message: Sending message to local.crmd failed: ipc delivery failed (rc=-2)

Pacemaker then probed the resources on the local node (all was inactive):

lrmd: [8946]: info: rsc:stonith-n01 probe[5] (pid 9081)
lrmd: [8946]: info: rsc:dlm:0 probe[6] (pid 9082)
[...]
lrmd: [8946]: info: operation monitor[112] on vm-fir for client 8949: pid 12015 exited with return code 7
crmd: [8949]: info: process_lrm_event: LRM operation vm-fir_monitor_0 (call=112, rc=7, cib-update=130, confirmed=true) not running
attrd: [8947]: notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
attrd: [8947]: notice: attrd_perform_update: Sent update 4: probe_complete=true

Then I cleaned up some resources running on other nodes, which resulted
in those showing up in the crm status output providing log lines like eg.:

crmd: [8949]: WARN: status_from_rc: Action 4 (vm-web5_monitor_0) on n02 failed (target: 7 vs. rc: 0): Error

Finally, I exited maintenance mode, and Pacemaker started every resource
I did not clean up beforehand, concurrently with their already running
instances:

pengine: [8948]: notice: LogActions: Start   vm-web9#011(n03)

I can provide more logs if this behavior is indeed unexpected, but it
looks more like I miss the exact concept of maintenance mode.

> The discovery usually happens at the point the cluster is started on a node.

A local discovery did happen, but it could not find anything, as the
cluster was started by the init scripts, well before any resource could
have been moved to the freshly rebooted node (manually, to free the next
node for rebooting).

> Maintenance mode just prevents the cluster from doing anything about it.

Fine.  So I should have restarted Pacemaker on each node before leaving
maintenance mode, right?  Or is there a better way?  (Unfortunately, I
could not manage the rolling reboot through Pacemaker, as some DLM/cLVM
freeze made the cluster inoperable in its normal way.)

>> but it probably did not happen, as the cluster started up resources
>> already running on other nodes, which is generally forbidden.  Given
>> that all resources were running (though possibly migrated during the
>> maintenance), what would have been the correct way of bringing the
>> cluster out of maintenance mode?  This should have required no
>> resource actions at all.  Would cleanup of all resources have helped?
>> Or is there a better way?

You say in the above thread that resource definitions can be changed:
http://thread.gmane.org/gmane.linux.highavailability.user/39121/focus=39437
Let me quote from there (starting with the words of Ulrich Windl):

>>>> I think it's a common misconception that you can modify cluster
>>>> resources while in maintenance mode:
>>> 
>>> No, you _should_ be able to.  If that's not the case, its a bug.
>> 
>> So the end of maintenance mode starts with a "re-probe"?
>
> No, but it doesn't need to.  
> The policy engine already knows if the resource definitions changed
> and the recurring monitor ops will find out if any are not running.

My experiences show that you may not *move around* resources while in
maintenance mode.  That would indeed require a cluster-wide re-probe,
which does not seem to happen (unless forced some way).  Probably there
was some misunderstanding in the above discussion, I guess Ulrich meant
moving resources when he wrote "modifying cluster resources".  Does this
make sense?
-- 
Thanks,
Feri.



From wferi at niif.hu  Tue Aug 26 20:42:07 2014
From: wferi at niif.hu (Ferenc Wagner)
Date: Tue, 26 Aug 2014 22:42:07 +0200
Subject: [Linux-cluster] locating a starting resource
Message-ID: <871ts39e5c.fsf@lant.ki.iif.hu>

Hi,

crm_resource --locate finds the hosting node of a running (successfully
started) resource just fine.  Is there a way to similarly find out the
location of a resource *being* started, ie. whose resource agent is
already running the start action, but that action is not finished yet?
-- 
Thanks,
Feri.



From andrew at beekhof.net  Tue Aug 26 23:46:18 2014
From: andrew at beekhof.net (Andrew Beekhof)
Date: Wed, 27 Aug 2014 09:46:18 +1000
Subject: [Linux-cluster] locating a starting resource
In-Reply-To: <871ts39e5c.fsf@lant.ki.iif.hu>
References: <871ts39e5c.fsf@lant.ki.iif.hu>
Message-ID: <EBFD09CA-65DD-4759-B7F3-B73039FC6B3F@beekhof.net>


On 27 Aug 2014, at 6:42 am, Ferenc Wagner <wferi at niif.hu> wrote:

> Hi,
> 
> crm_resource --locate finds the hosting node of a running (successfully
> started) resource just fine.  Is there a way to similarly find out the
> location of a resource *being* started, ie. whose resource agent is
> already running the start action, but that action is not finished yet?

You need to set record-pending=true in the op_defaults section.
For some reason this is not yet documented :-/

With this in place, crm_resource will find the correct location


> -- 
> Thanks,
> Feri.
> 
> -- 
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 841 bytes
Desc: Message signed with OpenPGP using GPGMail
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20140827/51565f46/attachment.sig>

From andrew at beekhof.net  Wed Aug 27 04:54:33 2014
From: andrew at beekhof.net (Andrew Beekhof)
Date: Wed, 27 Aug 2014 14:54:33 +1000
Subject: [Linux-cluster] on exiting maintenance mode
In-Reply-To: <87iolf9mkc.fsf@lant.ki.iif.hu>
References: <8761hlickn.fsf@lant.ki.iif.hu>
	<67506B71-8594-4C16-82C1-F94779F59826@beekhof.net>
	<87iolf9mkc.fsf@lant.ki.iif.hu>
Message-ID: <F1686CC9-7920-4B72-981D-D6057EC2B754@beekhof.net>


On 27 Aug 2014, at 3:40 am, Ferenc Wagner <wferi at niif.hu> wrote:

> Andrew Beekhof <andrew at beekhof.net> writes:
> 
>> On 22 Aug 2014, at 10:37 am, Ferenc Wagner <wferi at niif.hu> wrote:
>> 
>>> While my Pacemaker cluster was in maintenance mode, resources were moved
>>> (by hand) between the nodes as I rebooted each node in turn.  In the end
>>> the crm status output became perfectly empty, as the reboot of a given
>>> node removed from the output the resources which were located on the
>>> rebooted node at the time of entering maintenance mode.  I expected full
>>> resource discovery on exiting maintenance mode,
>> 
>> Version and logs?
> 
> (The more interesting part comes later, please skip to the theoretical
> part if you're short on time. :)
> 
> I left those out, as I don't expect the actual behavior to be a bug.
> But I experienced this with Pacemaker version 1.1.7.  I know it's old

No kidding :)

> and it suffers from crmd segfault on entering maintenance mode (cf.
> http://thread.gmane.org/gmane.linux.highavailability.user/39121), but
> works well generally so I did not get to upgrade it yet.  Now that I
> mentioned the crmd segfault: I noted that it died on the DC when I
> entered maintenance mode:
> 
> crmd: [7452]: info: te_rsc_command: Initiating action 64: cancel vm-tmvp_monitor_60000 on n01 (local)
> crmd: [7452]: ERROR: lrm_get_rsc(666): failed to send a getrsc message to lrmd via ch_cmd channel.

That looks like the lrmd died.

> crmd: [7452]: ERROR: get_lrm_resource: Could not add resource vm-tmvp to LRM
> crmd: [7452]: ERROR: do_lrm_invoke: Invalid resource definition
> crmd: [7452]: WARN: do_lrm_invoke: bad input <create_request_adv origin="te_rsc_command" t="crmd" version="3.0.6" subt="request" reference="lrm_invoke-tengine-1408517719-30820" crm_task="lrm_invoke" crm_sys_to="lrmd" crm_sys_from="tengine" crm_host_to="n01" >
> crmd: [7452]: WARN: do_lrm_invoke: bad input   <crm_xml >
> crmd: [7452]: WARN: do_lrm_invoke: bad input     <rsc_op id="64" operation="cancel" operation_key="vm-tmvp_monitor_60000" on_node="n01" on_node_uuid="n01" transition-key="64:20579:0:1b0a6e79-af5a-41e4-8ced-299371e7922c" >
> crmd: [7452]: WARN: do_lrm_invoke: bad input       <primitive id="vm-tmvp" long-id="vm-tmvp" class="ocf" provider="niif" type="TransientDomain" />
> crmd: [7452]: info: te_rsc_command: Initiating action 86: cancel vm-wfweb_monitor_60000 on n01 (local)
> crmd: [7452]: ERROR: lrm_add_rsc(870): failed to send a addrsc message to lrmd via ch_cmd channel.
> crmd: [7452]: ERROR: lrm_get_rsc(666): failed to send a getrsc message to lrmd via ch_cmd channel.
> corosync[6966]:   [pcmk  ] info: pcmk_ipc_exit: Client crmd (conn=0x1dc6ea0, async-conn=0x1dc6ea0) left
> pacemakerd: [7443]: WARN: Managed crmd process 7452 killed by signal 11 [SIGSEGV - Segmentation violation].

Which created a condition in the crmd that it couldn't handle so it crashed too.

> pacemakerd: [7443]: notice: pcmk_child_exit: Child process crmd terminated with signal 11 (pid=7452, rc=0)
> 
> However, it got restarted seamlessly, without the node being fenced, so
> I did not even notice this until now.  Should this have resulted in the
> node being fenced?

Depends how fast the node can respawn.

> 
> But back to the issue at hand.  The Pacemaker shutdown seemed normal,
> apart from the bunch of messages like:
> 
> crmd: [13794]: ERROR: verify_stopped: Resource vm-web5 was active at shutdown.  You may ignore this error if it is unmanaged.

In maintenance mode, everything is unmanaged. So that would be expected.

> 
> appearing twice and warnings like:
> 
> cib: [7447]: WARN: send_ipc_message: IPC Channel to 13794 is not connected
> cib: [7447]: WARN: send_via_callback_channel: Delivery of reply to client 13794/bf6f43a2-70db-40ac-a902-eabc3c12e20d failed
> cib: [7447]: WARN: do_local_notify: A-Sync reply to crmd failed: reply failed
> corosync[6966]:   [pcmk  ] WARN: route_ais_message: Sending message to local.crmd failed: ipc delivery failed (rc=-2)
> 
> On reboot, corosync complained until the some Pacemaker components
> started:
> 
> corosync[8461]:   [pcmk  ] WARN: route_ais_message: Sending message to local.cib failed: ipc delivery failed (rc=-2)
> corosync[8461]:   [pcmk  ] WARN: route_ais_message: Sending message to local.crmd failed: ipc delivery failed (rc=-2)
> 
> Pacemaker then probed the resources on the local node (all was inactive):
> 
> lrmd: [8946]: info: rsc:stonith-n01 probe[5] (pid 9081)
> lrmd: [8946]: info: rsc:dlm:0 probe[6] (pid 9082)
> [...]
> lrmd: [8946]: info: operation monitor[112] on vm-fir for client 8949: pid 12015 exited with return code 7
> crmd: [8949]: info: process_lrm_event: LRM operation vm-fir_monitor_0 (call=112, rc=7, cib-update=130, confirmed=true) not running
> attrd: [8947]: notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
> attrd: [8947]: notice: attrd_perform_update: Sent update 4: probe_complete=true
> 
> Then I cleaned up some resources running on other nodes, which resulted
> in those showing up in the crm status output providing log lines like eg.:
> 
> crmd: [8949]: WARN: status_from_rc: Action 4 (vm-web5_monitor_0) on n02 failed (target: 7 vs. rc: 0): Error
> 
> Finally, I exited maintenance mode, and Pacemaker started every resource
> I did not clean up beforehand, concurrently with their already running
> instances:
> 
> pengine: [8948]: notice: LogActions: Start   vm-web9#011(n03)
> 
> I can provide more logs if this behavior is indeed unexpected, but it
> looks more like I miss the exact concept of maintenance mode.
> 
>> The discovery usually happens at the point the cluster is started on a node.
> 
> A local discovery did happen, but it could not find anything, as the
> cluster was started by the init scripts, well before any resource could
> have been moved to the freshly rebooted node (manually, to free the next
> node for rebooting).

Thats your problem then, you've started resources outside of the control of the cluster.
Two options... recurring monitor actions with role=Stopped would have caught this or you can run crm_resource --cleanup after you've moved resources around.

> 
>> Maintenance mode just prevents the cluster from doing anything about it.
> 
> Fine.  So I should have restarted Pacemaker on each node before leaving
> maintenance mode, right?  Or is there a better way?

See above

>  (Unfortunately, I
> could not manage the rolling reboot through Pacemaker, as some DLM/cLVM
> freeze made the cluster inoperable in its normal way.)
> 
>>> but it probably did not happen, as the cluster started up resources
>>> already running on other nodes, which is generally forbidden.  Given
>>> that all resources were running (though possibly migrated during the
>>> maintenance), what would have been the correct way of bringing the
>>> cluster out of maintenance mode?  This should have required no
>>> resource actions at all.  Would cleanup of all resources have helped?
>>> Or is there a better way?
> 
> You say in the above thread that resource definitions can be changed:
> http://thread.gmane.org/gmane.linux.highavailability.user/39121/focus=39437
> Let me quote from there (starting with the words of Ulrich Windl):
> 
>>>>> I think it's a common misconception that you can modify cluster
>>>>> resources while in maintenance mode:
>>>> 
>>>> No, you _should_ be able to.  If that's not the case, its a bug.
>>> 
>>> So the end of maintenance mode starts with a "re-probe"?
>> 
>> No, but it doesn't need to.  
>> The policy engine already knows if the resource definitions changed
>> and the recurring monitor ops will find out if any are not running.
> 
> My experiences show that you may not *move around* resources while in
> maintenance mode.

Correct

>  That would indeed require a cluster-wide re-probe,
> which does not seem to happen (unless forced some way).  Probably there
> was some misunderstanding in the above discussion, I guess Ulrich meant
> moving resources when he wrote "modifying cluster resources".  Does this
> make sense?

No, I've reasonably sure he meant changing their definitions in the cib.
Or at least thats what I thought he meant at the time.

> -- 
> Thanks,
> Feri.
> 
> -- 
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 841 bytes
Desc: Message signed with OpenPGP using GPGMail
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20140827/5ccbe96f/attachment.sig>

From wferi at niif.hu  Wed Aug 27 17:09:45 2014
From: wferi at niif.hu (Ferenc Wagner)
Date: Wed, 27 Aug 2014 19:09:45 +0200
Subject: [Linux-cluster] on exiting maintenance mode
In-Reply-To: <F1686CC9-7920-4B72-981D-D6057EC2B754@beekhof.net> (Andrew
	Beekhof's message of "Wed, 27 Aug 2014 14:54:33 +1000")
References: <8761hlickn.fsf@lant.ki.iif.hu>
	<67506B71-8594-4C16-82C1-F94779F59826@beekhof.net>
	<87iolf9mkc.fsf@lant.ki.iif.hu>
	<F1686CC9-7920-4B72-981D-D6057EC2B754@beekhof.net>
Message-ID: <87iold7tba.fsf@lant.ki.iif.hu>

Andrew Beekhof <andrew at beekhof.net> writes:

> On 27 Aug 2014, at 3:40 am, Ferenc Wagner <wferi at niif.hu> wrote:
>
>> Andrew Beekhof <andrew at beekhof.net> writes:
>> 
>>> On 22 Aug 2014, at 10:37 am, Ferenc Wagner <wferi at niif.hu> wrote:
>>> 
>>>> While my Pacemaker cluster was in maintenance mode, resources were moved
>>>> (by hand) between the nodes as I rebooted each node in turn.  In the end
>>>> the crm status output became perfectly empty, as the reboot of a given
>>>> node removed from the output the resources which were located on the
>>>> rebooted node at the time of entering maintenance mode.  I expected full
>>>> resource discovery on exiting maintenance mode,
>>
>> I experienced this with Pacemaker version 1.1.7.  I know it's old
>> and it suffers from crmd segfault on entering maintenance mode (cf.
>> http://thread.gmane.org/gmane.linux.highavailability.user/39121), but
>> works well generally so I did not get to upgrade it yet.  Now that I
>> mentioned the crmd segfault: I noted that it died on the DC when I
>> entered maintenance mode:
>> 
>> crmd: [7452]: info: te_rsc_command: Initiating action 64: cancel vm-tmvp_monitor_60000 on n01 (local)
>> crmd: [7452]: ERROR: lrm_get_rsc(666): failed to send a getrsc message to lrmd via ch_cmd channel.
>
> That looks like the lrmd died.

It did not die, at least not fully.  After entering maintenance mode
crmd asked lrmd to cancel the recurring monitor ops for all resources:

08:40:18 crmd: [7452]: info: do_te_invoke: Processing graph 20578 (ref=pe_calc-dc-1408516818-30681) derived from /var/lib/pengine/pe-input-848.bz2
08:40:18 crmd: [7452]: info: te_rsc_command: Initiating action 17: cancel dlm:0_monitor_120000 on n04
08:40:18 crmd: [7452]: info: te_rsc_command: Initiating action 84: cancel dlm:0_cancel_120000 on n01 (local)
08:40:18 lrmd: [7449]: info: cancel_op: operation monitor[194] on dlm:0 for client 7452, its parameters: [...] cancelled
08:40:18 crmd: [7452]: info: te_rsc_command: Initiating action 50: cancel dlm:2_monitor_120000 on n02

The stream of monitor op cancellation messages ended with:

08:40:18 crmd: [7452]: info: te_rsc_command: Initiating action 71: cancel vm-mdssq_monitor_60000 on n01 (local)
08:40:18 lrmd: [7449]: info: cancel_op: operation monitor[329] on vm-mdssq for client 7452, its parameters: [...] cancelled
08:40:18 crmd: [7452]: info: process_lrm_event: LRM operation vm-mdssq_monitor_60000 (call=329, status=1, cib-update=0, confirmed=true) Cancelled
08:40:18 crmd: [7452]: notice: run_graph: ==== Transition 20578 (Complete=87, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pengine/pe-input-848.bz2): Complete
08:40:18 crmd: [7452]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
08:40:18 pengine: [7451]: notice: process_pe_message: Transition 20578: PEngine Input stored in: /var/lib/pengine/pe-input-848.bz2
08:41:28 crmd: [7452]: WARN: action_timer_callback: Timer popped (timeout=10000, abort_level=0, complete=true)
08:41:28 crmd: [7452]: WARN: action_timer_callback: Ignoring timeout while not in transition
[these two lines repeated several times]
08:41:28 crmd: [7452]: WARN: action_timer_callback: Timer popped (timeout=10000, abort_level=0, complete=true)
08:41:28 crmd: [7452]: WARN: action_timer_callback: Ignoring timeout while not in transition
08:41:38 crmd: [7452]: WARN: action_timer_callback: Timer popped (timeout=20000, abort_level=0, complete=true)
08:41:38 crmd: [7452]: WARN: action_timer_callback: Ignoring timeout while not in transition
08:48:05 cib: [7447]: info: cib_stats: Processed 159 operations (23207.00us average, 0% utilization) in the last 10min
08:55:18 crmd: [7452]: info: crm_timer_popped: PEngine Recheck Timer (I_PE_CALC) just popped (900000ms)
08:55:18 crmd: [7452]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]
08:55:18 crmd: [7452]: info: do_state_transition: Progressed to state S_POLICY_ENGINE after C_TIMER_POPPED
08:55:19 pengine: [7451]: notice: stage6: Delaying fencing operations until there are resources to manage
08:55:19 crmd: [7452]: notice: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
08:55:19 crmd: [7452]: info: do_te_invoke: Processing graph 20579 (ref=pe_calc-dc-1408517718-30802) derived from /var/lib/pengine/pe-input-849.bz2
08:55:19 crmd: [7452]: info: te_rsc_command: Initiating action 17: cancel dlm:0_monitor_120000 on n04
08:55:19 crmd: [7452]: info: te_rsc_command: Initiating action 84: cancel dlm:0_cancel_120000 on n01 (local)
08:55:19 crmd: [7452]: info: cancel_op: No pending op found for dlm:0:194
08:55:19 lrmd: [7449]: info: on_msg_cancel_op: no operation with id 194

Interestingly, monitor[194], lastly mentioned by lrmd, was the very
first cancelled operation.

08:55:19 crmd: [7452]: info: te_rsc_command: Initiating action 50: cancel dlm:2_monitor_120000 on n02
08:55:19 crmd: [7452]: info: te_rsc_command: Initiating action 83: cancel vm-cedar_monitor_60000 on n01 (local)
08:55:19 crmd: [7452]: ERROR: lrm_get_rsc(673): failed to receive a reply message of getrsc.
08:55:19 crmd: [7452]: ERROR: lrm_get_rsc(666): failed to send a getrsc message to lrmd via ch_cmd channel.
08:55:19 crmd: [7452]: ERROR: lrm_add_rsc(870): failed to send a addrsc message to lrmd via ch_cmd channel.
08:55:19 crmd: [7452]: ERROR: lrm_get_rsc(666): failed to send a getrsc message to lrmd via ch_cmd channel.
08:55:19 crmd: [7452]: ERROR: get_lrm_resource: Could not add resource vm-cedar to LRM
08:55:19 crmd: [7452]: ERROR: do_lrm_invoke: Invalid resource definition
08:55:19 crmd: [7452]: WARN: do_lrm_invoke: bad input <create_request_adv origin="te_rsc_command" t="crmd" version="3.0.6" subt="request" reference="lrm_invoke-tengine-1408517719-30807" crm_task="lrm_invoke" crm_sys_to="lrmd" crm_sys_from="tengine" crm_host_to="n01" >
08:55:19 crmd: [7452]: WARN: do_lrm_invoke: bad input   <crm_xml >
08:55:19 crmd: [7452]: WARN: do_lrm_invoke: bad input     <rsc_op id="83" operation="cancel" operation_key="vm-cedar_monitor_60000" on_node="n01" on_node_uuid="n01" transition-key="83:20579:0:1b0a6e79-af5a-41e4-8ced-299371e7922c" >
08:55:19 crmd: [7452]: WARN: do_lrm_invoke: bad input       <primitive id="vm-cedar" long-id="vm-cedar" class="ocf" provider="niif" type="TransientDomain" />
08:55:19 crmd: [7452]: ERROR: log_data_element: Output truncated: available=727, needed=1374
08:55:19 crmd: [7452]: WARN: do_lrm_invoke: bad input       <attributes CRM_meta_call_id="195" [really very long]
08:55:19 crmd: [7452]: WARN: do_lrm_invoke: bad input     </rsc_op>
08:55:19 crmd: [7452]: WARN: do_lrm_invoke: bad input   </crm_xml>
08:55:19 crmd: [7452]: WARN: do_lrm_invoke: bad input </create_request_adv>

Blocks of messages like the above repeat a couple of times for other
resources, then crmd kicks the bucket and gets restarted:

08:55:19 corosync[6966]:   [pcmk  ] info: pcmk_ipc_exit: Client crmd (conn=0x1dc6ea0, async-conn=0x1dc6ea0) left
08:55:19 pacemakerd: [7443]: WARN: Managed crmd process 7452 killed by signal 11 [SIGSEGV - Segmentation violation].
08:55:19 pacemakerd: [7443]: notice: pcmk_child_exit: Child process crmd terminated with signal 11 (pid=7452, rc=0)
08:55:19 pacemakerd: [7443]: notice: pcmk_child_exit: Respawning failed child process: crmd
08:55:19 pacemakerd: [7443]: info: start_child: Forked child 13794 for process crmd
08:55:19 corosync[6966]:   [pcmk  ] WARN: route_ais_message: Sending message to local.crmd failed: ipc delivery failed (rc=-2)
08:55:19 crmd: [13794]: info: Invoked: /usr/lib/pacemaker/crmd 

Anyway, no further logs from lrmd after this point until hours later I
rebooted the machine:

14:37:06 pacemakerd: [7443]: notice: stop_child: Stopping lrmd: Sent -15 to process 7449
14:37:06 lrmd: [7449]: info: lrmd is shutting down
14:37:06 pacemakerd: [7443]: info: pcmk_child_exit: Child process lrmd exited (pid=7449, rc=0)

So lrmd was alive all the time.

> Which created a condition in the crmd that it couldn't handle so it
> crashed too.

Maybe their connection got severed somehow.

>> However, it got restarted seamlessly, without the node being fenced, so
>> I did not even notice this until now.  Should this have resulted in the
>> node being fenced?
>
> Depends how fast the node can respawn.

You mean how fast crmd can respawn?  How much time does it have to
respawn to avoid being fenced?

>> crmd: [13794]: ERROR: verify_stopped: Resource vm-web5 was active at shutdown.  You may ignore this error if it is unmanaged.
>
> In maintenance mode, everything is unmanaged. So that would be expected.

Is maintenance mode the same as unmanaging all resources?  I think the
latter does not cancel the monitor operations here...

>>> The discovery usually happens at the point the cluster is started on
>>> a node.
>> 
>> A local discovery did happen, but it could not find anything, as the
>> cluster was started by the init scripts, well before any resource could
>> have been moved to the freshly rebooted node (manually, to free the next
>> node for rebooting).
>
> Thats your problem then, you've started resources outside of the
> control of the cluster.

Some of them, yes, and moved the rest between the nodes.  All this
circumventing the cluster.

> Two options... recurring monitor actions with role=Stopped would have
> caught this

Even in maintenance mode?  Wouldn't they have been cancelled just like
the ordinary recurring monitor actions?

I guess adding them would run a recurring monitor operation for every
resource on every node, only with different expectations, right?

> or you can run crm_resource --cleanup after you've moved resources around.

I actually ran some crm resource cleanups for a couple of resources, and
those really were not started on exiting maintenance mode.

>>> Maintenance mode just prevents the cluster from doing anything about it.
>> 
>> Fine.  So I should have restarted Pacemaker on each node before leaving
>> maintenance mode, right?  Or is there a better way?
>
> See above

So crm_resource -r whatever -C is the way, for each resource separately.
Is there no way to do this for all resources at once?

>> You say in the above thread that resource definitions can be changed:
>> http://thread.gmane.org/gmane.linux.highavailability.user/39121/focus=39437
>> Let me quote from there (starting with the words of Ulrich Windl):
>> 
>>>>>> I think it's a common misconception that you can modify cluster
>>>>>> resources while in maintenance mode:
>>>>> 
>>>>> No, you _should_ be able to.  If that's not the case, its a bug.
>>>> 
>>>> So the end of maintenance mode starts with a "re-probe"?
>>> 
>>> No, but it doesn't need to.  
>>> The policy engine already knows if the resource definitions changed
>>> and the recurring monitor ops will find out if any are not running.
>> 
>> My experiences show that you may not *move around* resources while in
>> maintenance mode.
>
> Correct
>
>> That would indeed require a cluster-wide re-probe, which does not
>> seem to happen (unless forced some way).  Probably there was some
>> misunderstanding in the above discussion, I guess Ulrich meant moving
>> resources when he wrote "modifying cluster resources".  Does this
>> make sense?
>
> No, I've reasonably sure he meant changing their definitions in the cib.
> Or at least thats what I thought he meant at the time.

Nobody could blame you for that, because that's what it means.  But then
he inquired about a "re-probe", which fits more the problem of changing
the status of resources, not their definition.  Actually, I was so
firmly stuck in this mind set, that first I wanted to ask you to
reconsider, your response felt so much out of place.  That's all about
history for now...

After all this, I suggest to clarify this issue in the fine manual.
I've read it a couple of times, and still got the wrong impression.
-- 
Regards,
Feri.



From wferi at niif.hu  Wed Aug 27 18:56:32 2014
From: wferi at niif.hu (Ferenc Wagner)
Date: Wed, 27 Aug 2014 20:56:32 +0200
Subject: [Linux-cluster] locating a starting resource
In-Reply-To: <EBFD09CA-65DD-4759-B7F3-B73039FC6B3F@beekhof.net> (Andrew
	Beekhof's message of "Wed, 27 Aug 2014 09:46:18 +1000")
References: <871ts39e5c.fsf@lant.ki.iif.hu>
	<EBFD09CA-65DD-4759-B7F3-B73039FC6B3F@beekhof.net>
Message-ID: <877g1t7odb.fsf@lant.ki.iif.hu>

Andrew Beekhof <andrew at beekhof.net> writes:

> On 27 Aug 2014, at 6:42 am, Ferenc Wagner <wferi at niif.hu> wrote:
>
>> crm_resource --locate finds the hosting node of a running (successfully
>> started) resource just fine.  Is there a way to similarly find out the
>> location of a resource *being* started, ie. whose resource agent is
>> already running the start action, but that action is not finished yet?
>
> You need to set record-pending=true in the op_defaults section.
> For some reason this is not yet documented :-/
>
> With this in place, crm_resource will find the correct location

I set it in a single start operation, and it works as advertised,
thanks!  At first I was suprised to see "Started" in the crm status
output while the resource was only starting, but the added order
constraint worked as expected, ie. the dependent resource started only
after the start action finished successfully.  This begs a bonus
question: how do I tell apart starting resources with record-pendig=true
and started resources?  crm_resource --locate does not help either.
-- 
Thanks,
Feri.



From andrew at beekhof.net  Wed Aug 27 22:57:26 2014
From: andrew at beekhof.net (Andrew Beekhof)
Date: Thu, 28 Aug 2014 08:57:26 +1000
Subject: [Linux-cluster] on exiting maintenance mode
In-Reply-To: <87iold7tba.fsf@lant.ki.iif.hu>
References: <8761hlickn.fsf@lant.ki.iif.hu>
	<67506B71-8594-4C16-82C1-F94779F59826@beekhof.net>
	<87iolf9mkc.fsf@lant.ki.iif.hu>
	<F1686CC9-7920-4B72-981D-D6057EC2B754@beekhof.net>
	<87iold7tba.fsf@lant.ki.iif.hu>
Message-ID: <749665C4-F970-4C43-9228-BCFD2EE1B442@beekhof.net>


On 28 Aug 2014, at 3:09 am, Ferenc Wagner <wferi at niif.hu> wrote:

> Andrew Beekhof <andrew at beekhof.net> writes:
> 
>> On 27 Aug 2014, at 3:40 am, Ferenc Wagner <wferi at niif.hu> wrote:
> 
>>> However, it got restarted seamlessly, without the node being fenced, so
>>> I did not even notice this until now.  Should this have resulted in the
>>> node being fenced?
>> 
>> Depends how fast the node can respawn.
> 
> You mean how fast crmd can respawn?  How much time does it have to
> respawn to avoid being fenced?

Until a new node can be elected DC, invoke the policy engine and start fencing.

> 
>>> crmd: [13794]: ERROR: verify_stopped: Resource vm-web5 was active at shutdown.  You may ignore this error if it is unmanaged.
>> 
>> In maintenance mode, everything is unmanaged. So that would be expected.
> 
> Is maintenance mode the same as unmanaging all resources?  I think the
> latter does not cancel the monitor operations here...

Right. One cancels monitor operations too.

> 
>>>> The discovery usually happens at the point the cluster is started on
>>>> a node.
>>> 
>>> A local discovery did happen, but it could not find anything, as the
>>> cluster was started by the init scripts, well before any resource could
>>> have been moved to the freshly rebooted node (manually, to free the next
>>> node for rebooting).
>> 
>> Thats your problem then, you've started resources outside of the
>> control of the cluster.
> 
> Some of them, yes, and moved the rest between the nodes.  All this
> circumventing the cluster.
> 
>> Two options... recurring monitor actions with role=Stopped would have
>> caught this
> 
> Even in maintenance mode?  Wouldn't they have been cancelled just like
> the ordinary recurring monitor actions?

Good point. Perhaps they wouldn't.

> 
> I guess adding them would run a recurring monitor operation for every
> resource on every node, only with different expectations, right?
> 
>> or you can run crm_resource --cleanup after you've moved resources around.
> 
> I actually ran some crm resource cleanups for a couple of resources, and
> those really were not started on exiting maintenance mode.
> 
>>>> Maintenance mode just prevents the cluster from doing anything about it.
>>> 
>>> Fine.  So I should have restarted Pacemaker on each node before leaving
>>> maintenance mode, right?  Or is there a better way?
>> 
>> See above
> 
> So crm_resource -r whatever -C is the way, for each resource separately.
> Is there no way to do this for all resources at once?

I think you can just drop the -r

> 
>>> You say in the above thread that resource definitions can be changed:
>>> http://thread.gmane.org/gmane.linux.highavailability.user/39121/focus=39437
>>> Let me quote from there (starting with the words of Ulrich Windl):
>>> 
>>>>>>> I think it's a common misconception that you can modify cluster
>>>>>>> resources while in maintenance mode:
>>>>>> 
>>>>>> No, you _should_ be able to.  If that's not the case, its a bug.
>>>>> 
>>>>> So the end of maintenance mode starts with a "re-probe"?
>>>> 
>>>> No, but it doesn't need to.  
>>>> The policy engine already knows if the resource definitions changed
>>>> and the recurring monitor ops will find out if any are not running.
>>> 
>>> My experiences show that you may not *move around* resources while in
>>> maintenance mode.
>> 
>> Correct
>> 
>>> That would indeed require a cluster-wide re-probe, which does not
>>> seem to happen (unless forced some way).  Probably there was some
>>> misunderstanding in the above discussion, I guess Ulrich meant moving
>>> resources when he wrote "modifying cluster resources".  Does this
>>> make sense?
>> 
>> No, I've reasonably sure he meant changing their definitions in the cib.
>> Or at least thats what I thought he meant at the time.
> 
> Nobody could blame you for that, because that's what it means.  But then
> he inquired about a "re-probe", which fits more the problem of changing
> the status of resources, not their definition.  Actually, I was so
> firmly stuck in this mind set, that first I wanted to ask you to
> reconsider, your response felt so much out of place.  That's all about
> history for now...
> 
> After all this, I suggest to clarify this issue in the fine manual.
> I've read it a couple of times, and still got the wrong impression.

Which specific section do you suggest?
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 841 bytes
Desc: Message signed with OpenPGP using GPGMail
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20140828/055089de/attachment.sig>

From andrew at beekhof.net  Wed Aug 27 22:58:23 2014
From: andrew at beekhof.net (Andrew Beekhof)
Date: Thu, 28 Aug 2014 08:58:23 +1000
Subject: [Linux-cluster] locating a starting resource
In-Reply-To: <877g1t7odb.fsf@lant.ki.iif.hu>
References: <871ts39e5c.fsf@lant.ki.iif.hu>
	<EBFD09CA-65DD-4759-B7F3-B73039FC6B3F@beekhof.net>
	<877g1t7odb.fsf@lant.ki.iif.hu>
Message-ID: <9D0DD413-AB20-4F25-ADF5-02D8471EAA18@beekhof.net>


On 28 Aug 2014, at 4:56 am, Ferenc Wagner <wferi at niif.hu> wrote:

> Andrew Beekhof <andrew at beekhof.net> writes:
> 
>> On 27 Aug 2014, at 6:42 am, Ferenc Wagner <wferi at niif.hu> wrote:
>> 
>>> crm_resource --locate finds the hosting node of a running (successfully
>>> started) resource just fine.  Is there a way to similarly find out the
>>> location of a resource *being* started, ie. whose resource agent is
>>> already running the start action, but that action is not finished yet?
>> 
>> You need to set record-pending=true in the op_defaults section.
>> For some reason this is not yet documented :-/
>> 
>> With this in place, crm_resource will find the correct location
> 
> I set it in a single start operation, and it works as advertised,
> thanks!  At first I was suprised to see "Started" in the crm status
> output while the resource was only starting, but the added order
> constraint worked as expected, ie. the dependent resource started only
> after the start action finished successfully.  This begs a bonus
> question: how do I tell apart starting resources with record-pendig=true
> and started resources?

I'm reasonably sure we don't expose that via crm_resource.
Seems like a reasonable thing to do though.

crm_mon /might/ show pending though.


>  crm_resource --locate does not help either.
> -- 
> Thanks,
> Feri.
> 
> -- 
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 841 bytes
Desc: Message signed with OpenPGP using GPGMail
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20140828/5eb50fb0/attachment.sig>

From neale at sinenomine.net  Thu Aug 28 19:11:24 2014
From: neale at sinenomine.net (Neale Ferguson)
Date: Thu, 28 Aug 2014 19:11:24 +0000
Subject: [Linux-cluster] Delaying fencing during shutdown
Message-ID: <0E22A6F6-977A-4E58-A0ED-9D596D6B1A20@sinenomine.net>

Hi,
 In a two node cluster I shutdown one of the nodes and the other node notices the shutdown but on rare occasions that node will then fence the node that is shutting down. I assume Is this a situation where setting post_fail_delay would be useful or setting the totem timeout to something higher than its default.

Neale



From wferi at niif.hu  Thu Aug 28 23:00:24 2014
From: wferi at niif.hu (Ferenc Wagner)
Date: Fri, 29 Aug 2014 01:00:24 +0200
Subject: [Linux-cluster] locating a starting resource
In-Reply-To: <9D0DD413-AB20-4F25-ADF5-02D8471EAA18@beekhof.net> (Andrew
	Beekhof's message of "Thu, 28 Aug 2014 08:58:23 +1000")
References: <871ts39e5c.fsf@lant.ki.iif.hu>
	<EBFD09CA-65DD-4759-B7F3-B73039FC6B3F@beekhof.net>
	<877g1t7odb.fsf@lant.ki.iif.hu>
	<9D0DD413-AB20-4F25-ADF5-02D8471EAA18@beekhof.net>
Message-ID: <87sikgb4on.fsf@lant.ki.iif.hu>

Andrew Beekhof <andrew at beekhof.net> writes:

> On 28 Aug 2014, at 4:56 am, Ferenc Wagner <wferi at niif.hu> wrote:
>
>> Andrew Beekhof <andrew at beekhof.net> writes:
>> 
>>> On 27 Aug 2014, at 6:42 am, Ferenc Wagner <wferi at niif.hu> wrote:
>>> 
>>>> crm_resource --locate finds the hosting node of a running (successfully
>>>> started) resource just fine.  Is there a way to similarly find out the
>>>> location of a resource *being* started, ie. whose resource agent is
>>>> already running the start action, but that action is not finished yet?
>>> 
>>> You need to set record-pending=true in the op_defaults section.
>>> For some reason this is not yet documented :-/
>>> 
>>> With this in place, crm_resource will find the correct location
>> 
>> I set it in a single start operation, and it works as advertised,
>> thanks!  At first I was suprised to see "Started" in the crm status
>> output while the resource was only starting, but the added order
>> constraint worked as expected, ie. the dependent resource started only
>> after the start action finished successfully.  This begs a bonus
>> question: how do I tell apart starting resources with record-pendig=true
>> and started resources?
>
> I'm reasonably sure we don't expose that via crm_resource.
> Seems like a reasonable thing to do though.
>
> crm_mon /might/ show pending though.

Version 1.1.7 does not.  Looks like call-id="-1" signs the pending
operations of an <lrm_resource>, so pulling this info out of the CIB
is not too complicated.
-- 
Regards,
Feri.



From wferi at niif.hu  Fri Aug 29 00:54:57 2014
From: wferi at niif.hu (Ferenc Wagner)
Date: Fri, 29 Aug 2014 02:54:57 +0200
Subject: [Linux-cluster] on exiting maintenance mode
In-Reply-To: <749665C4-F970-4C43-9228-BCFD2EE1B442@beekhof.net> (Andrew
	Beekhof's message of "Thu, 28 Aug 2014 08:57:26 +1000")
References: <8761hlickn.fsf@lant.ki.iif.hu>
	<67506B71-8594-4C16-82C1-F94779F59826@beekhof.net>
	<87iolf9mkc.fsf@lant.ki.iif.hu>
	<F1686CC9-7920-4B72-981D-D6057EC2B754@beekhof.net>
	<87iold7tba.fsf@lant.ki.iif.hu>
	<749665C4-F970-4C43-9228-BCFD2EE1B442@beekhof.net>
Message-ID: <87oav4azdq.fsf@lant.ki.iif.hu>

Andrew Beekhof <andrew at beekhof.net> writes:

> On 28 Aug 2014, at 3:09 am, Ferenc Wagner <wferi at niif.hu> wrote:
>
>> So crm_resource -r whatever -C is the way, for each resource separately.
>> Is there no way to do this for all resources at once?
>
> I think you can just drop the -r

Unfortunately, that does not work under version 1.1.7:

$ sudo crm_resource -C
Error performing operation: The object/attribute does not exist

>> Andrew Beekhof <andrew at beekhof.net> writes:
>> 
>>> On 27 Aug 2014, at 3:40 am, Ferenc Wagner <wferi at niif.hu> wrote:
>>> 
>>>> My experiences show that you may not *move around* resources while in
>>>> maintenance mode.
>>> 
>>> Correct
>>> 
>>>> That would indeed require a cluster-wide re-probe, which does not
>>>> seem to happen (unless forced some way).
>> 
>> After all this, I suggest to clarify this issue in the fine manual.
>> I've read it a couple of times, and still got the wrong impression.
>
> Which specific section do you suggest?

5.7.1. Monitoring Resources for Failure

Some points worth adding/emphasizing would be:
1. documentation of the role property (role=Master is mentioned later,
   but role=Stopped never)
2. In maintenance mode, monitor operations don't run
3. If management of a resource is switched off, its role=Started monitor
   operation continues running until failure, then the role=Stopped
   kicks in (I'm guessing here; also, what about the other nodes?)
4. When management is enabled again, no re-probe happens, the cluster
   expects the last state and location to be still valid
5. so don't even move unmanaged resources
6. unless you started a resource somewhere before starting the cluster
   on that node, or you cleaned up the resource
7. same is true for maintenance mode, but for all resources.

I have to agree that most of this is evident once you know it.
Unfortunately, it's also easy to get wrong while learning the ropes.
For example, hastexo has some good information online:
http://www.hastexo.com/resources/hints-and-kinks/maintenance-active-pacemaker-clusters
But from the sentence "in maintenance mode, you can stop or restart
cluster resources at will" I still miss the constraint of not moving the
resource between the nodes.  Also, setting enabled="false" works funny,
it did not get rid of the monitor operation before I set the resource to
managed, and deleting the setting or changing it to true did bring it
back.  I had to restart the resource to have monitor ops again.  Why?
-- 
Thanks,
Feri.



From wferi at niif.hu  Fri Aug 29 00:57:50 2014
From: wferi at niif.hu (Ferenc Wagner)
Date: Fri, 29 Aug 2014 02:57:50 +0200
Subject: [Linux-cluster] locating a starting resource
In-Reply-To: <9D0DD413-AB20-4F25-ADF5-02D8471EAA18@beekhof.net> (Andrew
	Beekhof's message of "Thu, 28 Aug 2014 08:58:23 +1000")
References: <871ts39e5c.fsf@lant.ki.iif.hu>
	<EBFD09CA-65DD-4759-B7F3-B73039FC6B3F@beekhof.net>
	<877g1t7odb.fsf@lant.ki.iif.hu>
	<9D0DD413-AB20-4F25-ADF5-02D8471EAA18@beekhof.net>
Message-ID: <87ha0waz8x.fsf@lant.ki.iif.hu>

Andrew Beekhof <andrew at beekhof.net> writes:

> On 28 Aug 2014, at 4:56 am, Ferenc Wagner <wferi at niif.hu> wrote:
>
>> Andrew Beekhof <andrew at beekhof.net> writes:
>> 
>>> On 27 Aug 2014, at 6:42 am, Ferenc Wagner <wferi at niif.hu> wrote:
>>> 
>>>> crm_resource --locate finds the hosting node of a running (successfully
>>>> started) resource just fine.  Is there a way to similarly find out the
>>>> location of a resource *being* started, ie. whose resource agent is
>>>> already running the start action, but that action is not finished yet?
>>> 
>>> You need to set record-pending=true in the op_defaults section.
>>> For some reason this is not yet documented :-/
>>> 
>>> With this in place, crm_resource will find the correct location
>> 
>> I set it in a single start operation, and it works as advertised,
>> thanks!  At first I was suprised to see "Started" in the crm status
>> output while the resource was only starting, but the added order
>> constraint worked as expected, ie. the dependent resource started only
>> after the start action finished successfully.  This begs a bonus
>> question: how do I tell apart starting resources with record-pendig=true
>> and started resources?
>
> I'm reasonably sure we don't expose that via crm_resource.
> Seems like a reasonable thing to do though.

crm_resource -O outputs lines like this:
[...] Started : vm-elm_start_0 (node=lant, call=-1, rc=14): pending
which seems good enough for now.
-- 
Thanks,
Feri.



From andrew at beekhof.net  Fri Aug 29 02:31:36 2014
From: andrew at beekhof.net (Andrew Beekhof)
Date: Fri, 29 Aug 2014 12:31:36 +1000
Subject: [Linux-cluster] locating a starting resource
In-Reply-To: <87ha0waz8x.fsf@lant.ki.iif.hu>
References: <871ts39e5c.fsf@lant.ki.iif.hu>
	<EBFD09CA-65DD-4759-B7F3-B73039FC6B3F@beekhof.net>
	<877g1t7odb.fsf@lant.ki.iif.hu>
	<9D0DD413-AB20-4F25-ADF5-02D8471EAA18@beekhof.net>
	<87ha0waz8x.fsf@lant.ki.iif.hu>
Message-ID: <77CDE52A-401F-4851-ABFB-3A643F9913CD@beekhof.net>


On 29 Aug 2014, at 10:57 am, Ferenc Wagner <wferi at niif.hu> wrote:

> Andrew Beekhof <andrew at beekhof.net> writes:
> 
>> On 28 Aug 2014, at 4:56 am, Ferenc Wagner <wferi at niif.hu> wrote:
>> 
>>> Andrew Beekhof <andrew at beekhof.net> writes:
>>> 
>>>> On 27 Aug 2014, at 6:42 am, Ferenc Wagner <wferi at niif.hu> wrote:
>>>> 
>>>>> crm_resource --locate finds the hosting node of a running (successfully
>>>>> started) resource just fine.  Is there a way to similarly find out the
>>>>> location of a resource *being* started, ie. whose resource agent is
>>>>> already running the start action, but that action is not finished yet?
>>>> 
>>>> You need to set record-pending=true in the op_defaults section.
>>>> For some reason this is not yet documented :-/
>>>> 
>>>> With this in place, crm_resource will find the correct location
>>> 
>>> I set it in a single start operation, and it works as advertised,
>>> thanks!  At first I was suprised to see "Started" in the crm status
>>> output while the resource was only starting, but the added order
>>> constraint worked as expected, ie. the dependent resource started only
>>> after the start action finished successfully.  This begs a bonus
>>> question: how do I tell apart starting resources with record-pendig=true
>>> and started resources?
>> 
>> I'm reasonably sure we don't expose that via crm_resource.
>> Seems like a reasonable thing to do though.
> 
> crm_resource -O outputs lines like this:
> [...] Started : vm-elm_start_0 (node=lant, call=-1, rc=14): pending
> which seems good enough for now.
> -- 

More recent versions also have:

       -j, --pending
              Display pending state if 'record-pending' is enabled


-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 841 bytes
Desc: Message signed with OpenPGP using GPGMail
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20140829/0ed8c128/attachment.sig>

From andrew at beekhof.net  Fri Aug 29 02:32:50 2014
From: andrew at beekhof.net (Andrew Beekhof)
Date: Fri, 29 Aug 2014 12:32:50 +1000
Subject: [Linux-cluster] on exiting maintenance mode
In-Reply-To: <87oav4azdq.fsf@lant.ki.iif.hu>
References: <8761hlickn.fsf@lant.ki.iif.hu>
	<67506B71-8594-4C16-82C1-F94779F59826@beekhof.net>
	<87iolf9mkc.fsf@lant.ki.iif.hu>
	<F1686CC9-7920-4B72-981D-D6057EC2B754@beekhof.net>
	<87iold7tba.fsf@lant.ki.iif.hu>
	<749665C4-F970-4C43-9228-BCFD2EE1B442@beekhof.net>
	<87oav4azdq.fsf@lant.ki.iif.hu>
Message-ID: <B9EF793F-8CCC-4AA7-82C9-CFDD64EC9537@beekhof.net>


On 29 Aug 2014, at 10:54 am, Ferenc Wagner <wferi at niif.hu> wrote:

> Andrew Beekhof <andrew at beekhof.net> writes:
> 
>> On 28 Aug 2014, at 3:09 am, Ferenc Wagner <wferi at niif.hu> wrote:
>> 
>>> So crm_resource -r whatever -C is the way, for each resource separately.
>>> Is there no way to do this for all resources at once?
>> 
>> I think you can just drop the -r
> 
> Unfortunately, that does not work under version 1.1.7:

You know what I'm going to say here right?

> 
> $ sudo crm_resource -C
> Error performing operation: The object/attribute does not exist
> 
>>> Andrew Beekhof <andrew at beekhof.net> writes:
>>> 
>>>> On 27 Aug 2014, at 3:40 am, Ferenc Wagner <wferi at niif.hu> wrote:
>>>> 
>>>>> My experiences show that you may not *move around* resources while in
>>>>> maintenance mode.
>>>> 
>>>> Correct
>>>> 
>>>>> That would indeed require a cluster-wide re-probe, which does not
>>>>> seem to happen (unless forced some way).
>>> 
>>> After all this, I suggest to clarify this issue in the fine manual.
>>> I've read it a couple of times, and still got the wrong impression.
>> 
>> Which specific section do you suggest?
> 
> 5.7.1. Monitoring Resources for Failure

Ok, I'll endeavour to improve that section :)

> 
> Some points worth adding/emphasizing would be:
> 1. documentation of the role property (role=Master is mentioned later,
>   but role=Stopped never)
> 2. In maintenance mode, monitor operations don't run
> 3. If management of a resource is switched off, its role=Started monitor
>   operation continues running until failure, then the role=Stopped
>   kicks in (I'm guessing here; also, what about the other nodes?)
> 4. When management is enabled again, no re-probe happens, the cluster
>   expects the last state and location to be still valid
> 5. so don't even move unmanaged resources
> 6. unless you started a resource somewhere before starting the cluster
>   on that node, or you cleaned up the resource
> 7. same is true for maintenance mode, but for all resources.
> 
> I have to agree that most of this is evident once you know it.
> Unfortunately, it's also easy to get wrong while learning the ropes.
> For example, hastexo has some good information online:
> http://www.hastexo.com/resources/hints-and-kinks/maintenance-active-pacemaker-clusters
> But from the sentence "in maintenance mode, you can stop or restart
> cluster resources at will" I still miss the constraint of not moving the
> resource between the nodes.  Also, setting enabled="false" works funny,
> it did not get rid of the monitor operation before I set the resource to
> managed, and deleting the setting or changing it to true did bring it
> back.  I had to restart the resource to have monitor ops again.  Why?
> -- 
> Thanks,
> Feri.
> 
> -- 
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 841 bytes
Desc: Message signed with OpenPGP using GPGMail
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20140829/359fb687/attachment.sig>

From manish631 at rediffmail.com  Sat Aug 30 14:12:42 2014
From: manish631 at rediffmail.com (manish vaidya)
Date: 30 Aug 2014 14:12:42 -0000
Subject: [Linux-cluster] =?utf-8?q?Please_help_me_on_cluster_error?=
Message-ID: <20140830141242.4308.qmail@f5mail-224-126.rediffmail.com>

i created four node cluster in kvm enviorment But i faced error when create new pv such as pvcreate /dev/sdb1
got error , lock from node 2 & lock from node3

also strange cluster logs

Jun 10 14:46:24 node1 corosync[3266]: [TOTEM ] Retransmit List: 5e

    Jun 10 14:46:24 node1 corosync[3266]: [TOTEM ] Retransmit List: 5e
    5f
    Jun 10 14:46:24 node1 corosync[3266]: [TOTEM ] Retransmit List: 5f
    60
    Jun 10 14:46:24 node1 corosync[3266]: [TOTEM ] Retransmit List: 61
    Jun 10 14:46:24 node1 corosync[3266]: [TOTEM ] Retransmit List: 63
    64
    Jun 10 14:46:24 node1 corosync[3266]: [TOTEM ] Retransmit List: 69
    6a
    Jun 10 14:46:24 node1 corosync[3266]: [TOTEM ] Retransmit List: 78
    Jun 10 14:46:24 node1 corosync[3266]: [TOTEM ] Retransmit List: 84
    85
    Jun 10 14:46:24 node1 corosync[3266]: [TOTEM ] Retransmit List: 9a
    9b


Please help me on this issue
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20140830/9fa36bfc/attachment.htm>

From emi2fast at gmail.com  Sat Aug 30 14:53:08 2014
From: emi2fast at gmail.com (emmanuel segura)
Date: Sat, 30 Aug 2014 16:53:08 +0200
Subject: [Linux-cluster] Please help me on cluster error
In-Reply-To: <20140830141242.4308.qmail@f5mail-224-126.rediffmail.com>
References: <20140830141242.4308.qmail@f5mail-224-126.rediffmail.com>
Message-ID: <CAE7pJ3AKUvW1ia3MAPWjE1Mrwfu10F=dCt+E48nicSJDjbJhrA@mail.gmail.com>

are you using clvmd? if your answer is = yes, you need to be sure, you pv
is visibile to your cluster nodes


2014-08-30 16:12 GMT+02:00 manish vaidya <manish631 at rediffmail.com>:

> i created four node cluster in kvm enviorment But i faced error when
> create new pv such as pvcreate /dev/sdb1
> got error , lock from node 2 & lock from node3
>
> also strange cluster logs
>
> Jun 10 14:46:24 node1 corosync[3266]: [TOTEM ] Retransmit List: 5e
>
> Jun 10 14:46:24 node1 corosync[3266]: [TOTEM ] Retransmit List: 5e
> 5f
> Jun 10 14:46:24 node1 corosync[3266]: [TOTEM ] Retransmit List: 5f
> 60
> Jun 10 14:46:24 node1 corosync[3266]: [TOTEM ] Retransmit List: 61
> Jun 10 14:46:24 node1 corosync[3266]: [TOTEM ] Retransmit List: 63
> 64
> Jun 10 14:46:24 node1 corosync[3266]: [TOTEM ] Retransmit List: 69
> 6a
> Jun 10 14:46:24 node1 corosync[3266]: [TOTEM ] Retransmit List: 78
> Jun 10 14:46:24 node1 corosync[3266]: [TOTEM ] Retransmit List: 84
> 85
> Jun 10 14:46:24 node1 corosync[3266]: [TOTEM ] Retransmit List: 9a
> 9b
>
>
> Please help me on this issue
>
> <http://sigads.rediff.com/RealMedia/ads/click_nx.ads/www.rediffmail.com/signatureline.htm at Middle?>
> Get your own *FREE* website, *FREE* domain & *FREE* mobile app with
> Company email.
> *Know More >*
> <http://track.rediff.com/click?url=___http://businessemail.rediff.com/email-ids-for-companies-with-less-than-50-employees?sc_cid=sign-1-10-13___&cmp=host&lnk=sign-1-10-13&nsrv1=host>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>



-- 
esta es mi vida e me la vivo hasta que dios quiera
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20140830/a26e7d1d/attachment.htm>

From lists at alteeve.ca  Sat Aug 30 16:35:52 2014
From: lists at alteeve.ca (Digimer)
Date: Sat, 30 Aug 2014 12:35:52 -0400
Subject: [Linux-cluster] Please help me on cluster error
In-Reply-To: <20140830141242.4308.qmail@f5mail-224-126.rediffmail.com>
References: <20140830141242.4308.qmail@f5mail-224-126.rediffmail.com>
Message-ID: <5401FD68.8000407@alteeve.ca>

Can you share your cluster information please?

This could be a network problem, as the messages below happen when the 
network between the nodes isn't fast enough or has too long latency and 
cluster traffic is considered lost and re-requested.

If you don't have fencing working properly, and if a network issue 
caused a node to be declared lost, clustered LVM (and anything else 
using cluster locking) will fail (by design).

If you share your configuration and more of your logs, it will help us 
understand what is happening. Please also tell us what version of the 
cluster software you're using.

digimer

On 30/08/14 10:12 AM, manish vaidya wrote:
> i created four node cluster in kvm enviorment But i faced error when
> create new pv such as pvcreate /dev/sdb1
> got error , lock from node 2 & lock from node3
>
> also strange cluster logs
>
> Jun 10 14:46:24 node1 corosync[3266]: [TOTEM ] Retransmit List: 5e
>
> Jun 10 14:46:24 node1 corosync[3266]: [TOTEM ] Retransmit List: 5e
> 5f
> Jun 10 14:46:24 node1 corosync[3266]: [TOTEM ] Retransmit List: 5f
> 60
> Jun 10 14:46:24 node1 corosync[3266]: [TOTEM ] Retransmit List: 61
> Jun 10 14:46:24 node1 corosync[3266]: [TOTEM ] Retransmit List: 63
> 64
> Jun 10 14:46:24 node1 corosync[3266]: [TOTEM ] Retransmit List: 69
> 6a
> Jun 10 14:46:24 node1 corosync[3266]: [TOTEM ] Retransmit List: 78
> Jun 10 14:46:24 node1 corosync[3266]: [TOTEM ] Retransmit List: 84
> 85
> Jun 10 14:46:24 node1 corosync[3266]: [TOTEM ] Retransmit List: 9a
> 9b
>
>
> Please help me on this issue
> <http://sigads.rediff.com/RealMedia/ads/click_nx.ads/www.rediffmail.com/signatureline.htm at Middle?>
>
> Get your own *FREE* website, *FREE* domain & *FREE* mobile app with
> Company email.
> 	*Know More >*
> <http://track.rediff.com/click?url=___http://businessemail.rediff.com/email-ids-for-companies-with-less-than-50-employees?sc_cid=sign-1-10-13___&cmp=host&lnk=sign-1-10-13&nsrv1=host>
>
>
>


-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?