[Linux-cluster] F_SETLK fails after recovery

Mon Sep 8 17:02:28 UTC 2014

Will do. I'm struggling to understand the mechanics of checkpointing. When I call saCkptCheckpointOpen etc., what entities are they dealing with? Is this information centralized on the master and disseminated to the other members? Does the information reside only in memory, or is it written anywhere? I suppose what I'm asking is whether there is documentation on openais internals that would explain this to me, so that I don't have to keep asking naive and repetitive questions.

Also, would setting:

	<logging>
		<logging_daemon debug="on" logfile="/var/log/cluster/checkpoint.log" logfile_priority="debug" name="corosync" subsys="CKPT"/>
	</logging>

in cluster.conf capture what I need to help track this down, or are additional entries required in the <logging> section?

Thanks so much for taking the time to respond... Neale

On Sep 8, 2014, at 12:15 PM, David Teigland <teigland at redhat.com> wrote:

> On Mon, Sep 08, 2014 at 03:35:05PM +0000, Neale Ferguson wrote:
> 
> The checkpoint data is sent to corosync/openais, which is responsible for
> syncing that data to the other nodes, which should then be able to open
> and read it.  You'll also want to look for corosync/openais errors related
> to checkpoints.
> 
>> Also, when I try to imitate the situation by holding a R/W lock and then
>> causing that node to restart without shutting down (and releasing the
>> lock), the other node purges the lock when it detects the failing node
>> has disappeared. I don't understand why the locks reported in the
>> previous mail aren't purged as well.
> 
> The problem is almost certainly with the operation of the checkpoints, not
> with the locking.
>
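
For anyone following the thread, the kind of lock test described in the quoted text (one process holds a write lock while another probes it) can be sketched roughly as below. This is only a single-node illustration using plain POSIX fcntl() calls, not the actual test program from this thread; the helper names try_write_lock, write_lock_holder and demo_lock_conflict are made up for the example, and the path passed in is assumed to be any writable scratch file. It says nothing about how the cluster stack replays lock state after recovery.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: try to take a non-blocking write lock on the whole
 * file. Returns 0 on success, -1 on failure (errno is EACCES or EAGAIN when
 * another process holds a conflicting lock). */
static int try_write_lock(int fd)
{
    struct flock fl = {0};
    fl.l_type = F_WRLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;               /* 0 means "through end of file" */
    return fcntl(fd, F_SETLK, &fl);
}

/* Hypothetical helper: ask which process would block a write lock.
 * Returns the holder's pid, 0 if nothing conflicts, -1 on error. */
static pid_t write_lock_holder(int fd)
{
    struct flock fl = {0};
    fl.l_type = F_WRLCK;
    fl.l_whence = SEEK_SET;
    if (fcntl(fd, F_GETLK, &fl) == -1)
        return -1;
    return fl.l_type == F_UNLCK ? 0 : fl.l_pid;
}

/* Parent takes the lock, then a forked child (a separate lock owner, since
 * fcntl locks are not inherited across fork) checks that F_SETLK fails and
 * that F_GETLK names the parent. Returns 0 if the child saw both. */
static int demo_lock_conflict(const char *path)
{
    assert(path != NULL);
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd == -1)
        return -1;
    if (try_write_lock(fd) == -1) {
        close(fd);
        return -1;
    }

    pid_t parent = getpid();
    pid_t child = fork();
    if (child == -1) {
        close(fd);
        return -1;
    }
    if (child == 0) {
        int cfd = open(path, O_RDWR);
        int ok = cfd != -1 &&
                 try_write_lock(cfd) == -1 &&
                 write_lock_holder(cfd) == parent;
        _exit(ok ? 0 : 1);
    }

    int status = 0;
    waitpid(child, &status, 0);
    close(fd);                  /* closing the descriptor drops the lock */
    unlink(path);
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : 1;
}
```

Calling demo_lock_conflict() with a scratch path should return 0 on a local filesystem. To approximate the failure case discussed above, one would presumably run the lock holder on one node and the probe on another, then kill the holder's node without releasing the lock.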