[Linux-cluster] NTP time steps causes cluster reconfiguration

Fri Jul 16 14:36:17 UTC 2010

Hi,
 I can confirm that time steps do cause reconfiguration. I am not sure
this was the reason, but one of my nodes used to be fenced from time to
time after several reconfigurations, and it also caused some problems with
GFS being withdrawn.
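 ntpdate running as a cron job steps the clock, whereas ntpd normally slews
it: it speeds the clock up or slows it down until it is synchronized. Note
that ntpd will still step the clock if the offset exceeds its step
threshold (128 ms by default), which is what the "time reset" line in your
log shows; starting ntpd with -x raises that threshold to 600 s. The -g
option allows one arbitrarily large adjustment at the start of ntpd.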
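
 As a rough illustration of why slewing only suits small offsets, here is a
back-of-the-envelope calculation in Python. It assumes ntpd's usual maximum
slew rate of 500 ppm; the function name is made up for this example, and the
16.332117 s figure is the offset from the log quoted below.

```python
# Assumption: ntpd slews the clock at no more than 500 parts per million.
MAX_SLEW_PPM = 500.0


def slew_duration_seconds(offset_seconds, slew_ppm=MAX_SLEW_PPM):
    """Return how long it takes to remove an offset by slewing alone."""
    return abs(offset_seconds) / (slew_ppm * 1e-6)


if __name__ == "__main__":
    for offset in (0.1, 16.332117):
        hours = slew_duration_seconds(offset) / 3600.0
        print(f"{offset:.6f} s offset -> about {hours:.2f} h of slewing")
```

 Under that assumption a 0.1 s offset is gone in a few minutes, while an
offset of 16 s would take hours to slew away.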
 I have configured all cluster nodes to synchronize with each other via
ntpd (configured as peers), and each node additionally uses one (different)
stratum 1 or 2 source as its server. Since then I have not seen any
reconfiguration in the logs.
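
 To illustrate the guess in your mail that a step makes timeouts expire
early (forward step) or late (backward step), here is a minimal Python
sketch. It is not how openais is implemented; Deadline and SteppableClock
are hypothetical names, and the fake clock only stands in for a wall clock
that somebody steps with /bin/date or ntpd.

```python
import time


class SteppableClock:
    """Fake wall clock (hypothetical) that can be stepped like /bin/date does."""

    def __init__(self, start=1000.0):
        self.now = start

    def __call__(self):
        return self.now

    def advance(self, seconds):
        """Let real time pass."""
        self.now += seconds

    def step(self, seconds):
        """Jump the clock forwards (positive) or backwards (negative)."""
        self.now += seconds


class Deadline:
    """Timeout whose expiry is computed from whatever clock it is given."""

    def __init__(self, timeout, clock):
        self._clock = clock
        self._expires_at = clock() + timeout

    def expired(self):
        return self._clock() >= self._expires_at


def demo():
    # Forward step: a 10 s timeout expires at once, although no time passed.
    wall = SteppableClock()
    deadline = Deadline(10.0, wall)
    wall.step(20.0)
    forward_expired = deadline.expired()

    # Backward step: 12 s of real time pass, yet the 10 s timeout is pending.
    wall = SteppableClock()
    deadline = Deadline(10.0, wall)
    wall.step(-16.0)
    wall.advance(12.0)
    backward_expired = deadline.expired()

    # A monotonic clock is not affected by steps of the system time.
    monotonic_expired = Deadline(10.0, time.monotonic).expired()
    return forward_expired, backward_expired, monotonic_expired


if __name__ == "__main__":
    print(demo())
```

 Code that measures its timeouts against a monotonic clock would not show
either effect, which is why avoiding steps on the cluster nodes matters for
software that does not.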

On Fri, 16 Jul 2010 14:18:22 +0100, "Martin Waite"
<Martin.Waite at datacash.com> wrote:
> Hi,
> 
>  
> 
> During testing, I noticed that a time step caused by ntpd caused the
> cluster to drop into GATHER state:
> 
>  
> 
> Jun 16 12:13:16 cp1edidbm001 ntpd[30917]: time reset -16.332117 s
> Jun 16 12:13:26 cp1edidbm001 openais[15929]: [TOTEM] entering GATHER state from 12.
> Jun 16 12:13:26 cp1edidbm001 openais[15929]: [TOTEM] Creating commit token because I am the rep.
> Jun 16 12:13:26 cp1edidbm001 openais[15929]: [TOTEM] Saving state aru 9e high seq received 9e
> Jun 16 12:13:26 cp1edidbm001 openais[15929]: [TOTEM] Storing new sequence id for ring 328
> Jun 16 12:13:26 cp1edidbm001 openais[15929]: [TOTEM] entering COMMIT state.
> Jun 16 12:13:26 cp1edidbm001 openais[15929]: [TOTEM] entering RECOVERY state.
> 
> ...
> 
>  
> 
> This is easily repeatable through setting the clock forwards by 20
> seconds using /bin/date.  This probably causes comms timeouts to expire
> prematurely, and almost every time causes the cluster to reconfigure -
> luckily without affecting running services.
> 
>  
> 
> Stepping the clock backwards also causes a similar disruption, but there
> is a long lag between changing the time and the cluster reconfiguring:
> perhaps this extends a timeout or sleep on the affected node, causing
> genuine timeouts on the other nodes.
> 
>  
> 
> All I am looking for is some reassurance that clock changes are not
> going to crash the cluster.  Is anyone able to confirm this please ?
> 
>  
> 
> regards,
> 
> Martin