[389-users] replication broken again, with 389 MMR replication and TCP errors

Rich Megginson rmeggins at redhat.com
Fri Dec 11 23:05:41 UTC 2009


Note - changed the subject line for follow ups - I don't think the 
problem is a "corrupted database" - it's just that replication stops 
working for some reason.

john.bryson at oit.gatech.edu wrote:
>
> We are still having this issue, every so often. Sometimes a 389 
> database becomes unresponsive to queries, and the log shows errors  
> (full log below)
>
> We started a thread back in Nov, but didnt really get a resolution.
>    
> https://www.redhat.com/archives/fedora-directory-users/2009-November/msg00056.html 
>
>
> The only way to fix it seems to be to re-import a replication export 
> from a provider.
> I am new, so I'm not sure whether this is the right place to report 
> this, or should I submit a bug?
>
> Some background - we have 3 masters that replication to each other and 
> to 2 replication hubs. Those hubs replicate to each other and to 12 
> slaves in 2 data centers. We have host based firewalls, network based 
> firewalls, and routers, and our replications do pass through firewalls 
> and routers, even between masters. We have reduced,but still see a lot 
> of these errors in the 389 errors log:
>  [11/Dec/2009:00:06:19 -0500] - PR_Write(133) Netscape Portable 
> Runtime error -5961 (TCP connection reset by peer.)
>
> and we have worked to reduce these errors, but have not been able to 
> eliminate them entirely. So, based on that, I am concerned that there 
> are replication sessions that might be getting interrupted, and that 
> might be contributing to the problem.
Yes, I think that is the case.  I believe the connection errors cause 
the replication problems.  I believe this PR_Write is the send() of the 
response back to the replication supplier, which for some reason is not 
sent, because the send()er thinks the connection has been closed (tcp 
reset).  The supplier never completes the sequence, never receives the 
expected reply, and gets out of whack.
>
> The situation doesnt happen daily, or neccessarily correlate to all of 
> the TCP errors. But the situation could happen again during the next 
> week or so. What other information and logging should we be gathering 
> to figure out what is happening here? Should we create a bug on this?
Yes, please create a bug.  This is very similar to 
https://bugzilla.redhat.com/show_bug.cgi?id=488866 crash/assert in 
replication when supplier attempts to reconnect to consumer during init 
or update
>
> This is the log from this morning at 4am on zuber, a slave, where the 
> 'accounts' db was affected:
>
> zuber.iam.gatech.edu: cat errors
>        389-Directory/1.2.2 B2009.237.2054
>        zuber.iam.gatech.edu:636 (/etc/dirsrv/slapd-zuber)
>
> [11/Dec/2009:00:06:19 -0500] - PR_Write(133) Netscape Portable Runtime 
> error -5961 (TCP connection reset by peer.)
> [11/Dec/2009:00:06:19 -0500] - PR_Write(75) Netscape Portable Runtime 
> error -5961 (TCP connection reset by peer.)
> [11/Dec/2009:00:06:20 -0500] - PR_Write(161) Netscape Portable Runtime 
> error -5961 (TCP connection reset by peer.)
> [11/Dec/2009:00:10:43 -0500] - PR_Write(217) Netscape Portable Runtime 
> error -5961 (TCP connection reset by peer.)
> [11/Dec/2009:00:11:48 -0500] - PR_Write(246) Netscape Portable Runtime 
> error -5961 (TCP connection reset by peer.)
> [11/Dec/2009:00:14:20 -0500] - PR_Write(256) Netscape Portable Runtime 
> error -5961 (TCP connection reset by peer.)
> [11/Dec/2009:00:14:24 -0500] - PR_Write(218) Netscape Portable Runtime 
> error -5961 (TCP connection reset by peer.)
> [11/Dec/2009:00:14:35 -0500] - PR_Write(243) Netscape Portable Runtime 
> error -5961 (TCP connection reset by peer.)
> [11/Dec/2009:00:14:45 -0500] - PR_Write(264) Netscape Portable Runtime 
> error -5961 (TCP connection reset by peer.)
> [11/Dec/2009:00:14:54 -0500] - PR_Write(274) Netscape Portable Runtime 
> error -5961 (TCP connection reset by peer.)
> [11/Dec/2009:00:14:59 -0500] - PR_Write(284) Netscape Portable Runtime 
> error -5961 (TCP connection reset by peer.)
> [11/Dec/2009:00:15:00 -0500] - PR_Write(287) Netscape Portable Runtime 
> error -5961 (TCP connection reset by peer.)
> [11/Dec/2009:00:15:00 -0500] - PR_Write(220) Netscape Portable Runtime 
> error -5961 (TCP connection reset by peer.)
> [11/Dec/2009:04:34:00 -0500] - PR_Write(83) Netscape Portable Runtime 
> error -5961 (TCP connection reset by peer.)
> [11/Dec/2009:04:34:04 -0500] - PR_Write(224) Netscape Portable Runtime 
> error -5961 (TCP connection reset by peer.)
> [11/Dec/2009:04:34:07 -0500] - PR_Write(94) Netscape Portable Runtime 
> error -5961 (TCP connection reset by peer.)
> [11/Dec/2009:04:34:07 -0500] - PR_Write(144) Netscape Portable Runtime 
> error -5961 (TCP connection reset by peer.)
> [11/Dec/2009:04:34:08 -0500] - PR_Write(222) Netscape Portable Runtime 
> error -5961 (TCP connection reset by peer.)
> [11/Dec/2009:04:34:08 -0500] - PR_Write(227) Netscape Portable Runtime 
> error -5961 (TCP connection reset by peer.)
> [11/Dec/2009:04:34:08 -0500] - PR_Write(233) Netscape Portable Runtime 
> error -5961 (TCP connection reset by peer.)
Do you see a corresponding connection close in the access log at around 
the time of the PR_Write error?  If so, you could probably work 
backwards from there to figure out where the connection is from and what 
operations were going on.
> [11/Dec/2009:04:34:59 -0500] - ldbm: 'accounts' is already in the 
> middle of another task and cannot be disturbed.
I'm not sure what task it could be in the middle of here.  It's possible 
that the supplier is attempting to reinit the consumer, but the first 
init never completed.
> [11/Dec/2009:04:34:59 -0500] - slapi_start_bulk_import: failed; error 
> = -23
> [11/Dec/2009:04:35:00 -0500] NSMMReplicationPlugin - conn=2847000 
> op=178 
> replica="ou=accounts,ou=gtaccounts,ou=departments,dc=gted,dc=gatech,dc=edu": 
> Unable to acquire replica: error: internal error
> [11/Dec/2009:04:35:00 -0500] NSMMReplicationPlugin - 
> multimaster_be_state_change: replica 
> ou=accounts,ou=gtaccounts,ou=departments,dc=gted,dc=gatech,dc=edu is 
> going offline; disabling replication
> [11/Dec/2009:04:35:02 -0500] NSMMReplicationPlugin - 
> replica_replace_ruv_tombstone: failed to update replication update 
> vector for replica 
> ou=accounts,ou=gtaccounts,ou=departments,dc=gted,dc=gatech,dc=edu: 
> LDAP error - 1
> [11/Dec/2009:04:35:05 -0500] NSMMReplicationPlugin - 
> replica_replace_ruv_tombstone: failed to update replication update 
> vector for replica 
> ou=accounts,ou=gtaccounts,ou=departments,dc=gted,dc=gatech,dc=edu: 
> LDAP error - 1
> [11/Dec/2009:04:36:02 -0500] NSMMReplicationPlugin - 
> replica_replace_ruv_tombstone: failed to update replication update 
> vector for replica 
> ou=accounts,ou=gtaccounts,ou=departments,dc=gted,dc=gatech,dc=edu: 
> LDAP error - 1
> [11/Dec/2009:04:36:13 -0500] NSMMReplicationPlugin - 
> replica_replace_ruv_tombstone: failed to update replication update 
> vector for replica 
> ou=accounts,ou=gtaccounts,ou=departments,dc=gted,dc=gatech,dc=edu: 
> LDAP error - 1
> [11/Dec/2009:04:36:25 -0500] NSMMReplicationPlugin - 
> replica_replace_ruv_tombstone: failed to update replication update 
> vector for replica 
> ou=accounts,ou=gtaccounts,ou=departments,dc=gted,dc=gatech,dc=edu: 
> LDAP error - 1
> [11/Dec/2009:04:36:39 -0500] NSMMReplicationPlugin - 
> replica_replace_ruv_tombstone: failed to update replication update 
> vector for replica 
> ou=accounts,ou=gtaccounts,ou=departments,dc=gted,dc=gatech,dc=edu: 
> LDAP error - 1
> [11/Dec/2009:04:36:46 -0500] - PR_Write(220) Netscape Portable Runtime 
> error -5961 (TCP connection reset by peer.)
> [11/Dec/2009:04:37:09 -0500] NSMMReplicationPlugin - 
> replica_replace_ruv_tombstone: failed to update replication update 
> vector for replica 
> ou=accounts,ou=gtaccounts,ou=departments,dc=gted,dc=gatech,dc=edu: 
> LDAP error - 1
> [11/Dec/2009:04:37:12 -0500] NSMMReplicationPlugin - 
> replica_replace_ruv_tombstone: failed to update replication update 
> vector for replica 
> ou=accounts,ou=gtaccounts,ou=departments,dc=gted,dc=gatech,dc=edu: 
> LDAP error - 1
> [11/Dec/2009:04:37:15 -0500] NSMMReplicationPlugin - 
> replica_replace_ruv_tombstone: failed to update replication update 
> vector for replica 
> ou=accounts,ou=gtaccounts,ou=departments,dc=gted,dc=gatech,dc=edu: 
> LDAP error - 1
> [11/Dec/2009:04:37:35 -0500] NSMMReplicationPlugin - 
> replica_replace_ruv_tombstone: failed to update replication update 
> vector for replica 
> ou=accounts,ou=gtaccounts,ou=departments,dc=gted,dc=gatech,dc=edu: 
> LDAP error - 1
> [11/Dec/2009:04:38:03 -0500] NSMMReplicationPlugin - 
> replica_replace_ruv_tombstone: failed to update replication update 
> vector for replica 
> ou=accounts,ou=gtaccounts,ou=departments,dc=gted,dc=gatech,dc=edu: 
> LDAP error - 1
> [11/Dec/2009:04:38:09 -0500] NSMMReplicationPlugin - 
> replica_replace_ruv_tombstone: failed to update replication update 
> vector for replica 
> ou=accounts,ou=gtaccounts,ou=departments,dc=gted,dc=gatech,dc=edu: 
> LDAP error - 1
> [11/Dec/2009:04:38:31 -0500] NSMMReplicationPlugin - 
> replica_replace_ruv_tombstone: failed to update replication update 
> vector for replica 
> ou=accounts,ou=gtaccounts,ou=departments,dc=gted,dc=gatech,dc=edu: 
> LDAP error - 1
> [11/Dec/2009:04:38:37 -0500] NSMMReplicationPlugin - 
> replica_replace_ruv_tombstone: failed to update replication update 
> vector for replica 
> ou=accounts,ou=gtaccounts,ou=departments,dc=gted,dc=gatech,dc=edu: 
> LDAP error - 1
> [11/Dec/2009:04:39:33 -0500] NSMMReplicationPlugin - 
> replica_replace_ruv_tombstone: failed to update replication update 
> vector for replica 
> ou=accounts,ou=gtaccounts,ou=departments,dc=gted,dc=gatech,dc=edu: 
> LDAP error - 1
> [11/Dec/2009:04:39:37 -0500] NSMMReplicationPlugin - 
> replica_replace_ruv_tombstone: failed to update replication update 
> vector for replica 
> ou=accounts,ou=gtaccounts,ou=departments,dc=gted,dc=gatech,dc=edu: 
> LDAP error - 1
> [11/Dec/2009:04:39:44 -0500] NSMMReplicationPlugin - 
> replica_replace_ruv_tombstone: failed to update replication update 
> vector for replica 
> ou=accounts,ou=gtaccounts,ou=departments,dc=gted,dc=gatech,dc=edu: 
> LDAP error - 1
> [11/Dec/2009:04:39:46 -0500] NSMMReplicationPlugin - 
> replica_replace_ruv_tombstone: failed to update replication update 
> vector for replica 
> ou=accounts,ou=gtaccounts,ou=departments,dc=gted,dc=gatech,dc=edu: 
> LDAP error - 1
> [11/Dec/2009:04:39:49 -0500] NSMMReplicationPlugin - 
> replica_replace_ruv_tombstone: failed to update replication update 
> vector for replica 
> ou=accounts,ou=gtaccounts,ou=departments,dc=gted,dc=gatech,dc=edu: 
> LDAP error - 1
> [11/Dec/2009:04:39:52 -0500] NSMMReplicationPlugin - 
> replica_replace_ruv_tombstone: failed to update replication update 
> vector for replica 
> ou=accounts,ou=gtaccounts,ou=departments,dc=gted,dc=gatech,dc=edu: 
> LDAP error - 1
> [11/Dec/2009:04:39:56 -0500] NSMMReplicationPlugin - 
> replica_replace_ruv_tombstone: failed to update replication update 
> vector for replica 
> ou=accounts,ou=gtaccounts,ou=departments,dc=gted,dc=gatech,dc=edu: 
> LDAP error - 1
> [11/Dec/2009:04:39:59 -0500] NSMMReplicationPlugin - 
> replica_replace_ruv_tombstone: failed to update replication update 
> vector for replica 
> ou=accounts,ou=gtaccounts,ou=departments,dc=gted,dc=gatech,dc=edu: 
> LDAP error - 1
> [11/Dec/2009:04:40:07 -0500] NSMMReplicationPlugin - 
> replica_replace_ruv_tombstone: failed to update replication update 
> vector for replica 
> ou=accounts,ou=gtaccounts,ou=departments,dc=gted,dc=gatech,dc=edu: 
> LDAP error - 1
> [11/Dec/2009:04:40:12 -0500] NSMMReplicationPlugin - 
> replica_replace_ruv_tombstone: failed to update replication update 
> vector for replica 
> ou=accounts,ou=gtaccounts,ou=departments,dc=gted,dc=gatech,dc=edu: 
> LDAP error - 1
> [11/Dec/2009:04:40:15 -0500] NSMMReplicationPlugin - 
> replica_replace_ruv_tombstone: failed to update replication update 
> vector for replica 
> ou=accounts,ou=gtaccounts,ou=departments,dc=gted,dc=gatech,dc=edu: 
> LDAP error - 1
> [11/Dec/2009:04:40:20 -0500] NSMMReplicationPlugin - 
> replica_replace_ruv_tombstone: failed to update replication update 
> vector for replica 
> ou=accounts,ou=gtaccounts,ou=departments,dc=gted,dc=gatech,dc=edu: 
> LDAP error - 1
> [11/Dec/2009:04:40:25 -0500] NSMMReplicationPlugin - 
> replica_replace_ruv_tombstone: failed to update replication update 
> vector for replica 
> ou=accounts,ou=gtaccounts,ou=departments,dc=gted,dc=gatech,dc=edu: 
> LDAP error - 1
> [11/Dec/2009:04:40:31 -0500] NSMMReplicationPlugin - 
> replica_replace_ruv_tombstone: failed to update replication update 
> vector for replica 
> ou=accounts,ou=gtaccounts,ou=departments,dc=gted,dc=gatech,dc=edu: 
> LDAP error - 1
> [11/Dec/2009:04:40:37 -0500] NSMMReplicationPlugin - 
> replica_replace_ruv_tombstone: failed to update replication update 
> vector for replica 
> ou=accounts,ou=gtaccounts,ou=departments,dc=gted,dc=gatech,dc=edu: 
> LDAP error - 1
> [11/Dec/2009:04:40:46 -0500] NSMMReplicationPlugin - 
> replica_replace_ruv_tombstone: failed to update replication update 
> vector for replica 
> ou=accounts,ou=gtaccounts,ou=departments,dc=gted,dc=gatech,dc=edu: 
> LDAP error - 1
> [11/Dec/2009:04:40:51 -0500] - WARNING: Import is running with 
> nsslapd-db-private-import-mem on; No other process is allowed to 
> access the database
> [11/Dec/2009:04:40:51 -0500] - ERROR bulk import abandoned
> [11/Dec/2009:04:40:51 -0500] - import accounts: Aborting all import 
> threads...
> [11/Dec/2009:04:41:00 -0500] - import accounts: Import threads aborted.
> [11/Dec/2009:04:41:00 -0500] - import accounts: Closing files...
> [11/Dec/2009:04:41:00 -0500] - import accounts: Import failed.
> [11/Dec/2009:04:41:00 -0500] NSMMReplicationPlugin - 
> replica_replace_ruv_tombstone: failed to update replication update 
> vector for replica 
> ou=accounts,ou=gtaccounts,ou=departments,dc=gted,dc=gatech,dc=edu: 
> LDAP error - 1
> [11/Dec/2009:04:41:29 -0500] NSMMReplicationPlugin - 
> replica_replace_ruv_tombstone: failed to update replication update 
> vector for replica 
> ou=accounts,ou=gtaccounts,ou=departments,dc=gted,dc=gatech,dc=edu: 
> LDAP error - 1
>
>  (this error then repeats for hours)
>
> -- 
> 389 users mailing list
> 389-users at redhat.com
> https://www.redhat.com/mailman/listinfo/fedora-directory-users




More information about the Fedora-directory-users mailing list