[Freeipa-users] Replication failing on FreeIPA 4.2.0

Nathan Peters Nathan.Peters at globalrelay.net
Sun Jan 17 12:14:00 UTC 2016


After a bunch more troubleshooting I finally have logs that are error free on all 4 servers :-)

I couldn't find anything really useful on Google about this particular error : attrlist_replace - attr_replace (nsslapd-referral, ldap://ipadc.mydomain.net:389/o%3Dipaca) failed

So I am going to write about my experiences fixing it.  There was a clue in a thread here : https://www.redhat.com/archives/freeipa-users/2015-March/msg00699.html

But if you are like me and chose FreeIPA because you wanted to spend your time managing a lot of computers without worrying about the gory technical details of 389 directory server, the answer given in that thread needs some explaining.

On every domain controller in your network run this command : 

ldapsearch -D "cn=directory manager" -W -b "o=ipaca" "(&(objectclass=nstombstone)(nsUniqueId=ffffffff-ffffffff-ffffffff-ffffffff))" nscpentrywsi

In the output on each server, look for the following key, which tells you that server's current replica ID :

nscpentrywsi: nsDS5ReplicaId: 1195

Now look for the ruv entries that look like this : 

nscpentrywsi: nsds50ruv: {replica 1195 ldap://dc1-ipa-dev-nvan.mydomain
 .net:389} 569afd7c000004ab0000 569b5b0e000004ab0000

Any of those ruvs that have an id number after the word replica need to be deleted if the number doesn't match the current ID of one of your servers.  They are old entries from previously deleted agreements.  Don't delete any ID that one of your servers currently identifies itself as, though; that will crash the server.  Use the following ldapmodify command to delete the old ones (where 21 in CLEANRUV21 is the id number of the stale replica you want to delete) : 

ldapmodify -x -D "cn=directory manager" -W <<EOF
dn: cn=replica,cn=o\3Dipaca,cn=mapping tree,cn=config
changetype: modify
replace: nsds5task
nsds5task: CLEANRUV21
EOF
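
The comparison described above (RUV ids versus the ids your servers currently report) gets tedious by hand with several servers. Here is a minimal Python sketch, assuming you have saved the ldapsearch output from each server as text. The function names are my own and not part of any FreeIPA or 389 DS tool, and it only reports candidate ids; it does not delete anything.

```python
import re

# Patterns for the two values described above, matched case-insensitively.
RUV_RE = re.compile(r"nsds50ruv:\s*\{replica\s+(\d+)\s", re.IGNORECASE)
OWN_ID_RE = re.compile(r"nsDS5ReplicaId:\s*(\d+)", re.IGNORECASE)


def unfold_ldif(text):
    """Join LDIF continuation lines (a leading space continues the previous line)."""
    lines = []
    for raw in text.splitlines():
        if raw.startswith(" ") and lines:
            lines[-1] += raw[1:]
        else:
            lines.append(raw)
    return lines


def own_replica_id(text):
    """Return the replica id a server reports for itself, or None if absent."""
    for line in unfold_ldif(text):
        match = OWN_ID_RE.search(line)
        if match:
            return int(match.group(1))
    return None


def ruv_replica_ids(text):
    """Return every replica id that appears in an nsds50ruv value."""
    ids = set()
    for line in unfold_ldif(text):
        match = RUV_RE.search(line)
        if match:
            ids.add(int(match.group(1)))
    return ids


def stale_replica_ids(outputs):
    """Ids seen in any RUV that no server currently claims as its own.

    `outputs` must hold the ldapsearch output from EVERY server; if one is
    missing, its live id would wrongly be reported as stale.
    """
    live = {own_replica_id(text) for text in outputs} - {None}
    seen = set()
    for text in outputs:
        seen |= ruv_replica_ids(text)
    return sorted(seen - live)
```

Feed it one string per server, e.g. `stale_replica_ids([open(p).read() for p in paths])` where `paths` is a hypothetical list of saved output files, and double check the result against each server's nsDS5ReplicaId before running any CLEANRUV task.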

I noticed more strange behavior here because even after I deleted every old RUV, one of them came back all by itself.  I assumed it must be part of an agreement somewhere in the system and was getting re-created automatically, so I went hunting for more info.  I noticed that the number of distinct servers listed in the error log messages on each server exactly matched the number of maxcsn entries in the ldap output of the tombstone search on that server.  The entries looked like this : 

nscpentrywsi: nsds5agmtmaxcsn: o=ipaca;dc2-ipa-dev-van.mydomain.net-to-
 dc1-ipa-dev-nvan.mydomain.net;dc1-ipa-dev-nvan.mydomain.net;389
 ;unavailable
nscpentrywsi: nsds5agmtmaxcsn: o=ipaca;masterAgreement1-dc2-ipa-dev-nvan.mydomain.net-pki-tomcat;dc2-ipa-dev-nvan.mydomain.net;389;1095;569ae
 e5a000300380000

I could tell from the 'unavailable' value that replication was having trouble getting a CSN for those agreements, but I didn't know how to delete them safely with ldap syntax.  Luckily, the new 4.3.0 interface calls these maxcsn entries segments.  Removing them using the web ui is kind of roundabout, but works eventually.  On each server, go to the web ui and, one at a time, delete and re-create all segments in the ca topology USING THE TEXT BASED ONE, NOT THE GRAPHIC ONE (this requires domain level 1).  This works because the command to delete a domain level segment also doubles as a command to clean local segments that are still in the old local part of the ldap tree from domain level 0. 

You still have to repeat it on each server (which is kind of funny because you are deleting the domain level objects multiple times, but only because you need to cause the local trigger on each server).

I noticed that after re-creation the names of the maxcsn entries in that ldap query result are much more uniform.  There are no 'masterAgreement' csn types, all member servers that are not the CA master have no entries at all, even after replication, and on the master, they are all labelled with the -to- syntax instead of the pki syntax.  I also noticed that some of my old invalid agreements had the same server name on both sides of the -to- and now they all perfectly match the segment names in the web ui.
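
For anyone who wants to check their own maxcsn entries for the same symptoms, here is a rough Python sketch. It assumes the value has already been unfolded onto one line and that the fields are semicolon separated in the order seen in my output earlier (suffix, agreement name, host, port, then either 'unavailable' or a replica id and CSN). That field layout is inferred from my output, not from documentation, and the names in the code are my own.

```python
from typing import NamedTuple, Optional


class MaxCsn(NamedTuple):
    suffix: str
    agreement: str
    host: str
    port: int
    replica_id: Optional[int]  # None when the value ends in 'unavailable'
    csn: Optional[str]         # None when the value ends in 'unavailable'


def parse_maxcsn(value):
    """Split one unfolded nsds5agmtmaxcsn value into its fields."""
    parts = [part.strip() for part in value.split(";")]
    if len(parts) == 5 and parts[4] == "unavailable":
        suffix, agreement, host, port, _ = parts
        return MaxCsn(suffix, agreement, host, int(port), None, None)
    if len(parts) == 6:
        suffix, agreement, host, port, replica_id, csn = parts
        return MaxCsn(suffix, agreement, host, int(port), int(replica_id), csn)
    raise ValueError("unexpected nsds5agmtmaxcsn layout: %r" % value)


def is_unavailable(entry):
    """True when the agreement has no CSN recorded."""
    return entry.csn is None


def same_host_both_sides(entry):
    """True when an 'a-to-b' agreement name has the same host on both sides.

    Assumes the host names themselves do not contain '-to-'.
    """
    left, sep, right = entry.agreement.partition("-to-")
    return bool(sep) and left == right
```

Entries flagged by either check are only candidates to compare against the segment names in the web ui, not proof that something is broken.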

I'm assuming all the bugs in 4.1.4 and 4.2.0 and 4.2.3 created a lot of garbage entries.

Luckily, with the tools in 4.3.0 those can all be removed.

I have now been staring at logs that have zero errors for over 30 minutes, and I was previously getting hundreds per second.

Although this is great news for me, it is not great news for anyone on a CentOS or RHEL machine who is experiencing the category of bugs (there were definitely multiple ones) that I encountered trying to fix these replication issues, because there is no upgrade path to 4.3.0 without switching to Fedora.

-----Original Message-----
From: Nathan Peters 
Sent: January-17-16 1:10 AM
To: Nathan Peters
Cc: freeipa-users at redhat.com
Subject: RE: [Freeipa-users] Replication failing on FreeIPA 4.2.0

After some amount of work, I was able to get my system back to a state where it seems to be replicating ok, but not with FreeIPA 4.2.0.  Because this was a production system with several hundred users and computers attached to it, a wipe of the domain was not an option so I decided to chance that the new replication topology features would help.

I replaced each CentOS 7 domain controller with a Fedora 23 FreeIPA 4.2.3 host and while doing so I noticed an odd behavior of the RUVs.  I know about the current bug where deleting a replica doesn't delete its RUV and I experienced that. I would run a command like this :

dn: cn=clean 4, cn=cleanallruv, cn=tasks, cn=config
objectclass: top
objectclass: extensibleObject
replica-base-dn: dc=mydomain,dc=net
replica-id: 4
replica-force-cleaning: yes
cn: clean 4

It would fail only if I was not in a current agreement with the new Fedora replica for that host.  I.e., if the old CentOS host had a replica id of 4 and the new Fedora host had 15, and I was in an agreement with 15, that LDIF would delete 4, but if I was not in an agreement with 15, it would fail.
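
When there are several old replica ids to clear, typing that task entry out each time is error prone. A small Python sketch that just generates the same text for a given id is below. The function name and parameters are my own, the attribute names are copied from the entry above, and the output still has to be added to the directory yourself (for example with ldapadd).

```python
def cleanallruv_task_ldif(replica_id, base_dn, force=True):
    """Build the text of a cleanallruv task entry for one replica id."""
    replica_id = int(replica_id)
    if replica_id <= 0:
        raise ValueError("replica id must be a positive integer")
    name = "clean %d" % replica_id
    lines = [
        "dn: cn=%s, cn=cleanallruv, cn=tasks, cn=config" % name,
        "objectclass: top",
        "objectclass: extensibleObject",
        "replica-base-dn: %s" % base_dn,
        "replica-id: %d" % replica_id,
    ]
    if force:
        # Only emitted when requested; left out entirely otherwise.
        lines.append("replica-force-cleaning: yes")
    lines.append("cn: %s" % name)
    return "\n".join(lines) + "\n"
```

Calling `cleanallruv_task_ldif(4, "dc=mydomain,dc=net")` gives back the entry shown above.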

After a while I had every server in an agreement with all others and got all the old RUVs cleared.

I was still experiencing strange error messages in my logs with FreeIPA 4.2.3 so I decided to go all the way to 4.3.0.

Here are the 4.2.3 errors :

[16/Jan/2016:22:29:12 -0800] NSMMReplicationPlugin - replica_replace_ruv_tombstone: failed to update replication update vector for replica dc=mydomain,dc=net: LDAP error - 53
[16/Jan/2016:22:29:13 -0800] NSMMReplicationPlugin - agmt_delete: begin
[16/Jan/2016:22:32:51 -0800] slapi_ldap_bind - Error: could not bind id [cn=Replication Manager masterAgreement1-dc2-ipa-dev-van.mydomain.net-pki-tomcat,ou=csusers,cn=config] authentication mechanism [SIMPLE]: error 32 (No such object) errno 0 (Success)

On 4 servers, 3 upgrades to 4.3.0 went smoothly, and 1 just hung during the %post section of the dnf install for an hour with the ns-slapd process taking 100% cpu on all 4 cores until I stopped it.  A subsequent ipa-server-upgrade fixed everything.

With the new replication topology management graphs and controls in the ui, I was able to find some missing segments and replace some that were for some reason only 1 way.

Replication seems to actually be proceeding smoothly and now instead of getting the hundreds of error log entries per second that I had reported in my earlier posts, I am only getting about 3 every 5 minutes.  The bugs that were present in 4.2.0 and 4.2.3 seem to be almost entirely gone.

I have run the new topology suffix verification commands and they say everything is ok.

I still get these errors in batches of 3, but they don't seem to be doing anything harmful in terms of my system's ability to operate and replicate properly :

[17/Jan/2016:01:07:27 -0800] attrlist_replace - attr_replace (nsslapd-referral, ldap://dc1-ipa-dev-nvan.mydomain.net:389/o%3Dipaca) failed.
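
To put a number on "about 3 every 5 minutes" instead of eyeballing the log, something like the following Python sketch can count matching lines per minute. It assumes the timestamp format shown in the log lines in this thread and English month abbreviations; the function name and the default search string are my own choices.

```python
import re
from collections import Counter
from datetime import datetime

# Leading timestamp as it appears in the log lines above,
# e.g. [17/Jan/2016:01:07:27 -0800]
TS_RE = re.compile(r"^\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} [+-]\d{4})\]")


def errors_per_minute(lines, needle="attrlist_replace"):
    """Count log lines containing `needle`, grouped by minute of the timestamp."""
    counts = Counter()
    for line in lines:
        if needle not in line:
            continue
        match = TS_RE.match(line)
        if not match:
            continue  # skip lines without a recognizable timestamp
        stamp = datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")
        counts[stamp.strftime("%Y-%m-%d %H:%M")] += 1
    return counts
```

Pass it the lines of the directory server errors log, for example `errors_per_minute(open(path))` with `path` pointing at wherever your instance writes that log.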

-----Original Message-----
From: freeipa-users-bounces at redhat.com [mailto:freeipa-users-bounces at redhat.com] On Behalf Of Nathan Peters
Sent: January-15-16 10:00 AM
To: Ludwig Krispenz
Cc: freeipa-users at redhat.com
Subject: Re: [Freeipa-users] Replication failing on FreeIPA 4.2.0

No dice on the rebuild and RUV cleaning. I'm still getting a pile of these on dc1-van : 

[15/Jan/2016:17:55:25 +0000] NSMMReplicationPlugin - agmt="cn=meTodc1-ipa-dev-nvan.mydomain.net" (dc1-ipa-dev-nvan:389): Skipping update operation with no message_id (uniqueid 6e6784a0-b5c911e5-b1f1cd78-f19552bb, CSN 569932db000000040000):

I'm also getting these on dc1-nvan: 

[15/Jan/2016:17:45:36 +0000] attrlist_replace - attr_replace (nsslapd-referral, ldap://dc1-ipa-dev-van.mydomain.net:389/o%3Dipaca) failed.




-----Original Message-----
From: Ludwig Krispenz [mailto:lkrispen at redhat.com] 
Sent: January-15-16 12:19 AM
To: Nathan Peters
Cc: Rob Crittenden; freeipa-users at redhat.com
Subject: Re: [Freeipa-users] Replication failing on FreeIPA 4.2.0


On 01/15/2016 08:32 AM, Nathan Peters wrote:
> I think I've finally started to make some progress on this.  I did a lot of googling and found some stuff to run manually in 389 ds through ldapmodify commands to clean RUVs.  During this process the server crashed and when it came back online, suddenly all my ghost RUVs were visible through ipa-replica-manage list-ruv.  It was really strange, I had like 5 of them from winsync agreements that kept failing and needing re-initialization, and another 5 from my earlier re-installations of the 2 other domain controllers.
>
> I ran some more ruv cleanup commands through ldap and they all appear to be gone.  I'm not sure how the crash suddenly made them visible though or why they had to be cleaned through ldapmodify directly and ipa-replica-manage could neither see nor clean them.
After a crash the RUV could be rebuilt from the changelog, and the changelog could contain references to cleaned ReplicaIds, and so they came to life again. The cleanallruv task was enhanced to also clean the changelog, but this fix is in 1.3.4.2+.

