[Freeipa-devel] [PATCH] 1031 run cleanallruv task

Martin Kosek mkosek at redhat.com
Fri Sep 7 08:34:07 UTC 2012


On Thu, 2012-09-06 at 17:17 -0400, Rob Crittenden wrote:
> Martin Kosek wrote:
> > On 09/06/2012 05:55 PM, Rob Crittenden wrote:
> >> Rob Crittenden wrote:
> >>> Rob Crittenden wrote:
> >>>> Martin Kosek wrote:
> >>>>> On 09/05/2012 08:06 PM, Rob Crittenden wrote:
> >>>>>> Rob Crittenden wrote:
> >>>>>>> Martin Kosek wrote:
> >>>>>>>> On 07/05/2012 08:39 PM, Rob Crittenden wrote:
> >>>>>>>>> Martin Kosek wrote:
> >>>>>>>>>> On 07/03/2012 04:41 PM, Rob Crittenden wrote:
> >>>>>>>>>>> Deleting a replica can leave a Replication Update Vector (RUV) on
> >>>>>>>>>>> the other servers. This can confuse things if the replica is
> >>>>>>>>>>> re-added, and it also causes the server to calculate changes
> >>>>>>>>>>> against a server that may no longer exist.
> >>>>>>>>>>>
> >>>>>>>>>>> 389-ds-base provides a new task that propagates itself to all
> >>>>>>>>>>> available replicas to clean this RUV data.
> >>>>>>>>>>>
> >>>>>>>>>>> This patch will create this task at deletion time to hopefully
> >>>>>>>>>>> clean things up.
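> >>>>>>>>>>>
> >>>>>>>>>>> For reference, the task itself is just an entry added under
> >>>>>>>>>>> cn=cleanallruv,cn=tasks,cn=config on one master and 389-ds takes
> >>>>>>>>>>> care of spreading it. Roughly like this (a python-ldap sketch, not
> >>>>>>>>>>> the exact code in the patch; conn, replica_id and suffix are
> >>>>>>>>>>> assumed to be the bound connection, the replica id to clean and
> >>>>>>>>>>> the replicated suffix):
> >>>>>>>>>>>
> >>>>>>>>>>>     import ldap.modlist
> >>>>>>>>>>>
> >>>>>>>>>>>     # conn: an LDAPObject already bound as Directory Manager
> >>>>>>>>>>>     dn = "cn=clean %d,cn=cleanallruv,cn=tasks,cn=config" % replica_id
> >>>>>>>>>>>     entry = {
> >>>>>>>>>>>         'objectclass': ['top', 'extensibleObject'],
> >>>>>>>>>>>         'cn': ['clean %d' % replica_id],
> >>>>>>>>>>>         'replica-base-dn': [suffix],
> >>>>>>>>>>>         'replica-id': [str(replica_id)],
> >>>>>>>>>>>     }
> >>>>>>>>>>>     # adding the entry is what starts the task; it then runs on its own
> >>>>>>>>>>>     conn.add_s(dn, ldap.modlist.addModlist(entry))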
> >>>>>>>>>>>
> >>>>>>>>>>> It isn't perfect. If any replica is down or unavailable at the
> >>>>>>>>>>> time the cleanruv task fires, and then comes back up, the old
> >>>>>>>>>>> RUV data may be re-propagated around.
> >>>>>>>>>>>
> >>>>>>>>>>> To make things easier in this case I've added two new commands to
> >>>>>>>>>>> ipa-replica-manage. The first lists the replication ids of all the
> >>>>>>>>>>> servers we have a RUV for. Using this you can call clean_ruv with
> >>>>>>>>>>> the replication id of a server that no longer exists to try the
> >>>>>>>>>>> cleanallruv step again.
> >>>>>>>>>>>
> >>>>>>>>>>> This is quite dangerous though. If you run cleanruv against a
> >>>>>>>>>>> replica id that does exist it can cause a loss of data. I believe
> >>>>>>>>>>> I've put in enough scary warnings about this.
> >>>>>>>>>>>
> >>>>>>>>>>> rob
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Good work there, this should make cleaning RUVs much easier than
> >>>>>>>>>> with the previous version.
> >>>>>>>>>>
> >>>>>>>>>> This is what I found during review:
> >>>>>>>>>>
> >>>>>>>>>> 1) The list_ruv and clean_ruv command help gets quite lost in the
> >>>>>>>>>> man page. I think it would help if we, for example, indented all
> >>>>>>>>>> the per-command info. As it is, a user could easily overlook the
> >>>>>>>>>> new commands in the man page.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> 2) I would rename the new commands to clean-ruv and list-ruv to
> >>>>>>>>>> make them consistent with the rest of the commands (re-initialize,
> >>>>>>>>>> force-sync).
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> 3) It would be nice to be able to run the clean_ruv command in an
> >>>>>>>>>> unattended way (for better testing), i.e. respect the --force
> >>>>>>>>>> option as we already do for ipa-replica-manage del. This fix would
> >>>>>>>>>> aid test automation in the future.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> 4) (minor) The new question (and the del one too) does not react
> >>>>>>>>>> too well to CTRL+D:
> >>>>>>>>>>
> >>>>>>>>>> # ipa-replica-manage clean_ruv 3 --force
> >>>>>>>>>> Clean the Replication Update Vector for
> >>>>>>>>>> vm-055.idm.lab.bos.redhat.com:389
> >>>>>>>>>>
> >>>>>>>>>> Cleaning the wrong replica ID will cause that server to no
> >>>>>>>>>> longer replicate so it may miss updates while the process
> >>>>>>>>>> is running. It would need to be re-initialized to maintain
> >>>>>>>>>> consistency. Be very careful.
> >>>>>>>>>> Continue to clean? [no]: unexpected error:
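> >>>>>>>>>>
> >>>>>>>>>> CTRL+D makes raw_input raise EOFError, so presumably wrapping the
> >>>>>>>>>> prompt would be enough -- just a sketch, assuming user_input uses
> >>>>>>>>>> raw_input underneath:
> >>>>>>>>>>
> >>>>>>>>>>     try:
> >>>>>>>>>>         if not ipautil.user_input("Continue to clean?", False):
> >>>>>>>>>>             sys.exit("Aborted")
> >>>>>>>>>>     except EOFError:
> >>>>>>>>>>         # treat end-of-input the same as answering "no"
> >>>>>>>>>>         sys.exit("Aborted")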
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> 5) Help for the clean_ruv command without the required parameter
> >>>>>>>>>> is quite confusing, as it reports that the command is wrong rather
> >>>>>>>>>> than that the parameter is missing:
> >>>>>>>>>>
> >>>>>>>>>> # ipa-replica-manage clean_ruv
> >>>>>>>>>> Usage: ipa-replica-manage [options]
> >>>>>>>>>>
> >>>>>>>>>> ipa-replica-manage: error: must provide a command [clean_ruv |
> >>>>>>>>>> force-sync |
> >>>>>>>>>> disconnect | connect | del | re-initialize | list | list_ruv]
> >>>>>>>>>>
> >>>>>>>>>> It seems you just forgot to specify the error message in the
> >>>>>>>>>> command definition.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> 6) When the remote replica is down, the clean_ruv command fails
> >>>>>>>>>> with an unexpected error:
> >>>>>>>>>>
> >>>>>>>>>> [root at vm-086 ~]# ipa-replica-manage clean_ruv 5
> >>>>>>>>>> Clean the Replication Update Vector for
> >>>>>>>>>> vm-055.idm.lab.bos.redhat.com:389
> >>>>>>>>>>
> >>>>>>>>>> Cleaning the wrong replica ID will cause that server to no
> >>>>>>>>>> longer replicate so it may miss updates while the process
> >>>>>>>>>> is running. It would need to be re-initialized to maintain
> >>>>>>>>>> consistency. Be very careful.
> >>>>>>>>>> Continue to clean? [no]: y
> >>>>>>>>>> unexpected error: {'desc': 'Operations error'}
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> /var/log/dirsrv/slapd-IDM-LAB-BOS-REDHAT-COM/errors:
> >>>>>>>>>> [04/Jul/2012:06:28:16 -0400] NSMMReplicationPlugin - cleanAllRUV_task:
> >>>>>>>>>> failed to connect to repl agreement connection
> >>>>>>>>>> (cn=meTovm-055.idm.lab.bos.redhat.com,cn=replica,
> >>>>>>>>>> cn=dc\3Didm\2Cdc\3Dlab\2Cdc\3Dbos\2Cdc\3Dredhat\2Cdc\3Dcom,cn=mapping
> >>>>>>>>>> tree,cn=config), error 105
> >>>>>>>>>> [04/Jul/2012:06:28:16 -0400] NSMMReplicationPlugin - cleanAllRUV_task:
> >>>>>>>>>> replica (cn=meTovm-055.idm.lab.bos.redhat.com,cn=replica,
> >>>>>>>>>> cn=dc\3Didm\2Cdc\3Dlab\2Cdc\3Dbos\2Cdc\3Dredhat\2Cdc\3Dcom,cn=mapping
> >>>>>>>>>> tree,cn=config) has not been cleaned.  You will need to rerun the
> >>>>>>>>>> CLEANALLRUV task on this replica.
> >>>>>>>>>> [04/Jul/2012:06:28:16 -0400] NSMMReplicationPlugin - cleanAllRUV_task:
> >>>>>>>>>> Task failed (1)
> >>>>>>>>>>
> >>>>>>>>>> In this case I think we should inform the user that the command
> >>>>>>>>>> failed, possibly because of disconnected replicas, and that they
> >>>>>>>>>> could bring the replicas back online and try again.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> 7) (minor) "pass" is now redundant in replication.py:
> >>>>>>>>>> +        except ldap.INSUFFICIENT_ACCESS:
> >>>>>>>>>> +            # We can't make the server we're removing read-only
> >>>>>>>>>> but
> >>>>>>>>>> +            # this isn't a show-stopper
> >>>>>>>>>> +            root_logger.debug("No permission to switch replica to
> >>>>>>>>>> read-only,
> >>>>>>>>>> continuing anyway")
> >>>>>>>>>> +            pass
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> I think this addresses everything.
> >>>>>>>>>
> >>>>>>>>> rob
> >>>>>>>>
> >>>>>>>> Thanks, almost there! I just found one more issue which needs to be
> >>>>>>>> fixed
> >>>>>>>> before we push:
> >>>>>>>>
> >>>>>>>> # ipa-replica-manage del vm-055.idm.lab.bos.redhat.com --force
> >>>>>>>> Directory Manager password:
> >>>>>>>>
> >>>>>>>> Unable to connect to replica vm-055.idm.lab.bos.redhat.com, forcing
> >>>>>>>> removal
> >>>>>>>> Failed to get data from 'vm-055.idm.lab.bos.redhat.com': {'desc':
> >>>>>>>> "Can't
> >>>>>>>> contact LDAP server"}
> >>>>>>>> Forcing removal on 'vm-086.idm.lab.bos.redhat.com'
> >>>>>>>>
> >>>>>>>> There were issues removing a connection: %d format: a number is
> >>>>>>>> required, not str
> >>>>>>>>
> >>>>>>>> Failed to get data from 'vm-055.idm.lab.bos.redhat.com': {'desc':
> >>>>>>>> "Can't
> >>>>>>>> contact LDAP server"}
> >>>>>>>>
> >>>>>>>> This is a traceback I retrieved:
> >>>>>>>> Traceback (most recent call last):
> >>>>>>>>      File "/sbin/ipa-replica-manage", line 425, in del_master
> >>>>>>>>        del_link(realm, r, hostname, options.dirman_passwd, force=True)
> >>>>>>>>      File "/sbin/ipa-replica-manage", line 271, in del_link
> >>>>>>>>        repl1.cleanallruv(replica_id)
> >>>>>>>>      File
> >>>>>>>> "/usr/lib/python2.7/site-packages/ipaserver/install/replication.py",
> >>>>>>>> line 1094, in cleanallruv
> >>>>>>>>        root_logger.debug("Creating CLEANALLRUV task for replica id
> >>>>>>>> %d" %
> >>>>>>>> replicaId)
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> The problem here is that you don't convert replica_id to int in this
> >>>>>>>> part:
> >>>>>>>> +    replica_id = None
> >>>>>>>> +    if repl2:
> >>>>>>>> +        replica_id = repl2._get_replica_id(repl2.conn, None)
> >>>>>>>> +    else:
> >>>>>>>> +        servers = get_ruv(realm, replica1, dirman_passwd)
> >>>>>>>> +        for (netloc, rid) in servers:
> >>>>>>>> +            if netloc.startswith(replica2):
> >>>>>>>> +                replica_id = rid
> >>>>>>>> +                break
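> >>>>>>>>
> >>>>>>>> A one-line conversion in the loop should be enough, something like
> >>>>>>>> this (just a sketch):
> >>>>>>>>
> >>>>>>>>     for (netloc, rid) in servers:
> >>>>>>>>         if netloc.startswith(replica2):
> >>>>>>>>             # rid comes back as a string from the RUV; cast it so the
> >>>>>>>>             # later "replica id %d" formatting works
> >>>>>>>>             replica_id = int(rid)
> >>>>>>>>             break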
> >>>>>>>>
> >>>>>>>> Martin
> >>>>>>>>
> >>>>>>>
> >>>>>>> Updated patch using new mechanism in 389-ds-base. This should more
> >>>>>>> thoroughly clean out RUV data when a replica is being deleted, and
> >>>>>>> provide for a way to delete RUV data afterwards too if necessary.
> >>>>>>>
> >>>>>>> rob
> >>>>>>
> >>>>>> Rebased patch
> >>>>>>
> >>>>>> rob
> >>>>>>
> >>>>>
> >>>>> 0) As I wrote in a review for your patch 1041, the changelog entry
> >>>>> slipped elsewhere.
> >>>>>
> >>>>> 1) The following KeyboardInterrupt except clause looks suspicious. I
> >>>>> know why you have it there, but since it is generally a bad thing to
> >>>>> do, some comment explaining why it is needed would be useful.
> >>>>>
> >>>>> @@ -256,6 +263,17 @@ def del_link(realm, replica1, replica2, dirman_passwd, force=False):
> >>>>>        repl1.delete_agreement(replica2)
> >>>>>        repl1.delete_referral(replica2)
> >>>>>
> >>>>> +    if type1 == replication.IPA_REPLICA:
> >>>>> +        if repl2:
> >>>>> +            ruv = repl2._get_replica_id(repl2.conn, None)
> >>>>> +        else:
> >>>>> +            ruv = get_ruv_by_host(realm, replica1, replica2, dirman_passwd)
> >>>>> +
> >>>>> +        try:
> >>>>> +            repl1.cleanallruv(ruv)
> >>>>> +        except KeyboardInterrupt:
> >>>>> +            pass
> >>>>> +
> >>>>>
> >>>>> Maybe you just wanted to do some cleanup and then "raise" again?
> >>>>
> >>>> No, it is there because it is safe to break out of it. The task will
> >>>> continue to run. I added some verbiage.
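> >>>>
> >>>> Roughly the idea (a sketch, not the literal wording in the patch):
> >>>>
> >>>>     try:
> >>>>         repl1.cleanallruv(ruv)
> >>>>     except KeyboardInterrupt:
> >>>>         # The CLEANALLRUV task runs on the server, so interrupting here
> >>>>         # only stops our wait for it; the cleanup itself continues.
> >>>>         pass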
> >>>>
> >>>>>
> >>>>> 2) This is related to 1): when some remote replica is down,
> >>>>> "ipa-replica-manage del" may wait indefinitely, right?
> >>>>>
> >>>>> # ipa-replica-manage del vm-055.idm.lab.bos.redhat.com
> >>>>> Deleting a master is irreversible.
> >>>>> To reconnect to the remote master you will need to prepare a new
> >>>>> replica file
> >>>>> and re-install.
> >>>>> Continue to delete? [no]: y
> >>>>> ipa: INFO: Setting agreement
> >>>>> cn=meTovm-086.idm.lab.bos.redhat.com,cn=replica,cn=dc\=idm\,dc\=lab\,dc\=bos\,dc\=redhat\,dc\=com,cn=mapping
> >>>>> tree,cn=config schedule to 2358-2359 0 to force synch
> >>>>> ipa: INFO: Deleting schedule 2358-2359 0 from agreement
> >>>>> cn=meTovm-086.idm.lab.bos.redhat.com,cn=replica,cn=dc\=idm\,dc\=lab\,dc\=bos\,dc\=redhat\,dc\=com,cn=mapping
> >>>>> tree,cn=config
> >>>>> ipa: INFO: Replication Update in progress: FALSE: status: 0 Replica
> >>>>> acquired
> >>>>> successfully: Incremental update succeeded: start: 0: end: 0
> >>>>> Background task created to clean replication data
> >>>>>
> >>>>> ... after about a minute I hit CTRL+C
> >>>>>
> >>>>> ^CDeleted replication agreement from 'vm-086.idm.lab.bos.redhat.com' to
> >>>>> 'vm-055.idm.lab.bos.redhat.com'
> >>>>> Failed to cleanup vm-055.idm.lab.bos.redhat.com DNS entries: NS record
> >>>>> does not
> >>>>> contain 'vm-055.idm.lab.bos.redhat.com.'
> >>>>> You may need to manually remove them from the tree
> >>>>>
> >>>>> I think it would be better to inform the user that some remote replica
> >>>>> is down, or at least that we are waiting for the task to complete.
> >>>>> Something like this:
> >>>>>
> >>>>> # ipa-replica-manage del vm-055.idm.lab.bos.redhat.com
> >>>>> ...
> >>>>> Background task created to clean replication data
> >>>>> Replication data clean up may take a very long time if some replica is
> >>>>> unreachable
> >>>>> Hit CTRL+C to interrupt the wait
> >>>>> ^C Clean up wait interrupted
> >>>>> ....
> >>>>> [continue with del]
> >>>>
> >>>> Yup, did this in #1.
> >>>>
> >>>>>
> >>>>> 3) (minor) When there is a cleanruv task already running and you run
> >>>>> "ipa-replica-manage del", there is an unexpected error message about a
> >>>>> duplicate task object in LDAP:
> >>>>>
> >>>>> # ipa-replica-manage del vm-072.idm.lab.bos.redhat.com --force
> >>>>> Unable to connect to replica vm-072.idm.lab.bos.redhat.com, forcing
> >>>>> removal
> >>>>> FAIL
> >>>>> Failed to get data from 'vm-072.idm.lab.bos.redhat.com': {'desc': "Can't
> >>>>> contact LDAP server"}
> >>>>> Forcing removal on 'vm-086.idm.lab.bos.redhat.com'
> >>>>>
> >>>>> There were issues removing a connection: This entry already exists
> >>>>> <<<<<<<<<
> >>>>>
> >>>>> Failed to get data from 'vm-072.idm.lab.bos.redhat.com': {'desc': "Can't
> >>>>> contact LDAP server"}
> >>>>> Failed to cleanup vm-072.idm.lab.bos.redhat.com DNS entries: NS record
> >>>>> does not
> >>>>> contain 'vm-072.idm.lab.bos.redhat.com.'
> >>>>> You may need to manually remove them from the tree
> >>>>>
> >>>>>
> >>>>> I think it should be enough to just catch "entry already exists" in the
> >>>>> cleanallruv function, and in that case print a relevant error message
> >>>>> and bail out. That way, self.conn.checkTask(dn, dowait=True) would not
> >>>>> be called either.
> >>>>
> >>>> Good catch, fixed.
> >>>>
> >>>>>
> >>>>>
> >>>>> 4) (minor): In the make_readonly function, there is a redundant "pass"
> >>>>> statement:
> >>>>>
> >>>>> +    def make_readonly(self):
> >>>>> +        """
> >>>>> +        Make the current replication agreement read-only.
> >>>>> +        """
> >>>>> +        dn = DN(('cn', 'userRoot'), ('cn', 'ldbm database'),
> >>>>> +                ('cn', 'plugins'), ('cn', 'config'))
> >>>>> +
> >>>>> +        mod = [(ldap.MOD_REPLACE, 'nsslapd-readonly', 'on')]
> >>>>> +        try:
> >>>>> +            self.conn.modify_s(dn, mod)
> >>>>> +        except ldap.INSUFFICIENT_ACCESS:
> >>>>> +            # We can't make the server we're removing read-only but
> >>>>> +            # this isn't a show-stopper
> >>>>> +            root_logger.debug("No permission to switch replica to
> >>>>> read-only,
> >>>>> continuing anyway")
> >>>>> +            pass         <<<<<<<<<<<<<<<
> >>>>
> >>>> Yeah, this is one of my common mistakes. I put in a pass initially, then
> >>>> add logging in front of it and forget to delete the pass. It's gone now.
> >>>>
> >>>>>
> >>>>>
> >>>>> 5) In clean_ruv, I think allowing a --force option to bypass the
> >>>>> user_input would be helpful (at least for test automation):
> >>>>>
> >>>>> +    if not ipautil.user_input("Continue to clean?", False):
> >>>>> +        sys.exit("Aborted")
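> >>>>>
> >>>>> i.e. something along these lines (a sketch, assuming the existing
> >>>>> --force option is available as options.force here):
> >>>>>
> >>>>>     if not options.force and not ipautil.user_input("Continue to clean?",
> >>>>>                                                     False):
> >>>>>         sys.exit("Aborted")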
> >>>>
> >>>> Yup, added.
> >>>>
> >>>> rob
> >>>
> >>> Slightly revised patch. I still had a window open with one unsaved change.
> >>>
> >>> rob
> >>>
> >>
> >> Apparently there were two unsaved changes, one of which was lost. This adds in
> >> the 'entry already exists' fix.
> >>
> >> rob
> >>
> >
> > Just one last thing (otherwise the patch is OK) - I don't think this is what we
> > want :-)
> >
> > # ipa-replica-manage clean-ruv 8
> > Clean the Replication Update Vector for vm-055.idm.lab.bos.redhat.com:389
> >
> > Cleaning the wrong replica ID will cause that server to no
> > longer replicate so it may miss updates while the process
> > is running. It would need to be re-initialized to maintain
> > consistency. Be very careful.
> > Continue to clean? [no]: y   <<<<<<
> > Aborted
> >
> >
> > Nor this exception (you are checking for the wrong exception):
> >
> > # ipa-replica-manage clean-ruv 8
> > Clean the Replication Update Vector for vm-055.idm.lab.bos.redhat.com:389
> >
> > Cleaning the wrong replica ID will cause that server to no
> > longer replicate so it may miss updates while the process
> > is running. It would need to be re-initialized to maintain
> > consistency. Be very careful.
> > Continue to clean? [no]:
> > unexpected error: This entry already exists
> >
> > This is the exception:
> >
> > Traceback (most recent call last):
> >    File "/sbin/ipa-replica-manage", line 651, in <module>
> >      main()
> >    File "/sbin/ipa-replica-manage", line 648, in main
> >      clean_ruv(realm, args[1], options)
> >    File "/sbin/ipa-replica-manage", line 373, in clean_ruv
> >      thisrepl.cleanallruv(ruv)
> >    File "/usr/lib/python2.7/site-packages/ipaserver/install/replication.py",
> > line 1136, in cleanallruv
> >      self.conn.addEntry(e)
> >    File "/usr/lib/python2.7/site-packages/ipaserver/ipaldap.py", line 503, in
> > addEntry
> >      self.__handle_errors(e, arg_desc=arg_desc)
> >    File "/usr/lib/python2.7/site-packages/ipaserver/ipaldap.py", line 321, in
> > __handle_errors
> >      raise errors.DuplicateEntry()
> > ipalib.errors.DuplicateEntry: This entry already exists
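> >
> > So cleanallruv presumably needs to catch ipalib.errors.DuplicateEntry
> > (which is what ipaldap raises for "entry already exists") rather than the
> > raw python-ldap exception. A sketch of what I had in mind:
> >
> >     from ipalib import errors
> >
> >     try:
> >         self.conn.addEntry(e)
> >     except errors.DuplicateEntry:
> >         # or log it; the point is to bail out before checkTask()
> >         print "CLEANALLRUV task for replica id %d already exists" % replicaId
> >         return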
> >
> > Martin
> >
> 
> Fixed that and a couple of other problems. When doing a disconnect we 
> should not also call clean-ruv.
> 
> I also got tired of seeing crappy error messages so I added a little 
> convert utility.
> 
> rob

ACK. Do you want to wait with the pushing until the winsync issue in
CLEANALLRUV is fixed? This is the ticket I created:
https://fedorahosted.org/389/ticket/450

Martin