[Freeipa-devel] [PATCH] 1031 run cleanallruv task

Martin Kosek mkosek at redhat.com
Thu Sep 6 16:05:18 UTC 2012


On 09/06/2012 05:55 PM, Rob Crittenden wrote:
> Rob Crittenden wrote:
>> Rob Crittenden wrote:
>>> Martin Kosek wrote:
>>>> On 09/05/2012 08:06 PM, Rob Crittenden wrote:
>>>>> Rob Crittenden wrote:
>>>>>> Martin Kosek wrote:
>>>>>>> On 07/05/2012 08:39 PM, Rob Crittenden wrote:
>>>>>>>> Martin Kosek wrote:
>>>>>>>>> On 07/03/2012 04:41 PM, Rob Crittenden wrote:
>>>>>>>>>> Deleting a replica can leave a replication vector (RUV) on the
>>>>>>>>>> other servers.
>>>>>>>>>> This can confuse things if the replica is re-added, and it also
>>>>>>>>>> causes the
>>>>>>>>>> server to calculate changes against a server that may no longer
>>>>>>>>>> exist.
>>>>>>>>>>
>>>>>>>>>> 389-ds-base provides a new task that propagates itself to all
>>>>>>>>>> available replicas to clean this RUV data.
>>>>>>>>>>
>>>>>>>>>> This patch will create this task at deletion time to hopefully
>>>>>>>>>> clean things up.
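>>>>>>>>>>
>>>>>>>>>> For the curious: firing the task is just an LDAP add under
>>>>>>>>>> cn=tasks,cn=config. A rough standalone sketch with example
>>>>>>>>>> values (not the patch code), following the 389-ds-base
>>>>>>>>>> CLEANALLRUV task entry layout:
>>>>>>>>>>
>>>>>>>>>> import ldap
>>>>>>>>>>
>>>>>>>>>> replica_id = 4    # example RID of the deleted master
>>>>>>>>>> conn = ldap.initialize("ldap://localhost:389")
>>>>>>>>>> conn.simple_bind_s("cn=Directory Manager", "Secret123")
>>>>>>>>>> # adding this entry starts the clean task on the server
>>>>>>>>>> conn.add_s("cn=clean %d,cn=cleanallruv,cn=tasks,cn=config" % replica_id, [
>>>>>>>>>>     ('objectclass', ['top', 'extensibleObject']),
>>>>>>>>>>     ('replica-base-dn', ['dc=example,dc=com']),
>>>>>>>>>>     ('replica-id', [str(replica_id)]),
>>>>>>>>>> ])
>>>>>>>>>>
>>>>>>>>>> The server then propagates the clean operation to every replica
>>>>>>>>>> it has an agreement with.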
>>>>>>>>>>
>>>>>>>>>> It isn't perfect. If any replica is down or unavailable at the
>>>>>>>>>> time the cleanruv task fires, and then comes back up, the old
>>>>>>>>>> RUV data may be re-propagated around.
>>>>>>>>>>
>>>>>>>>>> To make things easier in this case I've added two new commands to
>>>>>>>>>> ipa-replica-manage. The first lists the replication ids of all the
>>>>>>>>>> servers we
>>>>>>>>>> have a RUV for. Using this you can call clean_ruv with the
>>>>>>>>>> replication id of a
>>>>>>>>>> server that no longer exists to try the cleanallruv step again.
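>>>>>>>>>>
>>>>>>>>>> Implementation note: the list command can get those ids straight
>>>>>>>>>> from the RUV, which is stored on a special tombstone entry in the
>>>>>>>>>> database. A rough standalone sketch with example values, not the
>>>>>>>>>> actual patch code:
>>>>>>>>>>
>>>>>>>>>> import ldap, re
>>>>>>>>>>
>>>>>>>>>> conn = ldap.initialize("ldap://localhost:389")
>>>>>>>>>> conn.simple_bind_s("cn=Directory Manager", "Secret123")
>>>>>>>>>> # the RUV lives on the database tombstone entry
>>>>>>>>>> entries = conn.search_s(
>>>>>>>>>>     "dc=example,dc=com", ldap.SCOPE_SUBTREE,
>>>>>>>>>>     "(&(nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff)"
>>>>>>>>>>     "(objectclass=nstombstone))", ["nsds50ruv"])
>>>>>>>>>> for val in entries[0][1]["nsds50ruv"]:
>>>>>>>>>>     m = re.match(r"\{replica (\d+) ldap://([^:]+)", val)
>>>>>>>>>>     if m:
>>>>>>>>>>         print "Replica ID %s: %s" % (m.group(1), m.group(2))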
>>>>>>>>>>
>>>>>>>>>> This is quite dangerous though. If you run cleanruv against a
>>>>>>>>>> replica id that
>>>>>>>>>> does exist it can cause a loss of data. I believe I've put in
>>>>>>>>>> enough scary
>>>>>>>>>> warnings about this.
>>>>>>>>>>
>>>>>>>>>> rob
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Good work there, this should make cleaning RUVs much easier than
>>>>>>>>> with the
>>>>>>>>> previous version.
>>>>>>>>>
>>>>>>>>> This is what I found during review:
>>>>>>>>>
>>>>>>>>> 1) The list_ruv and clean_ruv command help is easy to get lost in
>>>>>>>>> the man page. I think it would help if, for example, all the
>>>>>>>>> per-command info was indented; as it stands, a user could easily
>>>>>>>>> overlook the new commands in the man page.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 2) I would rename new commands to clean-ruv and list-ruv to make
>>>>>>>>> them
>>>>>>>>> consistent with the rest of the commands (re-initialize,
>>>>>>>>> force-sync).
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 3) It would be nice to be able to run the clean_ruv command in an
>>>>>>>>> unattended way (for better testing), i.e. respect the --force
>>>>>>>>> option as we already do for ipa-replica-manage del. This fix would
>>>>>>>>> aid test automation in the future.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 4) (minor) The new question (and the del one too) does not react
>>>>>>>>> well to CTRL+D:
>>>>>>>>>
>>>>>>>>> # ipa-replica-manage clean_ruv 3 --force
>>>>>>>>> Clean the Replication Update Vector for
>>>>>>>>> vm-055.idm.lab.bos.redhat.com:389
>>>>>>>>>
>>>>>>>>> Cleaning the wrong replica ID will cause that server to no
>>>>>>>>> longer replicate so it may miss updates while the process
>>>>>>>>> is running. It would need to be re-initialized to maintain
>>>>>>>>> consistency. Be very careful.
>>>>>>>>> Continue to clean? [no]: unexpected error:
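>>>>>>>>>
>>>>>>>>> Presumably CTRL+D just makes the underlying raw_input() raise
>>>>>>>>> EOFError. Something like this around the prompt should be enough
>>>>>>>>> (a sketch, assuming ipautil.user_input lets EOFError through):
>>>>>>>>>
>>>>>>>>> try:
>>>>>>>>>     if not ipautil.user_input("Continue to clean?", False):
>>>>>>>>>         sys.exit("Aborted")
>>>>>>>>> except EOFError:
>>>>>>>>>     # treat a closed stdin (CTRL+D) as "no" instead of a traceback
>>>>>>>>>     sys.exit("Aborted")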
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 5) The help for the clean_ruv command run without its required
>>>>>>>>> parameter is quite confusing, as it reports that the command is
>>>>>>>>> wrong rather than that the parameter is missing:
>>>>>>>>>
>>>>>>>>> # ipa-replica-manage clean_ruv
>>>>>>>>> Usage: ipa-replica-manage [options]
>>>>>>>>>
>>>>>>>>> ipa-replica-manage: error: must provide a command [clean_ruv |
>>>>>>>>> force-sync |
>>>>>>>>> disconnect | connect | del | re-initialize | list | list_ruv]
>>>>>>>>>
>>>>>>>>> It seems you just forgot to specify the error message in the
>>>>>>>>> command
>>>>>>>>> definition
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 6) When the remote replica is down, the clean_ruv command fails
>>>>>>>>> with an
>>>>>>>>> unexpected error:
>>>>>>>>>
>>>>>>>>> [root at vm-086 ~]# ipa-replica-manage clean_ruv 5
>>>>>>>>> Clean the Replication Update Vector for
>>>>>>>>> vm-055.idm.lab.bos.redhat.com:389
>>>>>>>>>
>>>>>>>>> Cleaning the wrong replica ID will cause that server to no
>>>>>>>>> longer replicate so it may miss updates while the process
>>>>>>>>> is running. It would need to be re-initialized to maintain
>>>>>>>>> consistency. Be very careful.
>>>>>>>>> Continue to clean? [no]: y
>>>>>>>>> unexpected error: {'desc': 'Operations error'}
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> /var/log/dirsrv/slapd-IDM-LAB-BOS-REDHAT-COM/errors:
>>>>>>>>> [04/Jul/2012:06:28:16 -0400] NSMMReplicationPlugin - cleanAllRUV_task:
>>>>>>>>> failed to connect to repl agreement connection
>>>>>>>>> (cn=meTovm-055.idm.lab.bos.redhat.com,cn=replica,
>>>>>>>>> cn=dc\3Didm\2Cdc\3Dlab\2Cdc\3Dbos\2Cdc\3Dredhat\2Cdc\3Dcom,
>>>>>>>>> cn=mapping tree,cn=config), error 105
>>>>>>>>> [04/Jul/2012:06:28:16 -0400] NSMMReplicationPlugin - cleanAllRUV_task:
>>>>>>>>> replica (cn=meTovm-055.idm.lab.bos.redhat.com,cn=replica,
>>>>>>>>> cn=dc\3Didm\2Cdc\3Dlab\2Cdc\3Dbos\2Cdc\3Dredhat\2Cdc\3Dcom,
>>>>>>>>> cn=mapping tree,cn=config) has not been cleaned.  You will need to
>>>>>>>>> rerun the CLEANALLRUV task on this replica.
>>>>>>>>> [04/Jul/2012:06:28:16 -0400] NSMMReplicationPlugin - cleanAllRUV_task:
>>>>>>>>> Task failed (1)
>>>>>>>>>
>>>>>>>>> In this case I think we should inform the user that the command
>>>>>>>>> failed, possibly because some replicas are disconnected, and
>>>>>>>>> suggest bringing the replicas back online and trying again.
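>>>>>>>>>
>>>>>>>>> Something along these lines in clean_ruv, assuming the raw
>>>>>>>>> python-ldap error reaches the caller as the output above suggests:
>>>>>>>>>
>>>>>>>>> try:
>>>>>>>>>     thisrepl.cleanallruv(ruv)
>>>>>>>>> except ldap.OPERATIONS_ERROR:
>>>>>>>>>     sys.exit("CLEANALLRUV task failed, possibly because one or "
>>>>>>>>>              "more replicas are down or unreachable.\n"
>>>>>>>>>              "Bring them back online and try again.")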
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 7) (minor) "pass" is now redundant in replication.py:
>>>>>>>>> +        except ldap.INSUFFICIENT_ACCESS:
>>>>>>>>> +            # We can't make the server we're removing read-only
>>>>>>>>> but
>>>>>>>>> +            # this isn't a show-stopper
>>>>>>>>> +            root_logger.debug("No permission to switch replica to read-only, continuing anyway")
>>>>>>>>> +            pass
>>>>>>>>>
>>>>>>>>
>>>>>>>> I think this addresses everything.
>>>>>>>>
>>>>>>>> rob
>>>>>>>
>>>>>>> Thanks, almost there! I just found one more issue which needs to be
>>>>>>> fixed
>>>>>>> before we push:
>>>>>>>
>>>>>>> # ipa-replica-manage del vm-055.idm.lab.bos.redhat.com --force
>>>>>>> Directory Manager password:
>>>>>>>
>>>>>>> Unable to connect to replica vm-055.idm.lab.bos.redhat.com, forcing
>>>>>>> removal
>>>>>>> Failed to get data from 'vm-055.idm.lab.bos.redhat.com': {'desc':
>>>>>>> "Can't
>>>>>>> contact LDAP server"}
>>>>>>> Forcing removal on 'vm-086.idm.lab.bos.redhat.com'
>>>>>>>
>>>>>>> There were issues removing a connection: %d format: a number is
>>>>>>> required, not str
>>>>>>>
>>>>>>> Failed to get data from 'vm-055.idm.lab.bos.redhat.com': {'desc':
>>>>>>> "Can't
>>>>>>> contact LDAP server"}
>>>>>>>
>>>>>>> This is a traceback I retrieved:
>>>>>>> Traceback (most recent call last):
>>>>>>>     File "/sbin/ipa-replica-manage", line 425, in del_master
>>>>>>>       del_link(realm, r, hostname, options.dirman_passwd, force=True)
>>>>>>>     File "/sbin/ipa-replica-manage", line 271, in del_link
>>>>>>>       repl1.cleanallruv(replica_id)
>>>>>>>     File
>>>>>>> "/usr/lib/python2.7/site-packages/ipaserver/install/replication.py",
>>>>>>> line 1094, in cleanallruv
>>>>>>>       root_logger.debug("Creating CLEANALLRUV task for replica id %d" % replicaId)
>>>>>>>
>>>>>>>
>>>>>>> The problem here is that you don't convert replica_id to int in this
>>>>>>> part:
>>>>>>> +    replica_id = None
>>>>>>> +    if repl2:
>>>>>>> +        replica_id = repl2._get_replica_id(repl2.conn, None)
>>>>>>> +    else:
>>>>>>> +        servers = get_ruv(realm, replica1, dirman_passwd)
>>>>>>> +        for (netloc, rid) in servers:
>>>>>>> +            if netloc.startswith(replica2):
>>>>>>> +                replica_id = rid
>>>>>>> +                break
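>>>>>>>
>>>>>>> The rid values parsed out of the RUV are strings, so a cast is all
>>>>>>> that is needed, e.g.:
>>>>>>>
>>>>>>> +        for (netloc, rid) in servers:
>>>>>>> +            if netloc.startswith(replica2):
>>>>>>> +                replica_id = int(rid)
>>>>>>> +                break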
>>>>>>>
>>>>>>> Martin
>>>>>>>
>>>>>>
>>>>>> Updated patch using new mechanism in 389-ds-base. This should more
>>>>>> thoroughly clean out RUV data when a replica is being deleted, and
>>>>>> provide for a way to delete RUV data afterwards too if necessary.
>>>>>>
>>>>>> rob
>>>>>
>>>>> Rebased patch
>>>>>
>>>>> rob
>>>>>
>>>>
>>>> 0) As I wrote in the review of your patch 1041, the changelog entry
>>>> slipped elsewhere.
>>>>
>>>> 1) The following KeyboardInterrupt except clause looks suspicious. I
>>>> know why you have it there, but since swallowing KeyboardInterrupt is
>>>> generally a bad thing to do, a comment explaining why it is needed
>>>> would be useful.
>>>>
>>>> @@ -256,6 +263,17 @@ def del_link(realm, replica1, replica2, dirman_passwd, force=False):
>>>>       repl1.delete_agreement(replica2)
>>>>       repl1.delete_referral(replica2)
>>>>
>>>> +    if type1 == replication.IPA_REPLICA:
>>>> +        if repl2:
>>>> +            ruv = repl2._get_replica_id(repl2.conn, None)
>>>> +        else:
>>>> +            ruv = get_ruv_by_host(realm, replica1, replica2, dirman_passwd)
>>>> +
>>>> +        try:
>>>> +            repl1.cleanallruv(ruv)
>>>> +        except KeyboardInterrupt:
>>>> +            pass
>>>> +
>>>>
>>>> Maybe you just wanted to do some cleanup and then "raise" again?
>>>
>>> No, it is there because it is safe to break out of it. The task will
>>> continue to run. I added some verbiage.
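>>>
>>> Something along these lines makes the intent explicit:
>>>
>>> try:
>>>     repl1.cleanallruv(ruv)
>>> except KeyboardInterrupt:
>>>     # Safe to interrupt: the CLEANALLRUV task has already been
>>>     # created on the server and will keep running in the background.
>>>     pass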
>>>
>>>>
>>>> 2) This is related to 1): "ipa-replica-manage del" may wait
>>>> indefinitely when some remote replica is down, right?
>>>>
>>>> # ipa-replica-manage del vm-055.idm.lab.bos.redhat.com
>>>> Deleting a master is irreversible.
>>>> To reconnect to the remote master you will need to prepare a new
>>>> replica file
>>>> and re-install.
>>>> Continue to delete? [no]: y
>>>> ipa: INFO: Setting agreement
>>>> cn=meTovm-086.idm.lab.bos.redhat.com,cn=replica,cn=dc\=idm\,dc\=lab\,dc\=bos\,dc\=redhat\,dc\=com,cn=mapping tree,cn=config
>>>> schedule to 2358-2359 0 to force synch
>>>> ipa: INFO: Deleting schedule 2358-2359 0 from agreement
>>>> cn=meTovm-086.idm.lab.bos.redhat.com,cn=replica,cn=dc\=idm\,dc\=lab\,dc\=bos\,dc\=redhat\,dc\=com,cn=mapping tree,cn=config
>>>> ipa: INFO: Replication Update in progress: FALSE: status: 0 Replica
>>>> acquired
>>>> successfully: Incremental update succeeded: start: 0: end: 0
>>>> Background task created to clean replication data
>>>>
>>>> ... after about a minute I hit CTRL+C
>>>>
>>>> ^CDeleted replication agreement from 'vm-086.idm.lab.bos.redhat.com' to
>>>> 'vm-055.idm.lab.bos.redhat.com'
>>>> Failed to cleanup vm-055.idm.lab.bos.redhat.com DNS entries: NS record
>>>> does not contain 'vm-055.idm.lab.bos.redhat.com.'
>>>> You may need to manually remove them from the tree
>>>>
>>>> I think it would be better to inform the user that some remote
>>>> replica is down, or at least that we are waiting for the task to
>>>> complete. Something like this:
>>>>
>>>> # ipa-replica-manage del vm-055.idm.lab.bos.redhat.com
>>>> ...
>>>> Background task created to clean replication data
>>>> Replication data clean up may take a very long time if some replica
>>>> is unreachable
>>>> Hit CTRL+C to interrupt the wait
>>>> ^C Clean up wait interrupted
>>>> ....
>>>> [continue with del]
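>>>>
>>>> Roughly, around the wait (a sketch, assuming the wait happens in the
>>>> checkTask(dn, dowait=True) call mentioned in 3) below):
>>>>
>>>> print "Background task created to clean replication data"
>>>> print "Replication data clean up may take a very long time if some"
>>>> print "replica is unreachable"
>>>> print "Hit CTRL+C to interrupt the wait"
>>>> try:
>>>>     self.conn.checkTask(dn, dowait=True)
>>>> except KeyboardInterrupt:
>>>>     # the task keeps running on the server; continue with the delete
>>>>     print "Clean up wait interrupted"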
>>>
>>> Yup, did this in #1.
>>>
>>>>
>>>> 3) (minor) When there is a cleanruv task running and you run
>>>> "ipa-replica-manage del", there is an unexpected error message caused
>>>> by a duplicate task object in LDAP:
>>>>
>>>> # ipa-replica-manage del vm-072.idm.lab.bos.redhat.com --force
>>>> Unable to connect to replica vm-072.idm.lab.bos.redhat.com, forcing
>>>> removal
>>>> FAIL
>>>> Failed to get data from 'vm-072.idm.lab.bos.redhat.com': {'desc': "Can't
>>>> contact LDAP server"}
>>>> Forcing removal on 'vm-086.idm.lab.bos.redhat.com'
>>>>
>>>> There were issues removing a connection: This entry already exists
>>>> <<<<<<<<<
>>>>
>>>> Failed to get data from 'vm-072.idm.lab.bos.redhat.com': {'desc': "Can't
>>>> contact LDAP server"}
>>>> Failed to cleanup vm-072.idm.lab.bos.redhat.com DNS entries: NS record
>>>> does not contain 'vm-072.idm.lab.bos.redhat.com.'
>>>> You may need to manually remove them from the tree
>>>>
>>>>
>>>> I think it should be enough to catch "entry already exists" in the
>>>> cleanallruv function and, in such a case, print a relevant error
>>>> message and bail out. That way, self.conn.checkTask(dn, dowait=True)
>>>> would not be called either.
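>>>>
>>>> I.e. something like this in cleanallruv (a sketch; the exact
>>>> exception class is whatever addEntry raises for an existing entry):
>>>>
>>>>     try:
>>>>         self.conn.addEntry(e)
>>>>     except errors.DuplicateEntry:
>>>>         print "A CLEANALLRUV task for this replica is already running"
>>>>         return    # don't wait on a task we didn't create
>>>>     self.conn.checkTask(dn, dowait=True)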
>>>
>>> Good catch, fixed.
>>>
>>>>
>>>>
>>>> 4) (minor): In the make_readonly function, there is a redundant
>>>> "pass" statement:
>>>>
>>>> +    def make_readonly(self):
>>>> +        """
>>>> +        Make the current replication agreement read-only.
>>>> +        """
>>>> +        dn = DN(('cn', 'userRoot'), ('cn', 'ldbm database'),
>>>> +                ('cn', 'plugins'), ('cn', 'config'))
>>>> +
>>>> +        mod = [(ldap.MOD_REPLACE, 'nsslapd-readonly', 'on')]
>>>> +        try:
>>>> +            self.conn.modify_s(dn, mod)
>>>> +        except ldap.INSUFFICIENT_ACCESS:
>>>> +            # We can't make the server we're removing read-only but
>>>> +            # this isn't a show-stopper
>>>> +            root_logger.debug("No permission to switch replica to read-only, continuing anyway")
>>>> +            pass         <<<<<<<<<<<<<<<
>>>
>>> Yeah, this is one of my common mistakes. I put in a pass initially, then
>>> add logging in front of it and forget to delete the pass. It's gone now.
>>>
>>>>
>>>>
>>>> 5) In clean_ruv, I think allowing a --force option to bypass the
>>>> user_input
>>>> would be helpful (at least for test automation):
>>>>
>>>> +    if not ipautil.user_input("Continue to clean?", False):
>>>> +        sys.exit("Aborted")
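>>>>
>>>> I.e. (assuming the already-parsed options object is in scope):
>>>>
>>>> +    if not options.force and not ipautil.user_input("Continue to clean?", False):
>>>> +        sys.exit("Aborted")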
>>>
>>> Yup, added.
>>>
>>> rob
>>
>> Slightly revised patch. I still had a window open with one unsaved change.
>>
>> rob
>>
> 
> Apparently there were two unsaved changes, one of which was lost. This adds in
> the 'entry already exists' fix.
> 
> rob
> 

Just one last thing (otherwise the patch is OK) - I don't think this is what we
want :-)

# ipa-replica-manage clean-ruv 8
Clean the Replication Update Vector for vm-055.idm.lab.bos.redhat.com:389

Cleaning the wrong replica ID will cause that server to no
longer replicate so it may miss updates while the process
is running. It would need to be re-initialized to maintain
consistency. Be very careful.
Continue to clean? [no]: y   <<<<<<
Aborted


Nor this exception (you are checking for the wrong exception):

# ipa-replica-manage clean-ruv 8
Clean the Replication Update Vector for vm-055.idm.lab.bos.redhat.com:389

Cleaning the wrong replica ID will cause that server to no
longer replicate so it may miss updates while the process
is running. It would need to be re-initialized to maintain
consistency. Be very careful.
Continue to clean? [no]:
unexpected error: This entry already exists

This is the exception:

Traceback (most recent call last):
  File "/sbin/ipa-replica-manage", line 651, in <module>
    main()
  File "/sbin/ipa-replica-manage", line 648, in main
    clean_ruv(realm, args[1], options)
  File "/sbin/ipa-replica-manage", line 373, in clean_ruv
    thisrepl.cleanallruv(ruv)
  File "/usr/lib/python2.7/site-packages/ipaserver/install/replication.py",
line 1136, in cleanallruv
    self.conn.addEntry(e)
  File "/usr/lib/python2.7/site-packages/ipaserver/ipaldap.py", line 503, in
addEntry
    self.__handle_errors(e, arg_desc=arg_desc)
  File "/usr/lib/python2.7/site-packages/ipaserver/ipaldap.py", line 321, in
__handle_errors
    raise errors.DuplicateEntry()
ipalib.errors.DuplicateEntry: This entry already exists
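
ipaldap's __handle_errors re-raises ldap.ALREADY_EXISTS as ipalib's
DuplicateEntry (see the last two frames above), so that is the class the
handler has to catch, e.g.:

    from ipalib import errors

    try:
        self.conn.addEntry(e)
    except errors.DuplicateEntry:
        print "A CLEANALLRUV task for this replica is already running"
        return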

Martin



