[Freeipa-devel] [PATCH] 1031 run cleanallruv task

Rob Crittenden rcritten at redhat.com
Thu Sep 6 15:55:00 UTC 2012


Rob Crittenden wrote:
> Rob Crittenden wrote:
>> Martin Kosek wrote:
>>> On 09/05/2012 08:06 PM, Rob Crittenden wrote:
>>>> Rob Crittenden wrote:
>>>>> Martin Kosek wrote:
>>>>>> On 07/05/2012 08:39 PM, Rob Crittenden wrote:
>>>>>>> Martin Kosek wrote:
>>>>>>>> On 07/03/2012 04:41 PM, Rob Crittenden wrote:
>>>>>>>>> Deleting a replica can leave a Replication Update Vector (RUV)
>>>>>>>>> on the other servers. This can confuse things if the replica
>>>>>>>>> is re-added, and it also causes the server to calculate
>>>>>>>>> changes against a server that may no longer exist.
>>>>>>>>>
>>>>>>>>> 389-ds-base provides a new task that propagates itself to all
>>>>>>>>> available replicas to clean this RUV data.
>>>>>>>>>
>>>>>>>>> This patch will create this task at deletion time to hopefully
>>>>>>>>> clean things up.
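>>>>>>>>>
>>>>>>>>> For the curious, the task is just an entry added under
>>>>>>>>> cn=cleanallruv,cn=tasks,cn=config which 389-ds then picks up
>>>>>>>>> and propagates. A minimal python-ldap sketch of creating one
>>>>>>>>> (the suffix, replica id and credentials below are
>>>>>>>>> illustrative, not taken from the patch):
>>>>>>>>>
>>>>>>>>>     import ldap
>>>>>>>>>     import ldap.modlist
>>>>>>>>>
>>>>>>>>>     conn = ldap.initialize('ldap://localhost:389')
>>>>>>>>>     conn.simple_bind_s('cn=Directory Manager', 'password')
>>>>>>>>>
>>>>>>>>>     replica_id = 4
>>>>>>>>>     dn = 'cn=clean %d,cn=cleanallruv,cn=tasks,cn=config' % replica_id
>>>>>>>>>     entry = {
>>>>>>>>>         'objectclass': ['top', 'extensibleObject'],
>>>>>>>>>         'cn': ['clean %d' % replica_id],
>>>>>>>>>         'replica-base-dn': ['dc=example,dc=com'],
>>>>>>>>>         'replica-id': [str(replica_id)],
>>>>>>>>>     }
>>>>>>>>>     # 389-ds picks the entry up and propagates the clean
>>>>>>>>>     # operation to every replica it can reach
>>>>>>>>>     conn.add_s(dn, ldap.modlist.addModlist(entry))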
>>>>>>>>>
>>>>>>>>> It isn't perfect. If any replica is down or unavailable at the
>>>>>>>>> time the cleanruv task fires, and then comes back up, the old
>>>>>>>>> RUV data may be re-propagated around.
>>>>>>>>>
>>>>>>>>> To make things easier in this case I've added two new commands
>>>>>>>>> to ipa-replica-manage. The first lists the replication ids of
>>>>>>>>> all the servers we have a RUV for. Using this you can call
>>>>>>>>> clean_ruv with the replication id of a server that no longer
>>>>>>>>> exists to try the cleanallruv step again.
>>>>>>>>>
>>>>>>>>> This is quite dangerous though. If you run cleanruv against a
>>>>>>>>> replica id that still exists it can cause a loss of data. I
>>>>>>>>> believe I've put in enough scary warnings about this.
>>>>>>>>>
>>>>>>>>> rob
>>>>>>>>>
>>>>>>>>
>>>>>>>> Good work there; this should make cleaning RUVs much easier
>>>>>>>> than with the previous version.
>>>>>>>>
>>>>>>>> This is what I found during review:
>>>>>>>>
>>>>>>>> 1) The list_ruv and clean_ruv command help is quite lost in the
>>>>>>>> man page. I think it would help if, for example, all the
>>>>>>>> per-command info were indented. As it is, a user could easily
>>>>>>>> overlook the new commands in the man page.
>>>>>>>>
>>>>>>>>
>>>>>>>> 2) I would rename the new commands to clean-ruv and list-ruv to
>>>>>>>> make them consistent with the rest of the commands
>>>>>>>> (re-initialize, force-sync).
>>>>>>>>
>>>>>>>>
>>>>>>>> 3) It would be nice to be able to run the clean_ruv command in
>>>>>>>> an unattended way (for better testing), i.e. respect the
>>>>>>>> --force option as we already do for ipa-replica-manage del.
>>>>>>>> This fix would aid test automation in the future.
>>>>>>>>
>>>>>>>>
>>>>>>>> 4) (minor) The new question (and the del one too) does not
>>>>>>>> react well to CTRL+D:
>>>>>>>>
>>>>>>>> # ipa-replica-manage clean_ruv 3 --force
>>>>>>>> Clean the Replication Update Vector for
>>>>>>>> vm-055.idm.lab.bos.redhat.com:389
>>>>>>>>
>>>>>>>> Cleaning the wrong replica ID will cause that server to no
>>>>>>>> longer replicate so it may miss updates while the process
>>>>>>>> is running. It would need to be re-initialized to maintain
>>>>>>>> consistency. Be very careful.
>>>>>>>> Continue to clean? [no]: unexpected error:
>>>>>>>>
>>>>>>>>
>>>>>>>> 5) The help for the clean_ruv command without its required
>>>>>>>> parameter is quite confusing, as it reports that the command is
>>>>>>>> wrong rather than the parameter:
>>>>>>>>
>>>>>>>> # ipa-replica-manage clean_ruv
>>>>>>>> Usage: ipa-replica-manage [options]
>>>>>>>>
>>>>>>>> ipa-replica-manage: error: must provide a command [clean_ruv |
>>>>>>>> force-sync |
>>>>>>>> disconnect | connect | del | re-initialize | list | list_ruv]
>>>>>>>>
>>>>>>>> It seems you just forgot to specify the error message in the
>>>>>>>> command definition.
>>>>>>>>
>>>>>>>>
>>>>>>>> 6) When the remote replica is down, the clean_ruv command fails
>>>>>>>> with an
>>>>>>>> unexpected error:
>>>>>>>>
>>>>>>>> [root@vm-086 ~]# ipa-replica-manage clean_ruv 5
>>>>>>>> Clean the Replication Update Vector for
>>>>>>>> vm-055.idm.lab.bos.redhat.com:389
>>>>>>>>
>>>>>>>> Cleaning the wrong replica ID will cause that server to no
>>>>>>>> longer replicate so it may miss updates while the process
>>>>>>>> is running. It would need to be re-initialized to maintain
>>>>>>>> consistency. Be very careful.
>>>>>>>> Continue to clean? [no]: y
>>>>>>>> unexpected error: {'desc': 'Operations error'}
>>>>>>>>
>>>>>>>>
>>>>>>>> /var/log/dirsrv/slapd-IDM-LAB-BOS-REDHAT-COM/errors:
>>>>>>>> [04/Jul/2012:06:28:16 -0400] NSMMReplicationPlugin - cleanAllRUV_task:
>>>>>>>> failed to connect to repl agreement connection
>>>>>>>> (cn=meTovm-055.idm.lab.bos.redhat.com,cn=replica,
>>>>>>>> cn=dc\3Didm\2Cdc\3Dlab\2Cdc\3Dbos\2Cdc\3Dredhat\2Cdc\3Dcom,
>>>>>>>> cn=mapping tree,cn=config), error 105
>>>>>>>> [04/Jul/2012:06:28:16 -0400] NSMMReplicationPlugin - cleanAllRUV_task:
>>>>>>>> replica (cn=meTovm-055.idm.lab.bos.redhat.com,cn=replica,
>>>>>>>> cn=dc\3Didm\2Cdc\3Dlab\2Cdc\3Dbos\2Cdc\3Dredhat\2Cdc\3Dcom,
>>>>>>>> cn=mapping tree,cn=config) has not been cleaned. You will need to
>>>>>>>> rerun the CLEANALLRUV task on this replica.
>>>>>>>> [04/Jul/2012:06:28:16 -0400] NSMMReplicationPlugin - cleanAllRUV_task:
>>>>>>>> Task failed (1)
>>>>>>>>
>>>>>>>> In this case I think we should inform the user that the command
>>>>>>>> failed, possibly because of disconnected replicas, and suggest
>>>>>>>> that they bring the replicas back up and try again.
>>>>>>>>
>>>>>>>>
>>>>>>>> 7) (minor) "pass" is now redundant in replication.py:
>>>>>>>> +        except ldap.INSUFFICIENT_ACCESS:
>>>>>>>> +            # We can't make the server we're removing read-only but
>>>>>>>> +            # this isn't a show-stopper
>>>>>>>> +            root_logger.debug("No permission to switch replica to read-only, continuing anyway")
>>>>>>>> +            pass
>>>>>>>>
>>>>>>>
>>>>>>> I think this addresses everything.
>>>>>>>
>>>>>>> rob
>>>>>>
>>>>>> Thanks, almost there! I just found one more issue which needs to be
>>>>>> fixed
>>>>>> before we push:
>>>>>>
>>>>>> # ipa-replica-manage del vm-055.idm.lab.bos.redhat.com --force
>>>>>> Directory Manager password:
>>>>>>
>>>>>> Unable to connect to replica vm-055.idm.lab.bos.redhat.com, forcing
>>>>>> removal
>>>>>> Failed to get data from 'vm-055.idm.lab.bos.redhat.com': {'desc':
>>>>>> "Can't
>>>>>> contact LDAP server"}
>>>>>> Forcing removal on 'vm-086.idm.lab.bos.redhat.com'
>>>>>>
>>>>>> There were issues removing a connection: %d format: a number is
>>>>>> required, not str
>>>>>>
>>>>>> Failed to get data from 'vm-055.idm.lab.bos.redhat.com': {'desc':
>>>>>> "Can't
>>>>>> contact LDAP server"}
>>>>>>
>>>>>> This is a traceback I retrieved:
>>>>>> Traceback (most recent call last):
>>>>>>     File "/sbin/ipa-replica-manage", line 425, in del_master
>>>>>>       del_link(realm, r, hostname, options.dirman_passwd, force=True)
>>>>>>     File "/sbin/ipa-replica-manage", line 271, in del_link
>>>>>>       repl1.cleanallruv(replica_id)
>>>>>>     File "/usr/lib/python2.7/site-packages/ipaserver/install/replication.py", line 1094, in cleanallruv
>>>>>>       root_logger.debug("Creating CLEANALLRUV task for replica id %d" % replicaId)
>>>>>>
>>>>>>
>>>>>> The problem here is that you don't convert replica_id to int in this
>>>>>> part:
>>>>>> +    replica_id = None
>>>>>> +    if repl2:
>>>>>> +        replica_id = repl2._get_replica_id(repl2.conn, None)
>>>>>> +    else:
>>>>>> +        servers = get_ruv(realm, replica1, dirman_passwd)
>>>>>> +        for (netloc, rid) in servers:
>>>>>> +            if netloc.startswith(replica2):
>>>>>> +                replica_id = rid
>>>>>> +                break
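>>>>>>
>>>>>> A minimal sketch of the conversion I mean (the helper name is
>>>>>> mine, not from the patch):
>>>>>>
>>>>>>     def find_replica_id(servers, replica2):
>>>>>>         # servers is the list of (netloc, rid) tuples returned
>>>>>>         # by get_ruv(); rid arrives as a string from LDAP
>>>>>>         for (netloc, rid) in servers:
>>>>>>             if netloc.startswith(replica2):
>>>>>>                 # cast so the later "%d" formatting (and the
>>>>>>                 # task entry) get an int
>>>>>>                 return int(rid)
>>>>>>         return None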
>>>>>>
>>>>>> Martin
>>>>>>
>>>>>
>>>>> Updated patch using the new mechanism in 389-ds-base. This should
>>>>> more thoroughly clean out RUV data when a replica is being
>>>>> deleted, and provide a way to delete RUV data afterwards too if
>>>>> necessary.
>>>>>
>>>>> rob
>>>>
>>>> Rebased patch
>>>>
>>>> rob
>>>>
>>>
>>> 0) As I wrote in the review for your patch 1041, the changelog entry
>>> slipped elsewhere.
>>>
>>> 1) The following KeyboardInterrupt except clause looks suspicious. I
>>> know why you have it there, but since it is generally a bad thing to
>>> do, a comment explaining why it is needed would be useful.
>>>
>>> @@ -256,6 +263,17 @@ def del_link(realm, replica1, replica2, dirman_passwd, force=False):
>>>       repl1.delete_agreement(replica2)
>>>       repl1.delete_referral(replica2)
>>>
>>> +    if type1 == replication.IPA_REPLICA:
>>> +        if repl2:
>>> +            ruv = repl2._get_replica_id(repl2.conn, None)
>>> +        else:
>>> +            ruv = get_ruv_by_host(realm, replica1, replica2, dirman_passwd)
>>> +
>>> +        try:
>>> +            repl1.cleanallruv(ruv)
>>> +        except KeyboardInterrupt:
>>> +            pass
>>> +
>>>
>>> Maybe you just wanted to do some cleanup and then "raise" again?
>>
>> No, it is there because it is safe to break out of it. The task will
>> continue to run. I added some verbiage.
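>>
>> The shape is roughly this (a sketch of the idea, not the exact
>> patch text):
>>
>>     try:
>>         repl1.cleanallruv(ruv)
>>     except KeyboardInterrupt:
>>         # The CLEANALLRUV task runs inside 389-ds and keeps going
>>         # even if we stop waiting on it, so swallowing Ctrl-C here
>>         # just lets the user skip the wait and carry on with the
>>         # deletion.
>>         pass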
>>
>>>
>>> 2) This is related to 1): when some remote replica is down,
>>> "ipa-replica-manage del" may wait indefinitely, right?
>>>
>>> # ipa-replica-manage del vm-055.idm.lab.bos.redhat.com
>>> Deleting a master is irreversible.
>>> To reconnect to the remote master you will need to prepare a new
>>> replica file
>>> and re-install.
>>> Continue to delete? [no]: y
>>> ipa: INFO: Setting agreement
>>> cn=meTovm-086.idm.lab.bos.redhat.com,cn=replica,cn=dc\=idm\,dc\=lab\,dc\=bos\,dc\=redhat\,dc\=com,cn=mapping tree,cn=config
>>> schedule to 2358-2359 0 to force synch
>>> ipa: INFO: Deleting schedule 2358-2359 0 from agreement
>>> cn=meTovm-086.idm.lab.bos.redhat.com,cn=replica,cn=dc\=idm\,dc\=lab\,dc\=bos\,dc\=redhat\,dc\=com,cn=mapping tree,cn=config
>>> ipa: INFO: Replication Update in progress: FALSE: status: 0 Replica acquired
>>> successfully: Incremental update succeeded: start: 0: end: 0
>>> Background task created to clean replication data
>>>
>>> ... after about a minute I hit CTRL+C
>>>
>>> ^CDeleted replication agreement from 'vm-086.idm.lab.bos.redhat.com' to
>>> 'vm-055.idm.lab.bos.redhat.com'
>>> Failed to cleanup vm-055.idm.lab.bos.redhat.com DNS entries: NS record
>>> does not
>>> contain 'vm-055.idm.lab.bos.redhat.com.'
>>> You may need to manually remove them from the tree
>>>
>>> I think it would be better to inform the user that some remote
>>> replica is down, or at least that we are waiting for the task to
>>> complete. Something like this:
>>>
>>> # ipa-replica-manage del vm-055.idm.lab.bos.redhat.com
>>> ...
>>> Background task created to clean replication data
>>> Replication data clean up may take a very long time if some replica
>>> is unreachable
>>> Hit CTRL+C to interrupt the wait
>>> ^C Clean up wait interrupted
>>> ....
>>> [continue with del]
>>
>> Yup, did this in #1.
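>>
>> The wait now looks roughly like this (a sketch; the messages are
>> paraphrased, not the exact patch text):
>>
>>     print "Background task created to clean replication data"
>>     print "This may take a while; press Ctrl-C to stop waiting"
>>     try:
>>         # block until the task entry reports completion
>>         self.conn.checkTask(dn, dowait=True)
>>     except KeyboardInterrupt:
>>         print "Wait for task interrupted; the task itself keeps running"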
>>
>>>
>>> 3) (minor) When there is a cleanruv task running and you run
>>> "ipa-replica-manage del", there is an unexpected error message
>>> caused by a duplicate task object in LDAP:
>>>
>>> # ipa-replica-manage del vm-072.idm.lab.bos.redhat.com --force
>>> Unable to connect to replica vm-072.idm.lab.bos.redhat.com, forcing
>>> removal
>>> FAIL
>>> Failed to get data from 'vm-072.idm.lab.bos.redhat.com': {'desc': "Can't
>>> contact LDAP server"}
>>> Forcing removal on 'vm-086.idm.lab.bos.redhat.com'
>>>
>>> There were issues removing a connection: This entry already exists
>>> <<<<<<<<<
>>>
>>> Failed to get data from 'vm-072.idm.lab.bos.redhat.com': {'desc': "Can't
>>> contact LDAP server"}
>>> Failed to cleanup vm-072.idm.lab.bos.redhat.com DNS entries: NS record
>>> does not
>>> contain 'vm-072.idm.lab.bos.redhat.com.'
>>> You may need to manually remove them from the tree
>>>
>>>
>>> I think it should be enough to just catch "entry already exists" in
>>> the cleanallruv function, and in such a case print a relevant error
>>> message and bail out. That way, self.conn.checkTask(dn, dowait=True)
>>> would not be called either.
>>
>> Good catch, fixed.
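>>
>> For the record, the idea is roughly (a sketch, not the patch
>> itself; the entry and variable names are illustrative):
>>
>>     try:
>>         self.conn.add_s(dn, ldap.modlist.addModlist(entry))
>>     except ldap.ALREADY_EXISTS:
>>         # A CLEANALLRUV task for this replica id is already
>>         # running; report it and skip waiting on a duplicate task
>>         print "A CLEANALLRUV task for replica id %d is already running" % replicaId
>>         return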
>>
>>>
>>>
>>> 4) (minor): In the make_readonly function, there is a redundant
>>> "pass" statement:
>>>
>>> +    def make_readonly(self):
>>> +        """
>>> +        Make the current replication agreement read-only.
>>> +        """
>>> +        dn = DN(('cn', 'userRoot'), ('cn', 'ldbm database'),
>>> +                ('cn', 'plugins'), ('cn', 'config'))
>>> +
>>> +        mod = [(ldap.MOD_REPLACE, 'nsslapd-readonly', 'on')]
>>> +        try:
>>> +            self.conn.modify_s(dn, mod)
>>> +        except ldap.INSUFFICIENT_ACCESS:
>>> +            # We can't make the server we're removing read-only but
>>> +            # this isn't a show-stopper
>>> +            root_logger.debug("No permission to switch replica to read-only, continuing anyway")
>>> +            pass         <<<<<<<<<<<<<<<
>>
>> Yeah, this is one of my common mistakes. I put in a pass initially,
>> then added logging in front of it and forgot to delete the pass.
>> It's gone now.
>>
>>>
>>>
>>> 5) In clean_ruv, I think allowing a --force option to bypass the
>>> user_input
>>> would be helpful (at least for test automation):
>>>
>>> +    if not ipautil.user_input("Continue to clean?", False):
>>> +        sys.exit("Aborted")
>>
>> Yup, added.
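>>
>> i.e. something along these lines (sketch):
>>
>>     if not options.force and not ipautil.user_input("Continue to clean?", False):
>>         sys.exit("Aborted")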
>>
>> rob
>
> Slightly revised patch. I still had a window open with one unsaved change.
>
> rob
>

Apparently there were two unsaved changes, one of which was lost. This 
adds in the 'entry already exists' fix.

rob

-------------- next part --------------
A non-text attachment was scrubbed...
Name: freeipa-rcrit-1031-7-cleanruv.patch
Type: text/x-diff
Size: 14819 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/freeipa-devel/attachments/20120906/6794a8f2/attachment.bin>

