[Freeipa-devel] [PATCH] 1031 run cleanallruv task

Rob Crittenden rcritten at redhat.com
Thu Sep 6 15:19:20 UTC 2012


Martin Kosek wrote:
> On 09/05/2012 08:06 PM, Rob Crittenden wrote:
>> Rob Crittenden wrote:
>>> Martin Kosek wrote:
>>>> On 07/05/2012 08:39 PM, Rob Crittenden wrote:
>>>>> Martin Kosek wrote:
>>>>>> On 07/03/2012 04:41 PM, Rob Crittenden wrote:
>>>>>>> Deleting a replica can leave a replication vector (RUV) on the
>>>>>>> other servers.
>>>>>>> This can confuse things if the replica is re-added, and it also
>>>>>>> causes the
>>>>>>> server to calculate changes against a server that may no longer exist.
>>>>>>>
>>>>>>> 389-ds-base provides a new task that self-propogates itself to all
>>>>>>> available
>>>>>>> replicas to clean this RUV data.
>>>>>>>
>>>>>>> This patch will create this task at deletion time to hopefully
>>>>>>> clean things up.
>>>>>>>
>>>>>>> It isn't perfect. If any replica is down or unavailable at the time
>>>>>>> the
>>>>>>> cleanruv task fires, and then comes back up, the old RUV data may be
>>>>>>> re-propogated around.
>>>>>>>
>>>>>>> To make things easier in this case I've added two new commands to
>>>>>>> ipa-replica-manage. The first lists the replication ids of all the
>>>>>>> servers we
>>>>>>> have a RUV for. Using this you can call clean_ruv with the
>>>>>>> replication id of a
>>>>>>> server that no longer exists to try the cleanallruv step again.
>>>>>>>
>>>>>>> This is quite dangerous though. If you run cleanruv against a
>>>>>>> replica id that
>>>>>>> does exist it can cause a loss of data. I believe I've put in
>>>>>>> enough scary
>>>>>>> warnings about this.
>>>>>>>
>>>>>>> rob
>>>>>>>
>>>>>>
>>>>>> Good work there, this should make cleaning RUVs much easier than
>>>>>> with the
>>>>>> previous version.
>>>>>>
>>>>>> This is what I found during review:
>>>>>>
>>>>>> 1) list_ruv and clean_ruv command help in man is quite lost. I think
>>>>>> it would
>>>>>> help if we for example have all info for commands indented. This way
>>>>>> user could
>>>>>> simply over-look the new commands in the man page.
>>>>>>
>>>>>>
>>>>>> 2) I would rename new commands to clean-ruv and list-ruv to make them
>>>>>> consistent with the rest of the commands (re-initialize, force-sync).
>>>>>>
>>>>>>
>>>>>> 3) It would be nice to be able to run clean_ruv command in an
>>>>>> unattended way
>>>>>> (for better testing), i.e. respect --force option as we already do for
>>>>>> ipa-replica-manage del. This fix would aid test automation in the
>>>>>> future.
>>>>>>
>>>>>>
>>>>>> 4) (minor) The new question (and the del too) does not react too
>>>>>> well for
>>>>>> CTRL+D:
>>>>>>
>>>>>> # ipa-replica-manage clean_ruv 3 --force
>>>>>> Clean the Replication Update Vector for
>>>>>> vm-055.idm.lab.bos.redhat.com:389
>>>>>>
>>>>>> Cleaning the wrong replica ID will cause that server to no
>>>>>> longer replicate so it may miss updates while the process
>>>>>> is running. It would need to be re-initialized to maintain
>>>>>> consistency. Be very careful.
>>>>>> Continue to clean? [no]: unexpected error:
>>>>>>
>>>>>>
>>>>>> 5) Help for clean_ruv command without a required parameter is quite
>>>>>> confusing
>>>>>> as it reports that command is wrong and not the parameter:
>>>>>>
>>>>>> # ipa-replica-manage clean_ruv
>>>>>> Usage: ipa-replica-manage [options]
>>>>>>
>>>>>> ipa-replica-manage: error: must provide a command [clean_ruv |
>>>>>> force-sync |
>>>>>> disconnect | connect | del | re-initialize | list | list_ruv]
>>>>>>
>>>>>> It seems you just forgot to specify the error message in the command
>>>>>> definition
>>>>>>
>>>>>>
>>>>>> 6) When the remote replica is down, the clean_ruv command fails with an
>>>>>> unexpected error:
>>>>>>
>>>>>> [root at vm-086 ~]# ipa-replica-manage clean_ruv 5
>>>>>> Clean the Replication Update Vector for
>>>>>> vm-055.idm.lab.bos.redhat.com:389
>>>>>>
>>>>>> Cleaning the wrong replica ID will cause that server to no
>>>>>> longer replicate so it may miss updates while the process
>>>>>> is running. It would need to be re-initialized to maintain
>>>>>> consistency. Be very careful.
>>>>>> Continue to clean? [no]: y
>>>>>> unexpected error: {'desc': 'Operations error'}
>>>>>>
>>>>>>
>>>>>> /var/log/dirsrv/slapd-IDM-LAB-BOS-REDHAT-COM/errors:
>>>>>> [04/Jul/2012:06:28:16 -0400] NSMMReplicationPlugin -
>>>>>> cleanAllRUV_task: failed
>>>>>> to connect to repl        agreement connection
>>>>>> (cn=meTovm-055.idm.lab.bos.redhat.com,cn=replica,
>>>>>>
>>>>>> cn=dc\3Didm\2Cdc\3Dlab\2Cdc\3Dbos\2Cdc\3Dredhat\2Cdc\3Dcom,cn=mapping
>>>>>> tree,cn=config), error 105
>>>>>> [04/Jul/2012:06:28:16 -0400] NSMMReplicationPlugin -
>>>>>> cleanAllRUV_task: replica
>>>>>> (cn=meTovm-055.idm.lab.
>>>>>> bos.redhat.com,cn=replica,cn=dc\3Didm\2Cdc\3Dlab\2Cdc\3Dbos\2Cdc\3Dredhat\2Cdc\3Dcom,cn=mapping
>>>>>>
>>>>>>
>>>>>>
>>>>>> tree,   cn=config) has not been cleaned.  You will need to rerun the
>>>>>> CLEANALLRUV task on this replica.
>>>>>> [04/Jul/2012:06:28:16 -0400] NSMMReplicationPlugin -
>>>>>> cleanAllRUV_task: Task
>>>>>> failed (1)
>>>>>>
>>>>>> In this case I think we should inform user that the command failed,
>>>>>> possibly
>>>>>> because of disconnected replicas and that they could enable the
>>>>>> replicas and
>>>>>> try again.
>>>>>>
>>>>>>
>>>>>> 7) (minor) "pass" is now redundant in replication.py:
>>>>>> +        except ldap.INSUFFICIENT_ACCESS:
>>>>>> +            # We can't make the server we're removing read-only but
>>>>>> +            # this isn't a show-stopper
>>>>>> +            root_logger.debug("No permission to switch replica to
>>>>>> read-only,
>>>>>> continuing anyway")
>>>>>> +            pass
>>>>>>
>>>>>
>>>>> I think this addresses everything.
>>>>>
>>>>> rob
>>>>
>>>> Thanks, almost there! I just found one more issue which needs to be fixed
>>>> before we push:
>>>>
>>>> # ipa-replica-manage del vm-055.idm.lab.bos.redhat.com --force
>>>> Directory Manager password:
>>>>
>>>> Unable to connect to replica vm-055.idm.lab.bos.redhat.com, forcing
>>>> removal
>>>> Failed to get data from 'vm-055.idm.lab.bos.redhat.com': {'desc': "Can't
>>>> contact LDAP server"}
>>>> Forcing removal on 'vm-086.idm.lab.bos.redhat.com'
>>>>
>>>> There were issues removing a connection: %d format: a number is
>>>> required, not str
>>>>
>>>> Failed to get data from 'vm-055.idm.lab.bos.redhat.com': {'desc': "Can't
>>>> contact LDAP server"}
>>>>
>>>> This is a traceback I retrieved:
>>>> Traceback (most recent call last):
>>>>     File "/sbin/ipa-replica-manage", line 425, in del_master
>>>>       del_link(realm, r, hostname, options.dirman_passwd, force=True)
>>>>     File "/sbin/ipa-replica-manage", line 271, in del_link
>>>>       repl1.cleanallruv(replica_id)
>>>>     File
>>>> "/usr/lib/python2.7/site-packages/ipaserver/install/replication.py",
>>>> line 1094, in cleanallruv
>>>>       root_logger.debug("Creating CLEANALLRUV task for replica id %d" %
>>>> replicaId)
>>>>
>>>>
>>>> The problem here is that you don't convert replica_id to int in this
>>>> part:
>>>> +    replica_id = None
>>>> +    if repl2:
>>>> +        replica_id = repl2._get_replica_id(repl2.conn, None)
>>>> +    else:
>>>> +        servers = get_ruv(realm, replica1, dirman_passwd)
>>>> +        for (netloc, rid) in servers:
>>>> +            if netloc.startswith(replica2):
>>>> +                replica_id = rid
>>>> +                break
>>>>
>>>> Martin
>>>>
>>>
>>> Updated patch using new mechanism in 389-ds-base. This should more
>>> thoroughly clean out RUV data when a replica is being deleted, and
>>> provide for a way to delete RUV data afterwards too if necessary.
>>>
>>> rob
>>
>> Rebased patch
>>
>> rob
>>
>
> 0) As I wrote in a review for your patch 1041, changelog entry slipped elsewhere.
>
> 1) The following KeyboardInterrupt except class looks suspicious. I know why
> you have it there, but since it is generally a bad thing to do, some comment
> why it is needed would be useful.
>
> @@ -256,6 +263,17 @@ def del_link(realm, replica1, replica2, dirman_passwd,
> force=False):
>       repl1.delete_agreement(replica2)
>       repl1.delete_referral(replica2)
>
> +    if type1 == replication.IPA_REPLICA:
> +        if repl2:
> +            ruv = repl2._get_replica_id(repl2.conn, None)
> +        else:
> +            ruv = get_ruv_by_host(realm, replica1, replica2, dirman_passwd)
> +
> +        try:
> +            repl1.cleanallruv(ruv)
> +        except KeyboardInterrupt:
> +            pass
> +
>
> Maybe you just wanted to do some cleanup and then "raise" again?

No, it is there because it is safe to break out of it. The task will 
continue to run. I added some verbiage.

>
> 2) This is related to 1), but when some replica is down, "ipa-replica-manage
> del" may wait indefinitely when some remote replica is down, right?
>
> # ipa-replica-manage del vm-055.idm.lab.bos.redhat.com
> Deleting a master is irreversible.
> To reconnect to the remote master you will need to prepare a new replica file
> and re-install.
> Continue to delete? [no]: y
> ipa: INFO: Setting agreement
> cn=meTovm-086.idm.lab.bos.redhat.com,cn=replica,cn=dc\=idm\,dc\=lab\,dc\=bos\,dc\=redhat\,dc\=com,cn=mapping
> tree,cn=config schedule to 2358-2359 0 to force synch
> ipa: INFO: Deleting schedule 2358-2359 0 from agreement
> cn=meTovm-086.idm.lab.bos.redhat.com,cn=replica,cn=dc\=idm\,dc\=lab\,dc\=bos\,dc\=redhat\,dc\=com,cn=mapping
> tree,cn=config
> ipa: INFO: Replication Update in progress: FALSE: status: 0 Replica acquired
> successfully: Incremental update succeeded: start: 0: end: 0
> Background task created to clean replication data
>
> ... after about a minute I hit CTRL+C
>
> ^CDeleted replication agreement from 'vm-086.idm.lab.bos.redhat.com' to
> 'vm-055.idm.lab.bos.redhat.com'
> Failed to cleanup vm-055.idm.lab.bos.redhat.com DNS entries: NS record does not
> contain 'vm-055.idm.lab.bos.redhat.com.'
> You may need to manually remove them from the tree
>
> I think it would be better to inform user that some remote replica is down or
> at least that we are waiting for the task to complete. Something like that:
>
> # ipa-replica-manage del vm-055.idm.lab.bos.redhat.com
> ...
> Background task created to clean replication data
> Replication data clean up may take very long time if some replica is unreachable
> Hit CTRL+C to interrupt the wait
> ^C Clean up wait interrupted
> ....
> [continue with del]

Yup, did this in #1.

>
> 3) (minor) When there is a cleanruv task running and you run
> "ipa-replica-manage del", there is a unexpected error message with duplicate
> task object in LDAP:
>
> # ipa-replica-manage del vm-072.idm.lab.bos.redhat.com --force
> Unable to connect to replica vm-072.idm.lab.bos.redhat.com, forcing removal
> FAIL
> Failed to get data from 'vm-072.idm.lab.bos.redhat.com': {'desc': "Can't
> contact LDAP server"}
> Forcing removal on 'vm-086.idm.lab.bos.redhat.com'
>
> There were issues removing a connection: This entry already exists   <<<<<<<<<
>
> Failed to get data from 'vm-072.idm.lab.bos.redhat.com': {'desc': "Can't
> contact LDAP server"}
> Failed to cleanup vm-072.idm.lab.bos.redhat.com DNS entries: NS record does not
> contain 'vm-072.idm.lab.bos.redhat.com.'
> You may need to manually remove them from the tree
>
>
> I think it should be enough to just catch for "entry already exists" in
> cleanallruv function, and in such case print a relevant error message bail out.
> Thus, self.conn.checkTask(dn, dowait=True) would not be called too.

Good catch, fixed.

>
>
> 4) (minor): In make_readonly function, there is a redundant "pass" statement:
>
> +    def make_readonly(self):
> +        """
> +        Make the current replication agreement read-only.
> +        """
> +        dn = DN(('cn', 'userRoot'), ('cn', 'ldbm database'),
> +                ('cn', 'plugins'), ('cn', 'config'))
> +
> +        mod = [(ldap.MOD_REPLACE, 'nsslapd-readonly', 'on')]
> +        try:
> +            self.conn.modify_s(dn, mod)
> +        except ldap.INSUFFICIENT_ACCESS:
> +            # We can't make the server we're removing read-only but
> +            # this isn't a show-stopper
> +            root_logger.debug("No permission to switch replica to read-only,
> continuing anyway")
> +            pass         <<<<<<<<<<<<<<<

Yeah, this is one of my common mistakes. I put in a pass initially, then 
add logging in front of it and forget to delete the pass. Its gone now.

>
>
> 5) In clean_ruv, I think allowing a --force option to bypass the user_input
> would be helpful (at least for test automation):
>
> +    if not ipautil.user_input("Continue to clean?", False):
> +        sys.exit("Aborted")

Yup, added.

rob

-------------- next part --------------
A non-text attachment was scrubbed...
Name: freeipa-rcrit-1031-5-cleanruv.patch
Type: text/x-diff
Size: 14216 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/freeipa-devel/attachments/20120906/489664a4/attachment.bin>


More information about the Freeipa-devel mailing list