[Freeipa-devel] [PATCH] WIP backup and restore

Petr Viktorin pviktori at redhat.com
Mon Mar 25 18:08:15 UTC 2013


On 03/23/2013 05:06 AM, Rob Crittenden wrote:
> TL;DR. Sorry.
>
> Here is my current progress on backup and restore. I have not documented
> any of this in the Implementation section of the wiki yet.
>
> I've added two new commands, ipa-backup and ipa-restore.
>
> The files go into /var/lib/ipa/backup. When doing a restore you should
> only reference the directory in backup, not the full path. This needs to
> change, but it is what it is.
>
> There are strict limits on what can be restored where. Only exact
> matching hostnames and versions are allowed right now. We can probably
> relax the hostname requirement if we're only restoring data, and the
> version perhaps for only the first two values (so you can restore a
> 3.0.0 backup on 3.0.1 but not on 3.1.0).

Do we also need to limit the versions of Dogtag, 389, Kerberos...?
Or is what they put in /var/lib guaranteed portable across versions?
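For the IPA version itself, relaxing the check to the first two
components could be as simple as something like this (just a sketch;
the helper name is made up):

    # Hypothetical helper: treat versions as compatible if the first two
    # components (major.minor) match, so 3.0.0 -> 3.0.1 is allowed but
    # 3.0.0 -> 3.1.0 is not.
    def versions_compatible(backup_version, current_version):
        return (backup_version.split('.')[:2] ==
                current_version.split('.')[:2])

    assert versions_compatible('3.0.0', '3.0.1')
    assert not versions_compatible('3.0.0', '3.1.0')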

> I've done 99.99% of testing in F-18 with a single instance. I did some
> initial testing in 6.4 so I think the roots are there, but they are
> untested currently.
>
> I spent a lot of time going in circles when doing a restore and getting
> replication right. I'm open to discussion on this, but my purpose for
> restoration was to define a new baseline for the IPA installation. It is
> basically the catastrophic case, where your data is
> hosed/untested/whatever and you just want to get back to some sane point.
>
> Ok, so given that, we need to make sure that any other masters don't send
> us any updates from their changelogs when they come back online. So I use a
> new feature in 1.3.0 to disable the replication agreements. This works
> really, really well.
>
> The only problem is you have to re-enable the agreement in order to
> re-initialize a master (https://fedorahosted.org/389/ticket/47304). I
> have the feeling that this leaves a small window where replication can
> occur and pollute our restored master. I noticed that we do a
> force_sync() when doing a full re-init. It may be that if we dropped it
> that would also mitigate this.
>
> I did the majority of my testing using an A <-> B <-> C replication
> topology. This exposed a lot of issues that A <-> B did not. I don't
> know if it was the third server or having the extra hop, but I hopefully
> closed a bunch of the corner cases.
>
> So what I would do is either a full or a data restore on A. This would
> break replication on B and C, as expected. So in this scenario A and B
> are CAs.
>
> Then I'd run this on B:
>
> # ipa-replica-manage re-initialize --from=A
> # ipa-csreplica-manage re-initialize --from=A
>
> Once that was done I'd run this on C:
>
> # ipa-replica-manage re-initialize --from=B
>
> The restoration of the dogtag databases was the last thing I did so it
> isn't super-well tested. I had to move a fair bit of code around. I
> think it's the sort of thing that will work when everything goes
> well but exceptions may not be well-handled.
>
> The man pages are just a shell right now; they need a lot of work.
>
> It should also be possible to do a full system restore. I tested with:
>
> # ipa-server-install ...
> # <add a bunch of data, 100 entries or more>
> # ipa-backup
> # add one or more users
> # ipa-server-install --uninstall -U
> # ipa-restore ipa-full-...
>
> The last batch of users should be gone. I did similar tests with the
> A/B/C set up.
>
> I ran the unit tests against it and all was well.
>
> I have done zero testing in a Trust environment, though at least some of
> the files are backed up in the full case. I did some testing with DNS.
>
> I did no testing of a master that was down at the time of restoration
> and then was brought online later, so it never had its replication
> agreement disabled. I have the feeling it will hose the data.
>
> I have some concern over space requirements. Because I tar things up, one
> basically needs double the backup space in order to do a restore, and a
> bit more when encrypted. I'm open to suggestions on how to store the
> data, but we have many files for the 389-ds bak backup and I didn't want
> to have to encrypt them all.
>
> On that note, I'm doing a db2bak AND a db2ldif backup and currently
> using the ldif2db for the restore. My original intention was to use
> db2bak/bak2db in order to retain the changelog, but retaining the
> changelog is actually a problem if we're restoring to a known state and
> forcing a re-init. It wouldn't take much to convince me to drop that,
> which reduces the # of files we have to deal with.
>
> I also snuck in a change to the way that replication is displayed. It
> has long bothered me that we print out an Updating message during
> replication because it gives no context. I changed it to be more of a
> progress indicator, using \r to over-write itself and include the # of
> seconds elapsed. The log files are still readable but I'd hate to see
> what this looks like in a typescript :-)
>
> Finally, sorry about the huge patch. I looked at the incremental commits
> I had done and I didn't think they'd tell much of a story. I thought
> about breaking some of the pieces out, but there is a lot of
> interdependency, so you'd need everything eventually anyway, so here you
> go, a 70k patch.
>
> A quick roadmap is (and I'll skip the obvious):
>
> ipa-csreplica-manage: had to move a bunch of code to replication.py so I
> could use the CSReplicaManager class in ipa-restore. The rest revolves
> around handling exceptions raised by get_cs_replication_manager() and
> enabling replication on both sides to do a re-init.
>
> replication.py: the changes mentioned to pretty-print updates, a notable
> change where a re-init automatically enables replication, the new
> enable/disable code and all the code from ipa-csreplica-manage.
>
> ipa-backup/ipa-restore use the new admintool framework. I generally have
> a good impression of the new framework, but I found myself having to run
> in debug mode more than I'd have liked in order to discover some types
> of errors.

That's good to hear!

I think while developing, you should run with -v to get all the output.
And for production, please have the commands log by default (set 
log_file_name).
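
Something like this in the command class should do it (the log path
here is just a placeholder):

    from ipapython import admintool

    class Backup(admintool.AdminTool):
        command_name = 'ipa-backup'
        # With log_file_name set, the framework logs to this file by
        # default, not just to the console.
        log_file_name = '/var/log/ipabackup.log'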

> ipa-backup does all its binds using ldapi. ipa-restore needs the DM
> password because we need to contact remote servers to enable/disable
> replication.

The restore assumes that ipa is already installed, right? I can't just 
run it on a clean machine? It would be good to check/document this.

I looked in the backup tarball and saw the dirsrv PID file and lock 
directories. Are these needed?


The tool runners (install/tools/ipa-backup and 
install/tools/ipa-restore) are missing, so IPA doesn't build. Probably 
just a forgotten git add.

The patch adds an extra backslash in install/tools/man/Makefile.am; 
consider adding $(NULL) at the end.


Backup.dirs, Backup.files, Backup.logs:
The code modifies these class-level attributes. This means you can't run 
the command more than once in one Python session, so it can't be used as 
a library (e.g. in a hypothetical test suite).
Please make the class attributes immutable (tuples), then in __init__ do 
`self.dirs = list(self.dirs)` to get instance-specific lists.
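A minimal sketch of what I mean (the paths are examples only, and I'm
assuming the usual __init__(self, options, args) signature from the
admintool framework):

    class Backup(admintool.AdminTool):
        # Class-level defaults as immutable tuples
        dirs = ('/etc/dirsrv', '/usr/share/ipa/html')
        files = ('/etc/hosts',)

        def __init__(self, options, args):
            super(Backup, self).__init__(options, args)
            # Instance-specific mutable copies: appending to these no
            # longer modifies the class attribute, so the command can
            # be run more than once in the same process.
            self.dirs = list(self.dirs)
            self.files = list(self.files)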

Code that creates temporary files/directories or does chdir() should be 
next to (or in) a try block whose finally: restores things.
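That is, the generic pattern (not your actual code):

    import os
    import shutil
    import tempfile

    tmpdir = tempfile.mkdtemp()
    old_cwd = os.getcwd()
    try:
        os.chdir(tmpdir)
        # ... do the real work here ...
    finally:
        # Always restore the working directory and clean up, even if
        # the work above raised an exception.
        os.chdir(old_cwd)
        shutil.rmtree(tmpdir)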

Instead of star imports (from ipapython.ipa_log_manager import *), 
please explicitly import whatever you're using from that package. In 
this case, nothing at all!


If there's a get_connection() method, it should return the connection, 
and it should always be used to get it. Store the connection in 
self._conn, and rather than:
     self.get_connection()
     self.conn.do_work()
do:
     conn = self.get_connection()
     conn.do_work()
This makes forgetting to call get_connection() impossible.
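
In other words, something like this (the _connect() helper is
hypothetical):

    def get_connection(self):
        # Create the connection lazily, cache it in self._conn, and
        # always hand it back to the caller.
        if self._conn is None:
            self._conn = self._connect()
        return self._conn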

If a variable is only used in a single method, like `filename` in 
Restore.extract_backup, don't store it in the admintool object.
In general, try to avoid putting data in self if possible. It's 
convenient but it allows complex interdependencies.
(Yes, I'm guilty of not following this myself.)


In several places, the backup tool logs critical errors and then just 
continues on. Is that by design? I think a nonzero exit code would be 
appropriate.
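Where a failure really is fatal, raising ScriptError makes the command
exit with a nonzero status, e.g. (the message and stderr variable here
are made up):

    # Instead of:
    #     self.log.critical('db2ldif failed: %s' % stderr)
    # and carrying on, abort the command:
    raise admintool.ScriptError('db2ldif failed: %s' % stderr)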


In the ipa-restore man page, --backend is not documented.

In db2ldif, db2bak, etc., a more concise way to get the time string is 
`time.strftime('export_%Y_%m_%d_%H_%M_%S')`.

When validating --gpg-keyring, it would be good to check both files 
(.sec and .pub).
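Something along these lines (just a sketch):

    import os
    from ipapython import admintool

    def check_keyring(keyring):
        # --gpg-keyring is given without an extension; both the secret
        # and public keyring files have to exist.
        for ext in ('.sec', '.pub'):
            if not os.path.exists(keyring + ext):
                raise admintool.ScriptError(
                    'No such file: %s' % (keyring + ext))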

Here:
     if (self.backup_type != 'FULL' and not options.data_only and
         not instances):
         raise admintool.ScriptError(
             'Cannot restore a data backup into an empty system')
did you mean `(self.backup_type != 'FULL' or options.data_only) and not 
instances`? (I'd introduce a `data_only` flag to prevent confusion.)
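Roughly:

    # Suggested rewrite with an explicit flag:
    data_only = self.backup_type != 'FULL' or options.data_only
    if data_only and not instances:
        raise admintool.ScriptError(
            'Cannot restore a data backup into an empty system')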

In the ReplicationManager.check_repl_init reporting: use 
int(d.total_seconds()) instead of d.seconds, as the latter doesn't 
include full days. I don't think anyone would wait long enough for the 
overflow, but better to be correct.
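For illustration:

    import datetime

    d = datetime.timedelta(days=1, seconds=5)
    print(d.seconds)               # 5 -- the days part is lost
    print(int(d.total_seconds()))  # 86405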

-- 
Petr³
