[Freeipa-devel] Backup and Restore design

Wed Feb 20 17:19:02 UTC 2013

On 02/20/2013 09:44 AM, Rob Crittenden wrote:
> Rich Megginson wrote:
>> On 02/20/2013 08:38 AM, Rob Crittenden wrote:
>>> Simo Sorce wrote:
>>>> On Tue, 2013-02-19 at 22:43 -0500, Rob Crittenden wrote:
>>>>> I've looked into some basic backup and restore procedures for IPA. My
>>>>> findings are here: http://freeipa.org/page/V3/Backup_and_Restore
>>>>
>>>> Great summary!
>>>>
>>>> For the catastrophic failure scenario, should we mention how to put
>>>> a fully failed and restored machine back online? I am thinking the
>>>> restored server may be behind (even if only by a few entries) in
>>>> replication, so the CSNs on the other replicas will not match.
>>>> I guess we should mention that a full resync may be needed? Or is
>>>> there a way to bring back CSNs on replicas?
>>>
>>> Good questions. It depends on how long the machine was down and how
>>> many changes have happened. It is possible that one would want to do a
>>> full re-init. I'll add that to the design.
>>
>> The replication protocol will detect if a replica is too far out of date
>> to be brought up to date with an incremental update, and will require a
>> re-init.
>
> Ok, I'll update the design with this, thanks.
>
>>
>>>
>>>> In the 'Returning to a good state' case, can we consider some split
>>>> brain approach, where we sever replication and rebuild one server at a
>>>> time?
>>>
>>> Perhaps using a firewall, but then you run the risk of each of those
>>> servers accepting changes during the rebuild and you could end up with
>>> a lot of collisions, which sort of goes against the point of restoring
>>> to a known good state.
>>>
>>> The changelog is the key here. I'll have to ponder this one a bit, I'm
>>> a bit conflicted on the right approach.
>>>
>>>> Maybe we can think of a way to 'mark' all servers as 'bad' so that on
>>>> restore the replication agreements do not need to be changed but
>>>> changes from 'bad' servers will not be accepted?
>>>> I guess this also overlaps with someone's request to be able to
>>>> 'pause' replication, which would use the same mechanism.
>>>
>>> AFAIK there is an option to pause replication now (at least in 1.3).
>>>
>>> What you can't do is drop the changelog AFAIK. That is the real
>>> problem. If you want to restore to a known state you need to drop all
>>> the changelog entries since that time. I'll check with the 389-ds team
>>> to see if that is possible. Since we know the time of the backup, we
>>> might be able to drop newer entries than that.
>>
>> Not sure what you mean - what exactly do you want to do with the
>> changelog?
>
> As an example, if we have 2 IPA masters and we restore the data on one
> of them, as soon as it comes back up the other is going to push the
> changelog onto it (as it should) so they are in sync again.
>
> So the question is, how do we restore several masters at the same time
> without applying changes from the changelog?
>
> What I was going to ask is, can we delete all changelog entries from 
> Time X until now? That would prevent the sync issues, but it would 
> retain the part of the changelog we care about.

Is this the problem you are trying to solve?

You have a situation where some bogus data was introduced into your 
system, and that bogus data has now been replicated everywhere.  You 
want to roll back the state of everything to before the bogus data was
introduced.  Let's assume you want to delete the bogus data and 
everything that happened after that.

The first step is to pick a server to restore, and restore that server 
from a backup.  The first problem is that this server will need to 
reject any replicated updates, but still allow regular client updates, 
after the restore process is complete (the db is in read-only mode 
during the restore).  The only way to do this now would be to first 
disable all replication agreements on all other replicas going to this 
server, which would be quite painful. Alternatively, during the restore
process, change the replica generation of the restored server. Other
servers would then see the different replica generation and refuse to
send updates (and report lots of replication errors).
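
For illustration only, the "disable all agreements" route might look like the
Python sketch below. It only builds the LDIF you would feed to ldapmodify on
each of the other replicas; it does not talk to the server. It assumes a
389-ds version that supports the nsds5ReplicaEnabled attribute on agreement
entries, and the function name and the agreement DN in the example are
hypothetical.

```python
def disable_agreements_ldif(agreement_dns):
    """Return LDIF that switches off each replication agreement listed.

    Assumption: the server honours nsds5ReplicaEnabled on agreement entries.
    The caller supplies the full agreement DNs; nothing is looked up here.
    """
    records = []
    for dn in agreement_dns:
        records.append(
            "dn: %s\n"
            "changetype: modify\n"
            "replace: nsds5ReplicaEnabled\n"
            "nsds5ReplicaEnabled: off\n" % dn
        )
    # LDIF records are separated by a blank line.
    return "\n".join(records)


if __name__ == "__main__":
    # Hypothetical agreement DN, for illustration only.
    example = (r"cn=meToipa1.example.com,cn=replica,"
               r"cn=dc\=example\,dc\=com,cn=mapping tree,cn=config")
    print(disable_agreements_ldif([example]))
```

Having to run this against every other replica is the painful part; the
sketch just makes the amount of per-server work visible.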

Once the first server is restored, you would just use the online or
offline replica init procedure to bring the other servers in line with it.
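
As a rough sketch of the online init (untested, and not the only way to do
it): on the restored server, setting nsds5BeginReplicaRefresh to "start" on
an agreement asks the server to do a total update of the replica that
agreement points to. The Python below only generates the LDIF for that; the
function name and the DN in the example are hypothetical.

```python
def begin_refresh_ldif(agreement_dns):
    """Return LDIF that requests a total update over each agreement listed.

    The DNs are the agreements on the restored (supplier) server that point
    at the replicas to be re-initialized. Nothing is sent to the server.
    """
    records = [
        "dn: %s\n"
        "changetype: modify\n"
        "replace: nsds5BeginReplicaRefresh\n"
        "nsds5BeginReplicaRefresh: start\n" % dn
        for dn in agreement_dns
    ]
    # One blank line between LDIF records.
    return "\n".join(records)


if __name__ == "__main__":
    # Hypothetical agreement DN, for illustration only.
    example = (r"cn=meToipa2.example.com,cn=replica,"
               r"cn=dc\=example\,dc\=com,cn=mapping tree,cn=config")
    print(begin_refresh_ldif([example]))
```

The offline alternative would be to export from the restored server and
import on each replica, which avoids pushing everything over the wire.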

>
>>
>>>
>>>> Full system backup:
>>>> in the first part it is said the process is offline, but in the 'LDAP'
>>>> section you say ldapi is used, but that would mean DS is running ?
>>>> Also are we sure we can get all data we need via ldapi? Are we going
>>>> to miss any "operational" attribute?
>>>
>>> The full backup is offline because it is just using tar. This is sort
>>> of a brute-force backup, copying bits from A to B.
>>>
>>> The data backup is online and creates a task in 389-ds to write the
>>> data and changelog to a file. It should write everything completely.
>>> We don't do an ldapsearch.
>>>
>>> I chose to not back up in ldif because this would back up just the
>>> data and not the changelog. The other advantage is that the db2bak
>>> format includes the indexes and ldif restore would require a rebuild.
>>
>> It is good to have a long term backup in LDIF format - no matter what
>> happens to the database, if you have an LDIF backup, you can always
>> recreate your data.  So it's good to have both - db2bak format for
>> shorter term/frequent backups, and LDIF for longer term/infrequent 
>> backups.
>
> Ok, so maybe when we do a full backup first we snapshot the database 
> to ldif via db2ldif.pl, then back up the whole shebang?
>
>>>
>>>> For restore are we sure we can reload data w/o alterations? What
>>>> about plugins? Will we have to disable all plugins during a restore?
>>>
>>> Yes, it should be fine. I'm hoping that the version will help us with
>>> this, to prevent someone from restoring an ancient backup on a new
>>> system, for example (or the reverse).
>>>
>>>> For the open questions.
>>>>
>>>> Size of backup:
>>>> I think we should make it easy to configure the system to use a custom
>>>> directory to dump the backups. This way admins can make sure it is on
>>>> a different partition/disk or even over NFS and that the backup will
>>>> not fill up the disk on which DS is running.
>>>
>>> That's a good idea. I'll have to think about where we would configure
>>> that. Perhaps as an optional argument to the backup command.
>> You'll have to figure out a way around SELinux, or add some sort of
>> SELinux magic that allows db2bak to write there.
>>>
>>>> We should definitely allow encrypting backup files; a gpg public key
>>>> would be sufficient.
>>>
>>> Ok. I wasn't sure if there would be any corruption concerns.
>>>
>>>> For replica cases, maybe we can create a command that dumps the
>>>> changelog from a good replica and then allows us to replay all changes
>>>> that are missing from the backup to bring the server up to the last
>>>> minute?
>>>
>>> This would happen when we went online anyway though, at least for the
>>> entries currently in the changelog. I guess this would have the
>>> advantage of doing it in bulk and not over a (potentially) slow link.
>>
>> That should happen automatically with the replication protocol - it will
>> attempt to bring older replicas up-to-date or, if they are too far out
>> of date, will complain that they need a re-init.
>
> Yeah, I guess one way or another the bits are going to need to go over 
> that slow link. I was thinking about a sneakernet to move the 
> changelog, but yeah, in retrospect that doesn't make a lot of sense.
>
> thanks
>
> rob
>