[Pulp-list] Omitting RPMs in backups

Nicolas FOURNIALS nicolas.fournials at cc.in2p3.fr
Tue Oct 11 19:18:27 UTC 2016


Thanks for your answers. Sorry, I realise I forgot to give an overview
of the context, so here it is:

Kodiak Firesmith:
> What is your upstream backup system? And does it do basic backups of
> everything you tell it every night or is it smart enough to only back up
> what has changed?
> 
> Are your concerns related to data occupied, or backup duration time, or
> something else?

We use IBM Spectrum Protect (formerly TSM). It's smart enough to sync
efficiently, but I was thinking about saving space to optimise things.


Michael Hrivnak:
> If I understand correctly, I think you want to have one or more yum repos
> managed somewhere outside of pulp that is your "backup". They contain every
> RPM and similar that pulp knows about. Naive question: what is it about
> this type of backup that is more convenient for you than just making a
> traditional backup of /var/lib/pulp/content/ ?

We would use Pulp to manage a mix of:
- RPM repos which would be copies of RHEL, CentOS, ... repos, though a
bit lighter (short retention)
- a few user-managed RPM repos to host specific, manually uploaded RPMs.

This means the vast majority of the disk space used on our Pulp server
will be occupied by RPMs available from several sources around the
world, including a mirror we host. So there are a few hundred GB to
save in our backup system if we avoid backing up that data: not
critical, but appreciated.
However, this might be a little tricky, as we would still need to back
up the manually uploaded RPMs, probably by following the symlinks in
/var/lib/pulp/published/yum/[...]/my_user_repos to find them (a sketch
of that idea follows below).
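Something like this minimal, untested sketch; the published path is a
placeholder for our actual user repos, and the output is just a file
list that could feed the backup tool's include list:

#!/usr/bin/env python
# Untested sketch: resolve the symlinks under a published user repo to
# list the real files under /var/lib/pulp/content/ that the backup
# must include. PUBLISHED is a placeholder path; adjust to the layout.
import os

PUBLISHED = '/var/lib/pulp/published/yum/https/repos/my_user_repos'
CONTENT = '/var/lib/pulp/content/'

targets = set()
for root, dirs, files in os.walk(PUBLISHED):
    for name in files:
        path = os.path.join(root, name)
        if os.path.islink(path):
            real = os.path.realpath(path)
            if real.startswith(CONTENT):
                targets.add(real)

# One absolute path per line, e.g. for the backup tool's include list.
for path in sorted(targets):
    print(path)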


> Questions aside, this is theoretically doable. You'll need to be set up for
> pulp's deferred content download feature, which includes deploying squid or
> an equivalent proxy.
> 
> http://docs.pulpproject.org/user-guide/deferred-download.html
> 
> You would first restore your database, and then create a repo in pulp for
> each of these backup repos. For each one:
> 
> - set the download policy to "on_demand"
> - sync. This should discover that each content unit is already in the
> database, associate it to the repo, and populate the on_demand catalog with
> knowledge of its location in this giant feed
> - run the download_repo task with the "verify_all_units" option set to
> True. This will go through each file of each unit, discover it's missing,
> and then download it from the link that was cataloged above.
> ---
> http://docs.pulpproject.org/dev-guide/integration/rest-api/repo/sync.html#download-a-repository
> - delete your "backup" repos from pulp
> 
> This is only possible for yum repos currently, until support for deferred
> download is added to other plugins.
> 
> If you do go through with this as a plan, let us know how testing goes, and
> what tips you would have for the next person who tries it.

Great, that was the kind of idea I was looking for, thanks a lot!
I guess the temporary repo will need to use a different feed URL from
that of the saved repo.
OK, I will try it and give some feedback. For the record, the sequence
I picture is sketched below.
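A rough, untested sketch against the Pulp 2 v2 REST API; the base URL,
repo id, feed URL and credentials are placeholders, and each POST
actually returns a task that should be polled before the next step:

#!/usr/bin/env python
# Untested sketch of the restore sequence described above, via the
# Pulp 2 v2 REST API. All names below are placeholders.
import requests

BASE = 'https://pulp.example.com/pulp/api/v2'
AUTH = ('admin', 'admin')        # placeholder credentials
REPO = 'restore-centos7-base'    # temporary repo, deleted at the end

# 1. Create the temporary repo with the "backup" feed and the
#    on_demand download policy.
requests.post(BASE + '/repositories/', auth=AUTH, verify=False, json={
    'id': REPO,
    'importer_type_id': 'yum_importer',
    'importer_config': {
        'feed': 'https://backup.example.com/repos/centos7-base/',
        'download_policy': 'on_demand',
    },
})

# 2. Sync: should re-associate the units already in the restored
#    database and populate the on_demand catalog with their location.
requests.post(BASE + '/repositories/%s/actions/sync/' % REPO,
              auth=AUTH, verify=False, json={})

# 3. Download every unit with verification, so missing files are
#    fetched from the feed.
requests.post(BASE + '/repositories/%s/actions/download/' % REPO,
              auth=AUTH, verify=False, json={'verify_all_units': True})

# 4. Drop the temporary repo once the content is back on disk.
requests.delete(BASE + '/repositories/%s/' % REPO, auth=AUTH, verify=False)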



> 
> Michael
> 
> On Mon, Oct 10, 2016 at 12:03 PM, Nicolas FOURNIALS <
> nicolas.fournials at cc.in2p3.fr> wrote:
> 
>> Hi,
>>
>> I was wondering if there was a way to make Pulp backups lighter,
>> without saving RPMs that are available elsewhere?
>> The backup doc[1] is clear that everything in /var/lib/pulp should
>> be saved. But it would be very efficient if we could restore a Pulp
>> install by just having everything else in place (including manually
>> uploaded RPMs) and then launching a sync to re-download every RPM
>> coming from a feed repo.
>>
>> My question may of course apply to other content types.
>>
>>
>> [1] https://docs.pulpproject.org/user-guide/server.html#backups



