[Pulp-dev] versioned repositories
mhrivnak at redhat.com
Thu May 18 02:03:56 UTC 2017
We've discussed versioned repositories and their merits in the past, but
I'd like to propose a specific direction, and inclusion in 3.0. As a recap
of goals, versions can help us answer two important questions about the
history of a repository:
1) What set of content is in a specific version of a repository?
2) What changed between two arbitrary versions of a repository?
I am proposing a model where Pulp creates a new version of a repository for
every operation that changes that repo's content. For example, a sync task
would create a single new version.
- You create repository "foo".
- You sync repository "foo", which produces version 1 of that repo.
- You sync once per day for some period of time, automatically creating a
new version each time.
- You publish repo "foo", which defaults to publishing the most recent
version.
- You don't like something that's new in the repo, so you roll back by
publishing a previous version.
Data Model Basics
In the past we've stored the relationship between a content unit and a repo
as a standard many-to-many through table. There's a reference to a unit,
and a reference to a repo.
The version scheme I'm pitching adds two new fields to that through table:
vadded - a foreign key to the repo version in which this content unit was
added.
vremoved - a foreign key to the repo version in which this content unit was
removed. This can be null.
Multiple entries can exist for the same content unit and repo, so long as a
new one is not added until the previous one's "vremoved" field is set.
With this structure, it is easy to query the database to answer both
questions we started with.
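
As a rough illustration of those two queries, here is a minimal sketch using
an in-memory SQLite table rather than the actual PoC models. The table name
"repo_content", the helpers "content" and "diff", and the sample rows are all
hypothetical. It also assumes versions are plain increasing integers, where
the proposal uses foreign keys to repo version objects.

```python
import sqlite3

# Hypothetical through table: one row per (repo, unit) membership span.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE repo_content (
    repo     TEXT NOT NULL,
    unit     TEXT NOT NULL,
    vadded   INTEGER NOT NULL,  -- version in which the unit was added
    vremoved INTEGER            -- version in which it was removed, or NULL
);
""")

# Illustrative data only: "b" is removed in version 3 and re-added in 4.
conn.executemany(
    "INSERT INTO repo_content VALUES (?, ?, ?, ?)",
    [
        ("foo", "a", 1, None),
        ("foo", "b", 1, 3),
        ("foo", "c", 2, None),
        ("foo", "b", 4, None),
    ],
)


def content(repo, version):
    """Question 1: units added at or before `version` and not yet removed."""
    cur = conn.execute(
        "SELECT unit FROM repo_content "
        "WHERE repo = ? AND vadded <= ? "
        "AND (vremoved IS NULL OR vremoved > ?)",
        (repo, version, version),
    )
    return {row[0] for row in cur}


def diff(repo, old, new):
    """Question 2: (added, removed) sets going from `old` to `new`."""
    before, after = content(repo, old), content(repo, new)
    return after - before, before - after


print(sorted(content("foo", 3)))
print(diff("foo", 1, 3))
```

The diff here is computed by comparing two content sets, which keeps the
sketch short. A real implementation could likely answer it with a single
query on vadded and vremoved instead.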
Some endpoint will be made that gives access to the versions of a specific
repository. Ideally we would have a nested endpoint like this:
But nested views have been a problem for us with DRF (Django REST
framework). If we aren't able to make that happen, I've gotten this to work
in my PoC branch:
It's not yet clear how best to represent content through the REST API. A
nested endpoint within the repo version object would be ideal.
Operations on a repo where a version could be chosen, such as a publish,
should default to the latest version. It's an open question how best to
represent that, and perhaps it takes the form of two endpoints:
default to latest: POST /api/v3/repositories/foo/distributors/bar/publish
specify a version: POST /api/v3/repositories/foo/versions/4/publish
But that's just one idea. Much about our REST API layout has yet to be
written in stone, and we have flexibility.
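
Whichever endpoint layout wins, the "default to latest" rule itself is small.
A minimal sketch, assuming integer version numbers and a hypothetical helper
named "resolve_version" (not part of any existing Pulp API):

```python
def resolve_version(available, requested=None):
    """Pick the version to publish.

    `available` is an iterable of existing version numbers (assumed to be
    integers). With no explicit request, fall back to the latest version.
    """
    versions = set(available)
    if not versions:
        raise ValueError("repository has no versions to publish")
    if requested is None:
        return max(versions)
    if requested not in versions:
        raise LookupError("no such version: %r" % (requested,))
    return requested


print(resolve_version([1, 2, 3, 4]))     # no version given: latest
print(resolve_version([1, 2, 3, 4], 2))  # explicit version, e.g. a rollback
```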
Notice that this changes the orphan workflow. Removing a content unit from
a repo doesn't make it an orphan. This helps reduce the need to run an
orphan cleanup task, which in turn helps avoid the inherent race condition
that task can introduce.
But you may not want to keep history forever, so a valuable feature will be
the ability to trim history. I think this would just be an operation that
squashes a bunch of versions together, and it could optionally take that
opportunity to immediately delete a content unit that becomes an orphan.
Illustrating the workflow, if you wanted to squash history prior to version
10, the task would:
- delete all of a repo's relationships in the through table where vremoved
is a version <= 10
- optionally check if each content unit is now an orphan and remove if so
- update all remaining entries where vadded < 10 by setting vadded to 10
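
The three steps above can be sketched in plain Python. This is an
illustration under assumptions, not the PoC code: the "Membership" class and
"squash" function are hypothetical, versions are assumed to be integers, and
the orphan check only looks within this one repo, where a real task would
need to check every repo before deleting a unit.

```python
from dataclasses import dataclass, replace
from typing import List, Optional, Set, Tuple


@dataclass(frozen=True)
class Membership:
    """One row of the (hypothetical) through table for a single repo."""
    unit: str
    vadded: int
    vremoved: Optional[int] = None


def squash(rows: List[Membership], version: int) -> Tuple[List[Membership], Set[str]]:
    """Squash history prior to `version`.

    Returns the surviving rows and the units that no longer appear in any
    surviving row (candidates for the optional orphan check).
    """
    # Step 1: drop relationships that ended at or before `version`.
    kept = [r for r in rows if r.vremoved is None or r.vremoved > version]
    dropped_units = {r.unit for r in rows} - {r.unit for r in kept}

    # Step 3: anything added earlier now counts as added in `version`.
    kept = [replace(r, vadded=version) if r.vadded < version else r for r in kept]

    # Step 2 (optional): the caller decides whether to delete these units.
    return kept, dropped_units


rows = [
    Membership("a", 1),
    Membership("b", 1, 3),
    Membership("c", 2, 12),
    Membership("b", 4),
    Membership("d", 5, 9),
]
kept, orphan_candidates = squash(rows, 10)
print(kept)
print(orphan_candidates)
```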
I have a branch with proof-of-concept code here:
The models are the most interesting place to look. In particular, I'm very
pleased with how simple the "content()" method is, which returns a QuerySet
matching all the content in a given version.
The rest is REST ;) API stuff mostly, which isn't all that interesting
except to demonstrate how the data could potentially be exposed. You can
run the included tests, found in the root of the git repo, to load some
data into the database. (I made them just for dev purposes and am not sure
they deserve a long-term home.) Then you can hit this endpoint as an
example:
Obviously this code is rough, so please consider it for directional and
conceptual purposes only. Assume major additions and improvements if we
follow through on this concept.
Tracking history in this way opens up great possibilities. Some examples:
Promotion could become a matter of having two publishers on a repo with
different settings, one for "testing" and one for "production", and just
publishing whichever version you like with each. Multiple repos and copy
operations are no longer needed for promotion. Austin suggested that the
ability to tag versions with arbitrary key:value pairs could enhance this
An added concept, which could come post-3.0, is tracking publications more
explicitly and associating each with a version, although I could see a case
for laying this groundwork now before the API is locked down. Promotion
could become more about making a publication available in a different
location, rather than re-creating it. We'd also know which content is part
of a publication, and guarantee that content doesn't get removed before the
publication does. This is a deficiency we have in Pulp 2.
Pulp-to-Pulp sync could become very efficient, since the receiving Pulp
could easily replicate only the changes since its last sync.
Incremental exports become more concrete. Rather than depending on a
timestamp, you can know with certainty which version you have in the remote
location, and thus which newer versions need to be exported.
We could add a "finalized" boolean or similar to a version, and use that to
know if it was successfully completed. If not, for example if a sync task
stopped abruptly, the incomplete version could easily be recognized and
Please ask questions, provide feedback, add ideas, suggest alternatives,
etc. I'm perfectly happy even throwing this PoC away if we come up with
Principal Software Engineer, RHCE