[Pulp-dev] repository versions update

Thu Nov 30 20:20:05 UTC 2017

On Thu, Nov 30, 2017 at 11:43 AM, Mihai Ibanescu <mihai.ibanescu at gmail.com>
wrote:

> I am late to the thread, so I apologize if I repeat things that have been
> discussed already.
>
> Is it a meaningful use case to publish an older version of the repo? Once
> published, do you keep track of which version got published, and how do you
> decide which version to push next? This seems like a complication to me.
>
>
A publication will have a reference to the version that it was created
from. To illustrate how that would get used: Your CTO calls early on a
Saturday morning and says "I read in the news about a major security flaw
in cowsay, and I know our applications depend heavily on it. What version
do we have deployed right now???!!!" You can concretely determine which
publications are being currently "distributed" to your infrastructure, and
from there see their exact content sets by virtue of the repo version.

Then there is the promotion workflow, which in Pulp 2 requires a lot of
copying and re-publishing. With repo versions, you'll have a sequence of
versions of course. Let's say there's 1, 2 and 3. Version 1 is deployed
now, version 2 is undergoing testing, and version 3 got created last night
by the weekly sync job you set up. You would have two different distributions
that make these publications available to clients: one for production, and
one for testing. "Promotion" becomes just the act of updating the reference
on a distribution to a different publication. When testing on version 2 is
done, assuming it passes, you can update the production distribution to
make it use version 2.
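
As a rough sketch of that model (the class and function names below are
made up for illustration and are not an actual Pulp 3 API, and the cowsay
version strings are invented), promotion and the "what is deployed right
now?" lookup could look something like this:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RepositoryVersion:
    repository: str
    number: int
    content: frozenset  # immutable content set for this version


@dataclass(frozen=True)
class Publication:
    version: RepositoryVersion  # the version this publication was created from
    publisher: str


@dataclass
class Distribution:
    name: str
    publication: Publication  # what clients of this distribution currently see


def promote(distribution: Distribution, publication: Publication) -> None:
    """Promotion only re-points the distribution; nothing is copied."""
    distribution.publication = publication


def deployed_content(distribution: Distribution) -> frozenset:
    """Follow distribution -> publication -> repo version to the content set."""
    return distribution.publication.version.content


# Hypothetical data: two versions of one repository, each published once.
v1 = RepositoryVersion("myrepo", 1, frozenset({"cowsay-3.03"}))
v2 = RepositoryVersion("myrepo", 2, frozenset({"cowsay-3.04"}))
pub1 = Publication(v1, "yum")
pub2 = Publication(v2, "yum")

production = Distribution("production", pub1)
testing = Distribution("testing", pub2)

# Testing on version 2 passed, so production moves to the same publication.
promote(production, pub2)
```

After the promote() call, deployed_content(production) returns the content
set of version 2, and no content was copied or re-published to get there.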

There are a few use cases for publishing an old version.

One is: I want to publish the same exact content set two different ways,
with two different publishers. If the contents change between publishes, I
want a guarantee that it won't cause the second publish to use different
content than the first.

Second: I like the state of the content in a repo as it is right now. I
want to publish that exact content set. If any changes happen to the
content in that repo between now and when my publish task gets run by a
worker, I don't want those changes to affect the publish I'm requesting
right now.

Third: I want the ability to roll back from a bad content set to a
known-good one. How many publications must I keep around to have confidence
that if I need to roll back some distance, that publication will still be
available? It's valuable to know I can re-publish an older version any time
I need it.

Fourth: In some cases you may decide after-the-fact that you need to
publish the same content set a different way. Maybe you went to kickstart
from a yum repo and then remembered that (this is a true story) one version
of your installer is too old to know about sha256 checksums, so you have to
go re-publish the same content set with different settings for how the
metadata gets generated.

Beyond those, just as reproducible builds are a very valuable trait of
software, reproducible publishes of repositories are valuable for similar
reasons.
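
To make the second and fourth use cases concrete, here is a minimal sketch
that assumes a hypothetical Repository class and publish() function (neither
is real Pulp code). The point is only that a publish pinned to a version
number gives the same content set no matter what happens to the repo
afterwards:

```python
class Repository:
    """Hypothetical repository that keeps an immutable snapshot per version."""

    def __init__(self):
        self.versions = [frozenset()]  # version 0 is empty

    @property
    def latest(self):
        return len(self.versions) - 1

    def new_version(self, add=(), remove=()):
        content = (self.versions[-1] | set(add)) - set(remove)
        self.versions.append(frozenset(content))
        return self.latest


def publish(repo, version_number, checksum_type="sha256"):
    """Publish one specific version; the result depends only on the arguments."""
    return {
        "version": version_number,
        "checksum_type": checksum_type,
        "units": sorted(repo.versions[version_number]),
    }


repo = Repository()
requested = repo.new_version(add={"a.rpm", "b.rpm"})

# The repo changes after the publish was requested but before a worker runs it.
repo.new_version(add={"c.rpm"}, remove={"a.rpm"})

first = publish(repo, requested)
# Later: re-publish the same content set with different metadata settings.
second = publish(repo, requested, checksum_type="sha1")
```

Both publishes contain exactly the units of the requested version, even
though the repository has moved on to a newer version in between.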

> As a user / content developer, it seems more useful to me to always
> publish the latest (i.e. don't have an optional version for publishing),
> but have the ability to copy from a specific version of a repo into another
> repo (or the same repo, effectively reverting the content of latest).
>
> So I would shift the discussion away from the REST API (for now), and more
> into the expected behavior for manipulating content within pulp. The
> operations I am aware of are: syncing units, importing units,
> copying/deleting units, and I am seeking clarification on how versioning
> will work for each.
>
> Syncing is probably the easiest, because it can handle all the changes
> internally and create a new version at the end.
>
> For importing, if you don't want to create unnecessary intermediate
> versions that are meaningless, I would want the ability to upload more than
> one unit and associate it to the repo, and then create a version. In other
> words, a transactional multi-upload.
>

Indeed. We want to have a behavior in Pulp 3 anyway that lets you
arbitrarily add and remove multiple content units in one operation. That's
one of the more notable missing features from Pulp 2. As Brian has pointed
out, one option is to let a user directly POST to a "versions" endpoint and
express what content they want to add/remove. Even without repo versions,
we'd still want an API that lets you bulk add/remove.

> For copying, as suggested above, I want to optionally specify the version.
>
> Deleting by itself is not hard, it does what it needs to do and then
> creates a version.
>
> The more complicated use case would be: what if I wanted to change the
> contents of repoA:
> * add 3 packages from repo1 version 1
> * add 4 packages from repo2 (latest)
> * delete 5 packages
>
> and at the end have a single version change for repoA.
>
> Or, for the same repoA:
> * delete all units of type "rpm" and name "glibc"
> * copy unit type "rpm" and name "glibc" from two versions ago
>
>
> If you wanted this use case, then you need a new resource type, somewhat
> similar to a Task, let's call it Transaction. It is tied to the repository
> it operates on (repoA in the example above), and locks it from further
> changes until the transaction is committed or aborted. It could be
> implemented internally as a repository. You start with the current contents
> of repoA, and you perform whatever operations you need to do (including
> changing repo metadata). When you "commit" the Transaction, it becomes
> *the* new version of the repository and unlocks repoA.
>

Yep, we're on the same page with the use case I think. The other option is
to let you as a user query for whatever content you care about adding and
removing; find it however you see fit. Then use the bulk add/remove feature
to carry that out in one operation.
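
A sketch of that query-then-bulk-update flow, using plain Python sets in
place of real queries. The `versions` dict, `latest()` and `bulk_update()`
are hypothetical stand-ins, and having additions win over removals when a
unit appears in both is my own assumption:

```python
# Hypothetical data: repository name -> list of immutable content sets,
# indexed by version number.
versions = {
    "repo1": [frozenset(), frozenset({"a-1", "b-1", "c-1", "x-1"})],
    "repo2": [frozenset(), frozenset({"d-1"}),
              frozenset({"d-2", "e-1", "f-1", "g-1"})],
    "repoA": [frozenset(), frozenset({"old-1", "old-2", "keep-1"})],
}


def latest(repo):
    return versions[repo][-1]


def bulk_update(repo, add=frozenset(), remove=frozenset()):
    """Apply all additions and removals at once, creating exactly one version."""
    # Removals are applied first, so a unit listed in both sets ends up present.
    new_content = (latest(repo) - remove) | add
    versions[repo].append(frozenset(new_content))
    return len(versions[repo]) - 1


# The caller finds content however they see fit...
to_add = {u for u in versions["repo1"][1] if u != "x-1"} | latest("repo2")
to_remove = {u for u in latest("repoA") if u.startswith("old-")}

# ...and then carries out the whole change in one operation.
new_number = bulk_update("repoA", add=to_add, remove=to_remove)
```

Three units from repo1 version 1, four from the latest repo2, and two
removals result in a single new version of repoA rather than nine.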

I do like the idea of persistently storing a Transaction as you call it,
and possibly even letting a user build one explicitly. Even just as an
implementation detail, any bulk add/remove endpoint may need to store the
requested changes temporarily in the database as a means to get the input
from the web handler to a celery worker. We probably don't want to stuff
10k+ content references into an AMQP message and pass them all in as an
argument to the task. And if we're going to store them in the DB, maybe it
would make sense to expose that to the user and let them create a
Transaction directly.
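
Roughly what I have in mind, sketched with an in-memory SQLite database
standing in for the real one. The table names, column names and both
functions are invented for illustration; nothing here is an existing Pulp
schema:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript(
    """
    CREATE TABLE txn (id INTEGER PRIMARY KEY, repository TEXT NOT NULL);
    CREATE TABLE txn_change (
        txn_id INTEGER NOT NULL REFERENCES txn(id),
        action TEXT NOT NULL CHECK (action IN ('add', 'remove')),
        content_id TEXT NOT NULL
    );
    """
)


def create_transaction(repository, add, remove):
    """Web handler side: persist the requested changes, return a small id."""
    cur = db.execute("INSERT INTO txn (repository) VALUES (?)", (repository,))
    txn_id = cur.lastrowid
    rows = [(txn_id, "add", c) for c in add]
    rows += [(txn_id, "remove", c) for c in remove]
    db.executemany("INSERT INTO txn_change VALUES (?, ?, ?)", rows)
    db.commit()
    return txn_id


def load_changes(txn_id):
    """Worker side: given only the id, read the changes back from the database."""
    add, remove = set(), set()
    for action, content_id in db.execute(
        "SELECT action, content_id FROM txn_change WHERE txn_id = ?", (txn_id,)
    ):
        (add if action == "add" else remove).add(content_id)
    return add, remove


# Only this integer would travel in the task message, not 10k+ references.
txn_id = create_transaction(
    "repoA",
    add=[f"unit-{i}" for i in range(10_000)],
    remove=["stale-1", "stale-2"],
)
add, remove = load_changes(txn_id)
```

Once the rows exist anyway, exposing them to the user as a Transaction
resource is mostly a matter of adding an API on top of the same tables.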

> Whether a Version is a full copy of the repo or a delta is an
> implementation detail. I would argue for full copy, otherwise you run into
> the inefficiencies of cvs which had to apply patches in reverse order just
> to get to a version in the past. I would find it more useful to have a repo
> diff resource (diff version 1 with version 3, or repo1 version 1 with repo2
> latest).
>

Agreed that it's an implementation detail. In the case of cvs and similar,
all changes had to be applied sequentially in order to construct a final
product. When you're only tracking set membership, querying becomes MUCH
simpler and is very efficient.
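
One possible way to track set membership, as a sketch only: each row
records the version in which a unit was added and, if applicable, the
version in which it was removed. The Membership class, its field names and
the package strings are assumptions for illustration, not a settled design:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Membership:
    content: str
    version_added: int
    version_removed: Optional[int] = None  # None means still present


rows = [
    Membership("glibc-2.25", version_added=1, version_removed=3),
    Membership("cowsay-3.03", version_added=1),
    Membership("glibc-2.26", version_added=3),
    Membership("bash-4.4", version_added=2),
]


def content_of(version):
    """One filter over the rows; no earlier version has to be rebuilt first."""
    return {
        r.content
        for r in rows
        if r.version_added <= version
        and (r.version_removed is None or r.version_removed > version)
    }


def diff(old, new):
    """Return (added, removed) between two versions."""
    a, b = content_of(old), content_of(new)
    return b - a, a - b
```

With rows like these, content_of(1) and content_of(3) are each a single
filter, and diff(1, 3) gives the kind of repo diff you describe without
replaying any intermediate changes.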

>
> Unfortunately, it is a rather large paradigm shift, and not one that you
> can push in a 3.0 -> 3.1 transition. Parts of it will need to land in 3.0
> proper, determining what can be left out is an exercise to the reader who
> managed to keep up with my long emails.
>
> Hey, a man can dream.
>

I'm dreaming with you! (and also likely putting people to sleep with my own
long emails) I also think this is a hallmark behavior that is important to
get right conceptually, and very important to a variety of stakeholders.

Thanks a lot for sharing your insight! If you have more thoughts on these
use cases, please keep them coming.