[Pulp-dev] repository versions update

Mon Dec 18 22:09:45 UTC 2017

tl;dr - @dkliban, @bmbouter, and I met and we propose adopting the second
proposal because it performs better and is more in line with how we think
users will use repository versions (i.e. in a linear fashion rather than a
tree/branching model). We've also updated the user stories to remove the
base_version features, and we're hoping to get @mhrivnak's PR merged this
week.

# Background

I ran some performance tests on the first proposal, which involves
storing a direct relationship between repository versions and content. The
results[0] show that for a small/medium-size system with 100M associations
between repository versions and content, it would take about a minute to
create a new repo version with 10,000 units in the database. 100M
associations also required a table size of at least 7GB and an index size
of 15GB.
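
For a rough sense of scale, the sizes above can be turned into a
per-association footprint. This is back-of-the-envelope arithmetic only: it
assumes decimal gigabytes (10^9 bytes) and exactly 100M rows, and the
function name is made up for illustration.

```python
def bytes_per_association(total_gb: float, associations: int) -> float:
    """Average bytes per association row, assuming 1 GB = 10**9 bytes."""
    return total_gb * 10**9 / associations


ASSOCIATIONS = 100_000_000  # 100M associations, as in the test results

table_bytes = bytes_per_association(7, ASSOCIATIONS)   # table: at least 7GB
index_bytes = bytes_per_association(15, ASSOCIATIONS)  # index: 15GB

print(f"table: ~{table_bytes:.0f} bytes per association")
print(f"index: ~{index_bytes:.0f} bytes per association")
```

Under those assumptions the index costs roughly twice as much per
association as the table itself.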

I don't think this is a dealbreaker in and of itself. We could probably
make some optimizations if we really wanted to adopt the first proposal
(e.g. use int keys instead of UUIDs, table partitioning, etc). I think it's
worth asking, though, what we want to optimize for, which brings me to the
next point.

# Linear vs Branching

A main consideration for us was how users would use Pulp 3. The second
proposal (in which additions/removals are stored) is strongest when a few
units are added to or removed from the latest repo version. This case
captures how a majority of users will create new versions in Pulp. It is
basically a linear model in which new versions are always based on the
previous version.

The first proposal better supports creating versions from a base_version
which may or may not be the latest version. This is a branching model
(like git) that offers more flexibility to our users, but we feel that
most of the time users would not be doing this when creating a new
version, and optimizing for a less frequently used case seems imprudent.

Therefore, we think it makes sense to adopt the second proposal and store
only additions/removals of content from a repository version. Also, we
think that the base_version feature (allowing users to make changes to an
older repo version) should not be a part of the MVP and maybe we can
consider it for 3.1+.
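
To make the additions/removals idea concrete, here is a minimal in-memory
sketch. It is not the model from the PR: the class names, the
version_added/version_removed fields, and the new_version() helper are all
assumptions made up for illustration. Each association row records the
version in which a unit was added and, if applicable, the version in which
it was removed, so the content of any version is a simple membership filter.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass
class Association:
    """One row linking a content unit to a repository (hypothetical schema)."""
    content_id: str
    version_added: int
    version_removed: Optional[int] = None  # None means "still present"


class Repository:
    """Linear version history that stores only additions and removals."""

    def __init__(self) -> None:
        self.latest_version = 0
        self.rows: List[Association] = []

    def content(self, number: int) -> Set[str]:
        """Return the content set of a version via a membership filter."""
        return {
            row.content_id
            for row in self.rows
            if row.version_added <= number
            and (row.version_removed is None or row.version_removed > number)
        }

    def new_version(self, add: Iterable[str] = (),
                    remove: Iterable[str] = ()) -> int:
        """Create the next version from the latest one, recording only changes."""
        number = self.latest_version + 1
        to_remove = set(remove)
        # Units that stay in the repository untouched need no new rows.
        kept = self.content(self.latest_version) - to_remove
        for row in self.rows:
            if row.version_removed is None and row.content_id in to_remove:
                row.version_removed = number
        for content_id in sorted(set(add) - kept):
            self.rows.append(Association(content_id, version_added=number))
        self.latest_version = number
        return number


repo = Repository()
v1 = repo.new_version(add={"a", "b", "c"})
v2 = repo.new_version(add={"d"}, remove={"a"})
print(sorted(repo.content(v1)))
print(sorted(repo.content(v2)))
print(len(repo.rows))
```

In this sketch the second version costs one new row and one updated row,
no matter how many units the repository already holds. A direct
version-to-content relationship would instead need one row per unit per
version.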

# Next Steps

We've updated the user stories in the MVP document to remove the
terminology around base_version[1]. We're going to break them up into
separate user stories under our Repo Version tracker[2] and add a few of
the basic ones around CRD repo versions to the sprint.

Also, we're going to work on accepting @mhrivnak's repo version PR[3]. I
think it's mostly ready, and just needs some re-review and ACKs.

# Feedback

If you have any thoughts, please respond. We're hoping to get the ball
rolling on repo versions ASAP. Thank you all for your help!

[0] https://github.com/daviddavis/pulp_repo_version_test#results
[1] https://pulp.plan.io/projects/pulp/wiki/Pulp_3_Minimum_Viable_Product/diff?utf8=%E2%9C%93&version=136&version_from=135&commit=View+differences
[2] https://pulp.plan.io/issues/3209
[3] https://github.com/pulp/pulp/pull/3228

David

On Sun, Dec 17, 2017 at 3:30 PM, Michael Hrivnak <mhrivnak at redhat.com>
wrote:

> I decided to rebase the PR onto latest 3.0-dev just so it doesn't get too
> stale, particularly since the un-nesting work had a substantial impact. I
> also updated the gist containing tests. Feel free to have a look.
>
> I also addressed all the feedback on the PR. I did not implement any new
> behavior, such as adding a boolean value to the version model, since it
> seems like discussions may not be complete about what to name it and how it
> should be used. That seems easy enough to implement as an additional change.
>
> On Mon, Dec 4, 2017 at 10:11 AM, Dennis Kliban <dkliban at redhat.com> wrote:
>
>> I am looking forward to discussing the use cases. I hope we can get
>> versioned repositories into 3.0. Thanks everyone for the discussion so far.
>>
>> -Dennis
>>
>> On Fri, Dec 1, 2017 at 5:16 PM, Brian Bouterse <bbouters at redhat.com>
>> wrote:
>>
>>> Thank you all for such great discussion!
>>>
>>> To recap some discussion we had today. We are going to look at the
>>> versioned repos use cases at an upcoming MVP call in the near future
>>> (probably 12/8). Look for the pulp-list announcement. If you have use cases
>>> you want to share, you can add them in red in the Versioned Repos section
>>> of the MVP here:
>>> https://pulp.plan.io/projects/pulp/wiki/Pulp_3_Minimum_Viable_Product/#Versioned-Repositories
>>>
>>> Once the use cases are known, we can look at the PR and see if it
>>> fulfills them. From the discussion today, the general consensus is that gap
>>> will be relatively small, which makes including it in Pulp3 feasible.
>>>
>>> @misa providing those types of features may be possible. Imagine an
>>> optional attribute on a repo version named 'frozen' that defaults to True.
>>> While the latest repo_version for a repo has frozen=False, any action that
>>> would normally create a new repo version (copy, add/remove, delete, etc)
>>> would act on the existing repo version and *not* create a new one. Then the
>>> user can update the frozen attribute of the repo version when they want,
>>> which commits the transaction as a repo version. I don't think this would
>>> be too hard to implement.
>>>
>>>
>>> On Thu, Nov 30, 2017 at 3:20 PM, Michael Hrivnak <mhrivnak at redhat.com>
>>> wrote:
>>>
>>>>
>>>>
>>>> On Thu, Nov 30, 2017 at 11:43 AM, Mihai Ibanescu <
>>>> mihai.ibanescu at gmail.com> wrote:
>>>>
>>>>> I am late to the thread, so I apologize if I repeat things that have
>>>>> been discussed already.
>>>>>
>>>>> Is it a meaningful use case to publish an older version of the repo?
>>>>> Once published, do you keep track of which version got published, and how
>>>>> do you decide which version to push next? This seems like a complication to
>>>>> me.
>>>>>
>>>>>
>>>> A publication will have a reference to the version that it was created
>>>> from. To illustrate how that would get used: Your CTO calls early on a
>>>> Saturday morning and says "I read in the news about a major security flaw
>>>> in cowsay, and I know our applications depend heavily on it. What version
>>>> do we have deployed right now???!!!" You can concretely determine which
>>>> publications are being currently "distributed" to your infrastructure, and
>>>> from there see their exact content sets by virtue of the repo version.
>>>>
>>>> Then there is the promotion workflow, which in Pulp 2 requires a lot of
>>>> copying and re-publishing. With repo versions, you'll have a sequence of
>>>> versions of course. Let's say there's 1, 2 and 3. Version 1 is deployed
>>>> now, version 2 is undergoing testing, and version 3 got created last night
>>>> by the weekly sync job you set up. You would have two different distributors
>>>> that make these publications available to clients: one for production, and
>>>> one for testing. "Promotion" becomes just the act of updating the reference
>>>> on a distribution to a different publication. When testing on version 2 is
>>>> done, assuming it passes, you can update the production distribution to
>>>> make it use version 2.
>>>>
>>>> There are a few use cases for publishing an old version.
>>>>
>>>> One is: I want to publish the same exact content set two different
>>>> ways, with two different publishers. If the contents change between
>>>> publishes, I want a guarantee that it won't cause the second publish to use
>>>> different content than the first.
>>>>
>>>> Second: I like the state of the content in a repo as it is right now. I
>>>> want to publish that exact content set. If any changes happen to the
>>>> content in that repo between now and when my publish task gets run by a
>>>> worker, I don't want those changes to affect the publish I'm requesting
>>>> right now.
>>>>
>>>> Third: I want the ability to roll back from a bad content set to a
>>>> known-good one. How many publications must I keep around to have confidence
>>>> that if I need to roll back some distance, that publication will still be
>>>> available? It's valuable to know I can re-publish an older version any time
>>>> I need it.
>>>>
>>>> Fourth: In some cases you may decide after-the-fact that you need to
>>>> publish the same content set a different way. Maybe you went to kickstart
>>>> from a yum repo and then remembered that (this is a true story) one version
>>>> of your installer is too old to know about sha256 checksums, so you have to
>>>> go re-publish the same content set with different settings for how the
>>>> metadata gets generated.
>>>>
>>>> Otherwise, just as reproducible builds of software are a very valuable
>>>> trait, reproducible publishes of repositories are valuable for similar
>>>> reasons.
>>>>
>>>>
>>>>
>>>>> As a user / content developer, it seems more useful to me to always
>>>>> publish the latest (i.e. don't have an optional version for publishing),
>>>>> but have the ability to copy from a specific version of a repo into another
>>>>> repo (or the same repo, effectively reverting the content of latest).
>>>>>
>>>>> So I would shift the discussion away from the REST API (for now), and
>>>>> more into the expected behavior for manipulating content within pulp. The
>>>>> operations I am aware of are: syncing units, importing units,
>>>>> copying/deleting units, and I am seeking clarification on how versioning
>>>>> will work for each.
>>>>>
>>>>> Syncing is probably the easiest, because it can handle all the changes
>>>>> internally and create a new version at the end.
>>>>>
>>>>> For importing, if you don't want to create unnecessary intermediate
>>>>> versions that are meaningless, I would want the ability to upload more than
>>>>> one unit and associate it to the repo, and then create a version. In other
>>>>> words, a transactional multi-upload.
>>>>>
>>>>
>>>> Indeed. We want to have a behavior in Pulp 3 anyway that lets you
>>>> arbitrarily add and remove multiple content units in one operation. That's
>>>> one of the more notable missing features from Pulp 2. As Brian has pointed
>>>> out, one option is to let a user directly POST to a "versions" endpoint and
>>>> express what content they want to add/remove. Even without repo versions,
>>>> we'd still want an API that lets you bulk add/remove.
>>>>
>>>>
>>>>> For copying, as suggested above, I want to optionally specify the
>>>>> version.
>>>>>
>>>>> Deleting by itself is not hard, it does what it needs to do and then
>>>>> creates a version.
>>>>>
>>>>> The more complicated use case would be: what if I wanted to change the
>>>>> contents of repoA:
>>>>> * add 3 packages from repo1 version 1
>>>>> * add 4 packages from repo2 (latest)
>>>>> * delete 5 packages
>>>>>
>>>>> and at the end have a single version change for repoA.
>>>>>
>>>>> Or, for the same repoA:
>>>>> * delete all units of type "rpm" and name "glibc"
>>>>> * copy unit type "rpm" and name "glibc" from two versions ago
>>>>>
>>>>>
>>>>> If you wanted this use case, then you need a new resource type,
>>>>> somewhat similar to a Task, let's call it Transaction. It is tied to the
>>>>> repository it operates on (repoA in the example above), and locks it from
>>>>> further changes until the transaction is committed or aborted. It could be
>>>>> implemented internally as a repository. You start with the current contents
>>>>> of repoA, and you perform whatever operations you need to do (including
>>>>> changing repo metadata). When you "commit" the Transaction, it becomes
>>>>> *the* new version of the repository and unlocks repoA.
>>>>>
>>>>
>>>> Yep, we're on the same page with the use case I think. The other option
>>>> is to let you as a user query for whatever content you care about adding
>>>> and removing; find it however you see fit. Then use the bulk add/remove
>>>> feature to carry that out in one operation.
>>>>
>>>> I do like the idea of persistently storing a Transaction as you call
>>>> it, and possibly even letting a user build one explicitly. Even just as an
>>>> implementation detail, any bulk add/remove endpoint may need to store the
>>>> requested changes temporarily in the database as a means to get the input
>>>> from the web handler to a celery worker. We probably don't want to stuff
>>>> 10k+ content references into an AMQP message and pass them all in as an
>>>> argument to the task. And if we're going to store them in the DB, maybe it
>>>> would make sense to expose that to the user and let them create a
>>>> Transaction directly.
>>>>
>>>>
>>>>> Whether a Version is a full copy of the repo or a delta is an
>>>>> implementation detail. I would argue for full copy, otherwise you run into
>>>>> the inefficiencies of cvs which had to apply patches in reverse order just
>>>>> to get to a version in the past. I would find it more useful to have a repo
>>>>> diff resource (diff version 1 with version 3, or repo1 version 1 with repo2
>>>>> latest).
>>>>>
>>>>
>>>> Agreed that it's an implementation detail. In the case of cvs and
>>>> similar, all changes had to be applied sequentially in order to construct a
>>>> final product. When you're only tracking set membership, querying becomes
>>>> MUCH simpler and is very efficient.
>>>>
>>>>
>>>>>
>>>>> Unfortunately, it is a rather large paradigm shift, and not one that
>>>>> you can push in a 3.0 -> 3.1 transition. Parts of it will need to land in
>>>>> 3.0 proper, determining what can be left out is an exercise to the reader
>>>>> who managed to keep up with my long emails.
>>>>>
>>>>> Hey, a man can dream.
>>>>>
>>>>
>>>> I'm dreaming with you! (and also likely putting people to sleep with my
>>>> own long emails) I also think this is a hallmark behavior that is important
>>>> to get right conceptually, and very important to a variety of stakeholders.
>>>>
>>>> Thanks a lot for sharing your insight! If you have more thoughts on
>>>> these use cases, please keep it coming.
>>>>
>>>> _______________________________________________
>>>> Pulp-dev mailing list
>>>> Pulp-dev at redhat.com
>>>> https://www.redhat.com/mailman/listinfo/pulp-dev
>>>>
>>>>
>>>
>>>
>>>
>>
>
>
> --
>
> Michael Hrivnak
>
> Principal Software Engineer, RHCE
>
> Red Hat
>
>
>