[Pulp-dev] repository versions update

Tue Dec 19 14:50:53 UTC 2017

Thanks for sending this concise summary of where we are on this.

I think the format is great and very helpful!

On Mon, Dec 18, 2017 at 5:09 PM, David Davis <daviddavis at redhat.com> wrote:
> tl;dr - @dkliban, @bmbouter, and I met and we propose adopting the second
> proposal because it has better performance and is more in line with how we
> think users will use repository versions (i.e. in a linear fashion rather
> than a tree/branching model). We've also updated the user stories to remove
> the base_version features and we're hoping to get @mhrivnak's PR merged this
> week.
>
> # Background
>
> I ran through some performance tests on the first proposal which involved
> storing a direct relationship between repository versions and content. The
> results[0] show that for a small/medium-size system with 100M associations
> between repository versions and content, it would take about a minute to
> create a new repo version with 10,000 units in the database. 100M
> associations also required a table size of at least 7GB and an index size of
> 15GB.
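For a rough sense of scale, the figures above can be turned into per-row costs. This is back-of-the-envelope arithmetic only, assuming 1 GB means 10**9 bytes and taking the quoted sizes as exact (the table size was given as "at least" 7GB, so the real numbers may be higher). The constant names are made up for illustration.

```python
# Figures quoted above; GB is assumed to mean 10**9 bytes.
ASSOCIATIONS = 100_000_000
TABLE_BYTES = 7 * 10**9
INDEX_BYTES = 15 * 10**9

# Average storage cost of one (repo version, content unit) association.
table_bytes_per_row = TABLE_BYTES / ASSOCIATIONS
index_bytes_per_row = INDEX_BYTES / ASSOCIATIONS
total_bytes_per_row = table_bytes_per_row + index_bytes_per_row

print(table_bytes_per_row)  # 70.0
print(index_bytes_per_row)  # 150.0
print(total_bytes_per_row)  # 220.0
```

Under those assumptions the indexes cost roughly twice as much per association as the rows themselves.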
>
> I don't think this is a dealbreaker in and of itself. It's possible we could
> do some optimizations if we really want to adopt the first proposal (e.g.
> use int keys instead of UUIDs, table partitioning, etc). I think it's worth
> asking though what we want to optimize for which brings me to the next
> point.
>
> # Linear vs Branching
>
> A main consideration for us was how users would use Pulp 3. The strength of
> the second proposal (in which additions/removals are stored) is when a few
> units are added/removed to the latest repo version. This case captures how a
> majority of users will create new versions in Pulp. This is basically a
> linear sort of model in which new versions are always based off the previous
> version.
>
> The first proposal better supports creating versions from a base_version
> which may or may not be the latest version. This is a branching sort of model
> (like git) that offers more flexibility to our users but we feel like a
> majority of the time, users would not be doing this when creating a new
> version. And optimizing for a less frequently used use case is imprudent.
>
> Therefore, we think it makes sense to adopt the second proposal and store
> only additions/removals of content from a repository version. Also, we think
> that the base_version feature (allowing users to make changes to an older
> repo version) should not be a part of the MVP and maybe we can consider it
> for 3.1+.
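To make the trade-off concrete, here is a minimal sketch that compares how many rows each approach would store for a linear history. It is an illustration under stated assumptions, not the schema from the PR: the `Delta` class and the helper names are hypothetical, content units are plain integers, and "rows" simply counts set members.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Delta:
    """Changes that turn one repo version into the next (hypothetical model)."""
    added: frozenset = frozenset()
    removed: frozenset = frozenset()


def content_at(deltas, number):
    """Return the content set of version `number` (1-based) by replaying deltas."""
    content = set()
    for delta in deltas[:number]:
        content -= delta.removed
        content |= delta.added
    return content


def rows_stored_as_deltas(deltas):
    """Rows needed when only additions/removals are recorded."""
    return sum(len(d.added) + len(d.removed) for d in deltas)


def rows_stored_as_full_copies(deltas):
    """Rows needed when every version links to every unit it contains."""
    return sum(len(content_at(deltas, n)) for n in range(1, len(deltas) + 1))


# Version 1 syncs 10,000 units; versions 2 and 3 each change a handful.
history = [
    Delta(added=frozenset(range(10_000))),
    Delta(added=frozenset({10_000, 10_001}), removed=frozenset({0})),
    Delta(added=frozenset({10_002})),
]

print(len(content_at(history, 3)))          # 10002
print(rows_stored_as_deltas(history))       # 10004
print(rows_stored_as_full_copies(history))  # 30003
```

In this toy history the delta approach grows with the size of each change, while the direct-association approach grows with the size of the whole repository on every new version.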
>
> # Next Steps
>
> We've updated the user stories in the MVP document to remove the terminology
> around base_version[1]. We're going to break them up into separate user
> stories under our Repo Version tracker[2] and add a few of the basic ones
> around CRD repo versions to the sprint.
>
> Also, we're going to work on accepting @mhrivnak's repo version PR[3]. I
> think it's mostly ready, and just needs some re-review and ACKs.
>
> # Feedback
>
> If you have any thoughts, please respond. We're hoping to get the ball
> rolling on repo versions ASAP. Thank you all for your help!
>
> [0] https://github.com/daviddavis/pulp_repo_version_test#results
> [1]
> https://pulp.plan.io/projects/pulp/wiki/Pulp_3_Minimum_Viable_Product/diff?utf8=%E2%9C%93&version=136&version_from=135&commit=View+differences
> [2] https://pulp.plan.io/issues/3209
> [3] https://github.com/pulp/pulp/pull/3228
>
>
> David
>
> On Sun, Dec 17, 2017 at 3:30 PM, Michael Hrivnak <mhrivnak at redhat.com>
> wrote:
>>
>> I decided to rebase the PR onto latest 3.0-dev just so it doesn't get too
>> stale, particularly since the un-nesting work had a substantial impact. I
>> also updated the gist containing tests. Feel free to have a look.
>>
>> I also addressed all the feedback on the PR. I did not implement any new
>> behavior, such as adding a boolean value to the version model, since it
>> seems like discussions may not be complete about what to name it and how it
>> should be used. That seems easy enough to implement as an additional change.
>>
>> On Mon, Dec 4, 2017 at 10:11 AM, Dennis Kliban <dkliban at redhat.com> wrote:
>>>
>>> I am looking forward to discussing the use cases. I hope we can get
>>> versioned repositories into 3.0. Thanks everyone for the discussion so far.
>>>
>>> -Dennis
>>>
>>> On Fri, Dec 1, 2017 at 5:16 PM, Brian Bouterse <bbouters at redhat.com>
>>> wrote:
>>>>
>>>> Thank you all for such great discussion!
>>>>
>>>> To recap some discussion we had today: we are going to look at the
>>>> versioned repos use cases at an upcoming MVP call in the near future
>>>> (probably 12/8). Look for the pulp-list announcement. If you have use cases
>>>> you want to share, you can add them in red in the Versioned Repos section of
>>>> the MVP here:
>>>> https://pulp.plan.io/projects/pulp/wiki/Pulp_3_Minimum_Viable_Product/#Versioned-Repositories
>>>>
>>>> Once the use cases are known, we can look at the PR and see if it
>>>> fulfills them. From the discussion today, the general consensus is that the gap
>>>> will be relatively small, which makes including it in Pulp3 feasible.
>>>>
>>>> @misa providing those types of features may be possible. Imagine an
>>>> optional attribute on a repo version named 'frozen' that defaults to True.
>>>> While the latest repo_version for a repo has frozen=False, any action that
>>>> would normally create a new repo version (copy, add/remove, delete, etc)
>>>> would act on the existing repo version and *not* create a new one. Then the
>>>> user can update the frozen attribute of the repo version when they want,
>>>> which commits the transaction as a repo version. I don't think this would be
>>>> too hard to implement.
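The 'frozen' idea above could look roughly like the following. This is a sketch only, assuming an in-memory model: the `RepoVersion` and `Repository` classes and the `change` and `new_version` methods are hypothetical names, not anything from the PR.

```python
class RepoVersion:
    def __init__(self, number, content, frozen=True):
        self.number = number
        self.content = set(content)  # copy, so versions never share state
        self.frozen = frozen


class Repository:
    def __init__(self):
        self.versions = []

    @property
    def latest(self):
        return self.versions[-1] if self.versions else None

    def new_version(self, frozen=True):
        """Start a new version from the latest content."""
        base = self.latest.content if self.latest else set()
        version = RepoVersion(len(self.versions) + 1, base, frozen=frozen)
        self.versions.append(version)
        return version

    def change(self, add=(), remove=()):
        """Apply a change; only create a new version if the latest is frozen."""
        target = self.latest
        if target is None or target.frozen:
            target = self.new_version()  # default behaviour: one version per change
        target.content -= set(remove)
        target.content |= set(add)
        return target


repo = Repository()
repo.change(add={"a"})                 # creates version 1 (frozen)
repo.new_version(frozen=False)         # version 2, left open for batching
repo.change(add={"b"})                 # acts on version 2
repo.change(add={"c"}, remove={"a"})   # still version 2
repo.latest.frozen = True              # "commit" the batch
repo.change(add={"d"})                 # creates version 3

print([(v.number, sorted(v.content)) for v in repo.versions])
# [(1, ['a']), (2, ['b', 'c']), (3, ['b', 'c', 'd'])]
```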
>>>>
>>>>
>>>> On Thu, Nov 30, 2017 at 3:20 PM, Michael Hrivnak <mhrivnak at redhat.com>
>>>> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Nov 30, 2017 at 11:43 AM, Mihai Ibanescu
>>>>> <mihai.ibanescu at gmail.com> wrote:
>>>>>>
>>>>>> I am late to the thread, so I apologize if I repeat things that have
>>>>>> been discussed already.
>>>>>>
>>>>>> Is it a meaningful use case to publish an older version of the repo?
>>>>>> Once published, do you keep track of which version got published, and how do
>>>>>> you decide which version to push next? This seems like a complication to me.
>>>>>>
>>>>>
>>>>> A publication will have a reference to the version that it was created
>>>>> from. To illustrate how that would get used: Your CTO calls early on a
>>>>> Saturday morning and says "I read in the news about a major security flaw in
>>>>> cowsay, and I know our applications depend heavily on it. What version do we
>>>>> have deployed right now???!!!" You can concretely determine which
>>>>> publications are being currently "distributed" to your infrastructure, and
>>>>> from there see their exact content sets by virtue of the repo version.
>>>>>
>>>>> Then there is the promotion workflow, which in Pulp 2 requires a lot of
>>>>> copying and re-publishing. With repo versions, you'll have a sequence of
>>>>> versions of course. Let's say there's 1, 2 and 3. Version 1 is deployed now,
>>>>> version 2 is undergoing testing, and version 3 got created last night by the
>>>>> weekly sync job you set up. You would have two different distributors that
>>>>> make these publications available to clients: one for production, and one
>>>>> for testing. "Promotion" becomes just the act of updating the reference on a
>>>>> distribution to a different publication. When testing on version 2 is done,
>>>>> assuming it passes, you can update the production distribution to make it
>>>>> use version 2.
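The promotion workflow described above can be sketched as a pointer update. The `Publication` and `Distribution` classes and the `promote` helper below are hypothetical stand-ins for illustration, not Pulp's actual models or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Publication:
    """A publication remembers which repo version it was created from."""
    repo_version: int


@dataclass
class Distribution:
    """Makes one publication available to clients."""
    name: str
    publication: Publication


def promote(distribution, publication):
    """Promotion is just re-pointing the distribution; nothing is copied."""
    distribution.publication = publication


pub_v1 = Publication(repo_version=1)
pub_v2 = Publication(repo_version=2)

production = Distribution("production", pub_v1)
testing = Distribution("testing", pub_v2)

# Testing on version 2 passed, so production starts serving it.
promote(production, testing.publication)

# "What version do we have deployed right now?"
print(production.publication.repo_version)  # 2
```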
>>>>>
>>>>> There are a few use cases for publishing an old version.
>>>>>
>>>>> One is: I want to publish the same exact content set two different
>>>>> ways, with two different publishers. If the contents change between
>>>>> publishes, I want a guarantee that it won't cause the second publish to use
>>>>> different content than the first.
>>>>>
>>>>> Second: I like the state of the content in a repo as it is right now. I
>>>>> want to publish that exact content set. If any changes happen to the content
>>>>> in that repo between now and when my publish task gets run by a worker, I
>>>>> don't want those changes to affect the publish I'm requesting right now.
>>>>>
>>>>> Third: I want the ability to roll back from a bad content set to a
>>>>> known-good one. How many publications must I keep around to have confidence
>>>>> that if I need to roll back some distance, that publication will still be
>>>>> available? It's valuable to know I can re-publish an older version any time
>>>>> I need it.
>>>>>
>>>>> Fourth: In some cases you may decide after-the-fact that you need to
>>>>> publish the same content set a different way. Maybe you went to kickstart
>>>>> from a yum repo and then remembered that (this is a true story) one version
>>>>> of your installer is too old to know about sha256 checksums, so you have to
>>>>> go re-publish the same content set with different settings for how the
>>>>> metadata gets generated.
>>>>>
>>>>> Otherwise, just as reproducible builds of software are a very valuable
>>>>> trait, reproducible publishes of repositories are valuable for similar
>>>>> reasons.
>>>>>
>>>>>
>>>>>>
>>>>>> As a user / content developer, it seems more useful to me to always
>>>>>> publish the latest (i.e. don't have an optional version for publishing), but
>>>>>> have the ability to copy from a specific version of a repo into another repo
>>>>>> (or the same repo, effectively reverting the content of latest).
>>>>>>
>>>>>> So I would shift the discussion away from the REST API (for now), and
>>>>>> more into the expected behavior for manipulating content within pulp. The
>>>>>> operations I am aware of are: syncing units, importing units,
>>>>>> copying/deleting units, and I am seeking clarification on how versioning
>>>>>> will work for each.
>>>>>>
>>>>>> Syncing is probably the easiest, because it can handle all the changes
>>>>>> internally and create a new version at the end.
>>>>>>
>>>>>> For importing, if you don't want to create unnecessary intermediate
>>>>>> versions that are meaningless, I would want the ability to upload more than
>>>>>> one unit and associate it to the repo, and then create a version. In other
>>>>>> words, a transactional multi-upload.
>>>>>
>>>>>
>>>>> Indeed. We want to have a behavior in Pulp 3 anyway that lets you
>>>>> arbitrarily add and remove multiple content units in one operation. That's
>>>>> one of the more notable missing features from Pulp 2. As Brian has pointed
>>>>> out, one option is to let a user directly POST to a "versions" endpoint and
>>>>> express what content they want to add/remove. Even without repo versions,
>>>>> we'd still want an API that lets you bulk add/remove.
>>>>>
>>>>>>
>>>>>> For copying, as suggested above, I want to optionally specify the
>>>>>> version.
>>>>>>
>>>>>> Deleting by itself is not hard, it does what it needs to do and then
>>>>>> creates a version.
>>>>>>
>>>>>> The more complicated use case would be: what if I wanted to change the
>>>>>> contents of repoA:
>>>>>> * add 3 packages from repo1 version 1
>>>>>> * add 4 packages from repo2 (latest)
>>>>>> * delete 5 packages
>>>>>>
>>>>>> and at the end have a single version change for repoA.
>>>>>>
>>>>>> Or, for the same repoA:
>>>>>> * delete all units of type "rpm" and name "glibc"
>>>>>> * copy unit type "rpm" and name "glibc" from two versions ago
>>>>>>
>>>>>>
>>>>>> If you wanted this use case, then you need a new resource type,
>>>>>> somewhat similar to a Task, let's call it Transaction. It is tied to the
>>>>>> repository it operates on (repoA in the example above), and locks it from
>>>>>> further changes until the transaction is committed or aborted. It could be
>>>>>> implemented internally as a repository. You start with the current contents
>>>>>> of repoA, and you perform whatever operations you need to do (including
>>>>>> changing repo metadata). When you "commit" the Transaction, it becomes *the*
>>>>>> new version of the repository and unlocks repoA.
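A minimal sketch of the Transaction idea, using the first repoA example above. Everything here is an assumption made for illustration: the `Repo` and `Transaction` classes, the simple boolean lock, and the package names are all hypothetical, and `versions[0]` stands for version 1.

```python
class RepoLockedError(Exception):
    pass


class Repo:
    def __init__(self, name, content=()):
        self.name = name
        self.versions = [frozenset(content)]  # versions[0] is version 1
        self.locked = False

    @property
    def latest(self):
        return self.versions[-1]


class Transaction:
    def __init__(self, repo):
        if repo.locked:
            raise RepoLockedError(repo.name)
        repo.locked = True               # block other changes until we finish
        self.repo = repo
        self.staged = set(repo.latest)   # start from the current contents

    def add(self, units):
        self.staged |= set(units)

    def remove(self, units):
        self.staged -= set(units)

    def commit(self):
        """Staged content becomes the new version; returns its number."""
        self.repo.versions.append(frozenset(self.staged))
        self.repo.locked = False
        return len(self.repo.versions)

    def abort(self):
        """Discard staged changes and unlock the repo."""
        self.repo.locked = False


repo1 = Repo("repo1", {"p1", "p2", "p3"})
repo2 = Repo("repo2", {"q1", "q2", "q3", "q4"})
repoA = Repo("repoA", {"old1", "old2", "old3", "old4", "old5", "keep"})

txn = Transaction(repoA)
txn.add(repo1.versions[0])   # 3 packages from repo1 version 1
txn.add(repo2.latest)        # 4 packages from repo2 (latest)
txn.remove({"old1", "old2", "old3", "old4", "old5"})  # delete 5 packages
txn.commit()

print(len(repoA.versions))   # 2
print(sorted(repoA.latest))  # ['keep', 'p1', 'p2', 'p3', 'q1', 'q2', 'q3', 'q4']
```

Three separate operations produce exactly one new version of repoA, and an aborted transaction would leave the version list untouched.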
>>>>>
>>>>>
>>>>> Yep, we're on the same page with the use case I think. The other option
>>>>> is to let you as a user query for whatever content you care about adding and
>>>>> removing; find it however you see fit. Then use the bulk add/remove feature
>>>>> to carry that out in one operation.
>>>>>
>>>>> I do like the idea of persistently storing a Transaction as you call
>>>>> it, and possibly even letting a user build one explicitly. Even just as an
>>>>> implementation detail, any bulk add/remove endpoint may need to store the
>>>>> requested changes temporarily in the database as a means to get the input
>>>>> from the web handler to a celery worker. We probably don't want to stuff
>>>>> 10k+ content references into an AMQP message and pass them all in as an
>>>>> argument to the task. And if we're going to store them in the DB, maybe it
>>>>> would make sense to expose that to the user and let them create a
>>>>> Transaction directly.
>>>>>
>>>>>>
>>>>>> Whether a Version is a full copy of the repo or a delta is an
>>>>>> implementation detail. I would argue for full copy, otherwise you run into
>>>>>> the inefficiencies of cvs which had to apply patches in reverse order just
>>>>>> to get to a version in the past. I would find it more useful to have a repo
>>>>>> diff resource (diff version 1 with version 3, or repo1 version 1 with repo2
>>>>>> latest).
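If each version can be resolved to a set of content units, the repo diff resource suggested above reduces to two set differences. A small sketch, where `diff_versions` is a hypothetical helper and the package names are invented examples:

```python
def diff_versions(old, new):
    """Return what was added and removed going from `old` to `new`."""
    old, new = set(old), set(new)
    return {"added": new - old, "removed": old - new}


version_1 = {"cowsay-3.03", "glibc-2.17"}
version_3 = {"cowsay-3.03", "glibc-2.26", "bash-4.4"}

diff = diff_versions(version_1, version_3)
print(sorted(diff["added"]))    # ['bash-4.4', 'glibc-2.26']
print(sorted(diff["removed"]))  # ['glibc-2.17']
```

The same helper works across repositories (repo1 version 1 against repo2 latest), since it only needs the two content sets.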
>>>>>
>>>>>
>>>>> Agreed that it's an implementation detail. In the case of cvs and
>>>>> similar, all changes had to be applied sequentially in order to construct a
>>>>> final product. When you're only tracking set membership, querying becomes
>>>>> MUCH simpler and is very efficient.
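One way to read "only tracking set membership" is that each unit's membership in a repo carries the version range in which it was present, so any version can be answered with a single filter and no replay. The sketch below assumes that reading; the `Membership` class and its `version_added` / `version_removed` fields are hypothetical names, and the package names are invented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Membership:
    """One unit's presence in a repo, as a half-open version range."""
    unit: str
    version_added: int
    version_removed: Optional[int] = None  # None means "still present"


def content_at(memberships, number):
    """Units present in version `number`, found with one pass and no replay."""
    return {
        m.unit
        for m in memberships
        if m.version_added <= number
        and (m.version_removed is None or number < m.version_removed)
    }


rows = [
    Membership("cowsay-3.03", version_added=1),
    Membership("glibc-2.17", version_added=1, version_removed=3),
    Membership("glibc-2.26", version_added=3),
]

print(sorted(content_at(rows, 2)))  # ['cowsay-3.03', 'glibc-2.17']
print(sorted(content_at(rows, 3)))  # ['cowsay-3.03', 'glibc-2.26']
```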
>>>>>
>>>>>>
>>>>>>
>>>>>> Unfortunately, it is a rather large paradigm shift, and not one that
>>>>>> you can push in a 3.0 -> 3.1 transition. Parts of it will need to land in
>>>>>> 3.0 proper; determining what can be left out is an exercise for the reader
>>>>>> who managed to keep up with my long emails.
>>>>>>
>>>>>> Hey, a man can dream.
>>>>>
>>>>>
>>>>> I'm dreaming with you! (and also likely putting people to sleep with my
>>>>> own long emails) I also think this is a hallmark behavior that is important
>>>>> to get right conceptually, and very important to a variety of stakeholders.
>>>>>
>>>>> Thanks a lot for sharing your insight! If you have more thoughts on
>>>>> these use cases, please keep it coming.
>>>>>
>>>>> _______________________________________________
>>>>> Pulp-dev mailing list
>>>>> Pulp-dev at redhat.com
>>>>> https://www.redhat.com/mailman/listinfo/pulp-dev
>>>>>
>>>>
>>>>
>>>>
>>>
>>
>>
>>
>> --
>>
>> Michael Hrivnak
>>
>> Principal Software Engineer, RHCE
>>
>> Red Hat
>>
>>
>>
>
>
>