[Pulp-dev] Fwd: Re: Changesets Challenges

Brian Bouterse bbouters at redhat.com
Thu Apr 19 18:50:20 UTC 2018


Jeff and I met and put together two pieces of work that would create a
declarative interface for plugin writers to use. This would be used instead
of the Changeset interface by plugin writers. Whether or not to
continue including the ChangeSet in the plugin API is still being discussed.

There seemed to be interest in offering an interface like this, so on Monday
we will put together a PR so that we can see what it looks like and how
hard it would be to switch. Please look at these stories in the hope that we
can groom them and put them on the sprint.

* https://pulp.plan.io/issues/3570
* https://pulp.plan.io/issues/3582

Our plan is to start on ^ on Monday, so if there are questions, ideas, or
concerns, let us know. Once we have something to share, we'll email back to
this thread. Feel free to comment on the issues directly as well.

Thanks,
Brian & Jeff


On Mon, Apr 16, 2018 at 3:10 PM, Dennis Kliban <dkliban at redhat.com> wrote:

> On Mon, Apr 16, 2018 at 2:13 PM, Dennis Kliban <dkliban at redhat.com> wrote:
>
>> On Mon, Apr 16, 2018 at 12:21 PM, Jeff Ortel <jortel at redhat.com> wrote:
>>
>>> Thanks for the proposal, Brian.  I also commented on the issue.
>>>
>>> On 04/16/2018 09:41 AM, Brian Bouterse wrote:
>>>
>>> I wrote up a description of the opportunity I see here [0]. I put a high
>>> level pro/con analysis below. I would like feedback on (a) whether this
>>> adequately addresses the problem statements, (b) whether there are
>>> alternatives, and (c) whether this improves the plugin writer's experience
>>> enough to adopt it.
>>>
>>> pros:
>>> * significantly less plugin code to write. Compare the Thing example
>>> code versus the current docs.
>>>
>>> +1
>>>
>>> * Higher performing, with metadata downloading and parsing being included
>>> in stream processing. This causes syncs for pulp_ansible to start 6+ minutes
>>> earlier.
>>>
>>>
>>> This could also be done currently with the ChangeSet as-is.
>>>
>>>
>>> cons:
>>> * Progress reporting doesn't know how many things it's processing (it's
>>> a stream). So users would see progress as "X things completed", not "X of
>>> Y things completed". Y can't be known until just before the stream
>>> processing completes; otherwise it's not stream processing.
>>>
>>>
>>> I'm not a fan of the SizedIterator either.
>>> I contemplated this when designing the ChangeSet.  An alternative I
>>> considered was to report progress like OSTree does.  It reports progress by
>>> periodically updating the expected TOTAL.  It's better than nothing.
>>>
>>>
>> What if we allow plugin writers to optionally provide a total number when
>> instantiating the ChangeSet? I bet there will be cases where the number of
>> items in the repository version will be known without having to fully parse
>> all the metadata. In these cases the progress reporting could be more
>> informative.
>>
>>
>
> Here is another idea for progress reporting for stream processing: have
> ChangeSet create a separate progress report for downloads. The total could
> be dynamically updated as downloads are scheduled. The completed count can
> be updated after each successful download.
>
> Any limitations in progress reporting are outweighed by the efficiency
> gained by having plugins always use stream processing. Just imagine not
> having to wait for the RPM plugin to finish "processing metadata" to start
> downloading content.
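To make the dynamically-updated-total idea concrete, here is a minimal sketch. The `ProgressReport` class below is a hypothetical stand-in, not the actual Pulp API: the total grows as downloads are scheduled and the completed count grows as they finish, so progress stays meaningful even in stream processing.

```python
# Hypothetical sketch of the "dynamically updated total" idea discussed
# above. ProgressReport is an illustrative stand-in, not Pulp's real API.

class ProgressReport:
    """Progress whose total grows as work is discovered (stream-friendly)."""

    def __init__(self, message):
        self.message = message
        self.total = 0
        self.done = 0

    def increment_total(self, n=1):
        # Called as each download is scheduled.
        self.total += n

    def increment_done(self, n=1):
        # Called as each download completes.
        self.done += n

    def __str__(self):
        return f"{self.message}: {self.done} of {self.total} completed"


report = ProgressReport("Downloading artifacts")
for _scheduled in range(5):
    report.increment_total()
for _finished in range(3):
    report.increment_done()
print(report)  # Downloading artifacts: 3 of 5 completed
```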
>
>
>>
>>>
>>> [0]: https://pulp.plan.io/issues/3570
>>>
>>> Thanks!
>>> Brian
>>>
>>>
>>>
>>> On Thu, Apr 12, 2018 at 7:12 PM, Jeff Ortel <jortel at redhat.com> wrote:
>>>
>>>>
>>>>
>>>> On 04/12/2018 04:00 PM, Brian Bouterse wrote:
>>>>
>>>>
>>>> On Thu, Apr 12, 2018 at 11:53 AM, Jeff Ortel <jortel at redhat.com> wrote:
>>>>
>>>>>
>>>>>
>>>>> On 04/12/2018 10:01 AM, Brian Bouterse wrote:
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Apr 11, 2018 at 6:07 PM, Jeff Ortel <jortel at redhat.com> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> On 04/11/2018 03:29 PM, Brian Bouterse wrote:
>>>>>>
>>>>>> I think we should look into this in the near term. Changing an
>>>>>> interface on an object used by all plugins will be significantly easier
>>>>>> the earlier we do it.
>>>>>>
>>>>>>
>>>>>> On Wed, Apr 11, 2018 at 12:25 PM, Jeff Ortel <jortel at redhat.com>
>>>>>> wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 04/11/2018 10:59 AM, Brian Bouterse wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Apr 10, 2018 at 10:43 AM, Jeff Ortel <jortel at redhat.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> On 04/06/2018 09:15 AM, Brian Bouterse wrote:
>>>>>>>>
>>>>>>>> Several plugins have started using the Changesets including
>>>>>>>> pulp_ansible, pulp_python, pulp_file, and perhaps others. The Changesets
>>>>>>>> provide several distinct points of value which are great, but there are two
>>>>>>>> challenges I want to bring up. I want to focus only on the problem
>>>>>>>> statements first.
>>>>>>>>
>>>>>>>> 1. There is redundant "differencing" code in all plugins. The
>>>>>>>> Changeset interface requires the plugin writer to determine what units need
>>>>>>>> to be added and those to be removed. This requires all plugin writers to
>>>>>>>> write the same non-trivial differencing code over and over. For example,
>>>>>>>> you can see the same non-trivial differencing code present in
>>>>>>>> pulp_ansible
>>>>>>>> <https://github.com/pulp/pulp_ansible/blob/d0eb9d125f9a6cdc82e2807bcad38749967a1245/pulp_ansible/app/tasks/synchronizing.py#L217-L306>,
>>>>>>>> pulp_file
>>>>>>>> <https://github.com/pulp/pulp_file/blob/30afa7cce667b57d8fe66d5fc1fe87fd77029210/pulp_file/app/tasks/synchronizing.py#L114-L193>,
>>>>>>>> and pulp_python
>>>>>>>> <https://github.com/pulp/pulp_python/blob/066d33990e64b5781c8419b96acaf2acf1982324/pulp_python/app/tasks/sync.py#L172-L223>.
>>>>>>>> Line-wise, this "differencing" code makes up a large portion (maybe 50%) of
>>>>>>>> the sync code itself in each plugin.
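For readers without the linked plugin code handy, the duplicated "differencing" logic has roughly this shape. This is a simplified, self-contained sketch; `Delta`, `find_delta`, and the key tuples are illustrative, not the plugins' actual code, which builds keys from each content type's natural-key fields.

```python
from collections import namedtuple

# Simplified sketch of the differencing code each plugin repeats.
Delta = namedtuple("Delta", ("additions", "removals"))

def find_delta(remote_keys, local_keys, mirror=True):
    """Compute which content keys to add and (optionally) remove."""
    additions = set(remote_keys) - set(local_keys)
    removals = set(local_keys) - set(remote_keys) if mirror else set()
    return Delta(additions=additions, removals=removals)

delta = find_delta(
    remote_keys={("pkg-a", "1.0"), ("pkg-b", "2.0")},
    local_keys={("pkg-b", "2.0"), ("pkg-c", "3.0")},
)
# delta.additions == {("pkg-a", "1.0")}
# delta.removals == {("pkg-c", "3.0")}
```

In the real plugins this is followed by code that builds PendingContent/PendingArtifact objects for the additions, which is where most of the extra line count comes from.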
>>>>>>>>
>>>>>>>>
>>>>>>>> Ten lines of trivial set logic hardly seems like a big deal, but any
>>>>>>>> duplication is worth exploring.
>>>>>>>>
>>>>>>> It's more than ten lines. Take pulp_ansible for example. By my count
>>>>>>> (the linked to section) it's 89 lines, which out of 306 lines of plugin
>>>>>>> code for sync is 29% of extra redundant code. The other plugins have
>>>>>>> similar numbers. So with those numbers in mind, what do you think?
>>>>>>>
>>>>>>>
>>>>>>> I was counting the lines (w/o comments) in find_delta() based on the
>>>>>>> linked code.  Which functions are you counting?
>>>>>>>
>>>>>>
>>>>>> I was counting the find_delta, build_additions, and build_removals
>>>>>> methods. Regardless of how the lines are counted, that differencing code is
>>>>>> the duplication I'm talking about. There isn't a way to use the changesets
>>>>>> without duplicating that differencing code in a plugin.
>>>>>>
>>>>>>
>>>>>> The differencing code is limited to find_delta() and perhaps
>>>>>> build_removals().  Agreed, the line count is less useful than specifically
>>>>>> identifying duplicate code.  Outside of find_delta(), I see similar code
>>>>>> (in part because it got copied from file plugin) but not seeing actual
>>>>>> duplication.  Can you be more specific?
>>>>>>
>>>>>
>>>>> Very similar code or identical code, I think it raises the question: why
>>>>> are we having plugin writers do this at all? What value are they creating
>>>>> with it? I don't have a reasonable answer to that question, so the
>>>>> requirement for plugin writers to write that code brings me back to the
>>>>> problem statement: "plugin writers have redundant differencing code when
>>>>> using Changesets". More info on why it is valuable for the plugin writer to
>>>>> do the differencing code versus the Changesets would be helpful.
>>>>>
>>>>>
>>>>> The ChangeSet abstraction (and API) is based on the following division of
>>>>> responsibility:
>>>>>
>>>>> The plugin (with an understanding of the remote and its content):
>>>>>   - Download metadata.
>>>>>   - Parse metadata.
>>>>>   - Based on the metadata:
>>>>>     - determine content to be added to the repository.
>>>>>       - define how artifacts are downloaded.
>>>>>       - construct content.
>>>>>     - determine content to be removed from the repository.
>>>>>
>>>>> Core (without an understanding of the specific remote or its content):
>>>>>   - Provide a low-level API for the plugin to effect the changes it has
>>>>> determined need to be made to the repository.  This is downloaders, models,
>>>>> etc.
>>>>>   - Provide a high(er)-level API for the plugin to effect the changes it has
>>>>> determined need to be made to the repository.  This is the ChangeSet.
>>>>>
>>>>> Are you proposing that this is not the correct division?
>>>>>
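For illustration, the division of responsibility described above might look like this in code. Everything here is a hypothetical stand-in (the `ChangeSet` stub, `sync`, and the string content keys), sketching only the shape of the plugin/core split rather than the actual Pulp API.

```python
# Illustrative sketch of the plugin/core division of responsibility.

class ChangeSet:
    """Core: applies additions/removals without knowing content types."""

    def __init__(self, additions=(), removals=()):
        self.additions = additions
        self.removals = removals

    def apply(self):
        # Real core code would download artifacts, create content, and
        # update the repository version; here we just report the plan.
        for content in self.additions:
            yield ("added", content)
        for content in self.removals:
            yield ("removed", content)


def sync(remote_metadata, local_content):
    # Plugin: understands the remote, parses metadata, decides the delta,
    # then hands the delta to core to apply.
    additions = [c for c in remote_metadata if c not in local_content]
    removals = [c for c in local_content if c not in remote_metadata]
    changeset = ChangeSet(additions=additions, removals=removals)
    return list(changeset.apply())


report = sync(remote_metadata=["a", "b"], local_content=["b", "c"])
# report == [("added", "a"), ("removed", "c")]
```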
>>>>
>>>> Yes, I believe these problem statements suggest we should adjust the
>>>> plugin writer's responsibilities when interacting with the Changesets in
>>>> two specific ways. It's not exactly the language you used, but I believe
>>>> the following two responsibilities could be moved into the Changesets
>>>> entirely:
>>>>
>>>> - determining if any given Artifact or Content unit is already present
>>>> in Pulp (aka computing what needs to be added)
>>>>
>>>>
>>>> Did you mean *added* to the repository or *created* in pulp?
>>>> Currently, the plugin determines the content that needs to be added to the
>>>> repository.  This is modeled using a PendingContent which fully defines the
>>>> Content (unit) and its PendingArtifact(s), which are included in the
>>>> *additions*. The ChangeSet does determine whether or not any artifacts
>>>> need to be downloaded (and downloads them based on policy) and determines
>>>> which Content needs to be *created* vs simply added to the
>>>> repository.  The plugin blindly assumes that none of the *pending*
>>>> content has yet been created in pulp.  This accomplishes two things: 1) it
>>>> reduces complexity and decision making by the plugin, and 2) it provides the
>>>> ChangeSet with all the information needed to *create* and *download* as needed.
>>>> The *additions* represents what the plugin wants added to the
>>>> repository to synchronize it with the remote repository.
>>>>
>>>> - determining which content units need to be removed (aka computing the
>>>> removals)
>>>>
>>>>
>>>> I don't see how the ChangeSet has enough information to do this.  The
>>>> plugin can (most likely will) make the decision about what to remove based
>>>> on remote metadata, policy and configuration.
>>>>
>>>>
>>>> ^ These goals restate the problem statement: plugin writers
>>>> are asked to do differencing calculations that the Changesets could provide
>>>> to the plugin writer instead.
>>>>
>>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>> So a shorter, simpler problem statement is: "to use the changesets
>>>>>> plugin writers have to do extra work to compute additions and removals
>>>>>> parameters".
>>>>>>
>>>>>>
>>>>>> This statement ^ is better but still too vague to actually solve.
>>>>>> Can we elaborate on specifically what "to do extra work" means?
>>>>>>
>>>>>
>>>>> Sure. Removing that vague language is one way to resolve its
>>>>> vagueness. Here's a revised problem statement: "to use the changesets
>>>>> plugin writers have to compute additions and removals parameters". This
>>>>> problem statement would be resolved by a solution that causes the plugin
>>>>> writer to never have to produce these parameters and be replaced by an
>>>>> interface that would require less effort from a plugin writer.
>>>>>
>>>>>
>>>>> I think it's the plugin's responsibility to determine the difference.
>>>>> Aside from that: without an understanding of the metadata and content type,
>>>>> how could the ChangeSet do this?  What might that look like?
>>>>>
>>>>
>>>> If I'm understanding this correctly, the Changesets already do this for
>>>> additions right? Help check my understanding. If a plugin writer delivers
>>>> PendingContent and PendingArtifacts to the Changesets as 'additions', the
>>>> Changesets will recognize them as already downloaded and not download them
>>>> right? If this is the case, what is the benefit of having plugin writers
>>>> also try to figure out if things should be downloaded?
>>>>
>>>>
>>>> As you pointed out, the plugin writer does not need to figure out what
>>>> needs to be downloaded.
>>>>
>>>>
>>>>
>>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> 2. Plugins can't do end-to-end stream processing. The Changesets
>>>>>>>> themselves do stream processing, but when you call into
>>>>>>>> changeset.apply_and_drain() you have to have fully parsed the metadata
>>>>>>>> already. Currently when fetching all metadata from Galaxy, pulp_ansible
>>>>>>>> takes about 380 seconds (6+ min). This means that the actual Changeset
>>>>>>>> content downloading starts 380 seconds later than it could. At the heart of
>>>>>>>> the problem, the fetching+parsing of the metadata is not part of the stream
>>>>>>>> processing.
>>>>>>>>
>>>>>>>>
>>>>>>>> The additions/removals can be any iterable (like a generator), and by
>>>>>>>> using ChangeSet.apply() and iterating the returned object, the plugin can
>>>>>>>> "turn the crank" while downloading and processing the metadata.  The
>>>>>>>> ChangeSet.apply_and_drain() is just a convenience method.  I don't see how
>>>>>>>> this is a limitation of the ChangeSet.
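The "turn the crank" pattern being described can be sketched as a generator that yields pending units as each metadata page arrives, so downstream processing never waits for the full metadata fetch. In this sketch, `fetch_metadata_pages` and the string unit names are hypothetical stand-ins; in a real plugin the generator would yield PendingContent objects built from downloaded metadata.

```python
# Sketch of additions as a generator: units are yielded while metadata
# is still being fetched and parsed, enabling end-to-end streaming.

def fetch_metadata_pages():
    # Stand-in for paginated metadata downloads (e.g. from Galaxy).
    yield [{"name": "role-1"}, {"name": "role-2"}]
    yield [{"name": "role-3"}]

def pending_additions():
    """Yield one pending unit at a time as each metadata page arrives."""
    for page in fetch_metadata_pages():
        for record in page:
            yield record["name"]  # would be a PendingContent in practice

# A ChangeSet consuming this lazily could start downloading role-1's
# artifacts before the page containing role-3 has even been fetched.
streamed = list(pending_additions())
# streamed == ["role-1", "role-2", "role-3"]
```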
>>>>>>>>
>>>>>>>
>>>>>>> That is new info for me (and maybe everyone). OK so Changesets have
>>>>>>> two interfaces. apply() and apply_and_drain(). Why do we have two
>>>>>>> interfaces when apply() can support all existing use cases (that I know of)
>>>>>>> and do end-to-end stream processing but apply_and_drain() cannot? I see all
>>>>>>> of our examples (and all of our new plugins) using apply_and_drain().
>>>>>>>
>>>>>>>
>>>>>>> The ChangeSet.apply() was how I designed (and documented) it.  Not
>>>>>>> sure when/who added the apply_and_drain().  +1 for removing it.
>>>>>>>
>>>>>>
>>>>>> I read through the changeset docs. I think this stream processing
>>>>>> thing is still a problem, but perhaps in how we're presenting the Changeset
>>>>>> with its arguments. I don't think apply() versus apply_and_drain() is at
>>>>>> all related. Regardless of whether you are using apply() or apply_and_drain(),
>>>>>> the Changeset requires 'additions' and 'removals' arguments. This sends
>>>>>> a clear message to the plugin writer that they need to compute additions
>>>>>> and removals. They will fetch the metadata to compute these, which is mostly
>>>>>> how the changeset documentation reads. That they could provide a
>>>>>> generator that would correctly stream the metadata from inside the Changeset
>>>>>> is, I feel, non-obvious. I want the high-performing implementation to be
>>>>>> the obvious one.
>>>>>>
>>>>>> So what about a problem statement like this: "Changesets are
>>>>>> presented such that when you call into them you should already have fetched
>>>>>> the metadata"?
>>>>>>
>>>>>>
>>>>>> I'm not sure what is meant by "presented".  If this means that we
>>>>>> should provide an example of how the ChangeSet can be used by plugins (with
>>>>>> large metadata) in such a way that does not require downloading all the
>>>>>> metadata first - that sounds like a good idea.
>>>>>>
>>>>>
>>>>> Cool so this is transitioning to ideas for resolution. The solution to
>>>>> add documentation on how to do this with the existing interface is one
>>>>> option. My concern with adding additional docs on how to use the current
>>>>> interface better is that if users choose to follow the existing docs then
>>>>> they will have the stream processing problem once again. To me, this
>>>>> suggests that this new example should actually replace the existing
>>>>> documentation.
>>>>>
>>>>>
>>>>> Seems like both examples would be useful.  I'm not convinced that all
>>>>> plugins would benefit from this.  For example: the File plugin manifest is
>>>>> small and would likely not benefit from the extra complexity.  For
>>>>> complicated plugins (like RPM), can differencing decisions be made before
>>>>> analyzing the entire metadata (e.g. primary.xml)?  Also, it's not clear to
>>>>> me how this would work using the Downloader.  Are you suggesting that the
>>>>> plugin would parse/process metadata files while they're being downloaded?
>>>>> Perhaps a better understanding of the flow to be supported would help me
>>>>> understand this.
>>>>>
>>>>
>>>> Yes, I am suggesting just that: the Changesets could facilitate
>>>> parsing/processing metadata files while the actual content named in those
>>>> files is also being downloaded. I have a straightforward idea on how to achieve
>>>> this. It's short and easy enough to write up (no code), but I want to make
>>>> sure I'm not moving beyond the problem statement without others. Is there
>>>> more we want to do on these problem statements, or would answering a bit
>>>> about one way it could work be helpful?
>>>>
>>>>
>>>> The *additions* can be (and usually is) a generator.  The generator
>>>> can yield based on metadata as it is downloaded and digested.  In this way,
>>>> the ChangeSet already facilitates this.
>>>>
>>>>
>>>> Just to state my expectations: I don't consider moving beyond the problem
>>>> statement to be a commitment to solve it, just an agreement on what
>>>> we're solving as we discuss various resolutions. Problem statements can
>>>> also always be revisited. Either way forward is fine w/ me; just let me
>>>> know how we should continue.
>>>>
>>>>
>>>> So far, I'm not convinced that any specific problems/deficiencies have
>>>> been identified.  That said, you seem to have a different abstraction in
>>>> mind. I would be interested in reviewing it and how it would be used by
>>>> plugin writers.  It may help illustrate the gains that you are envisioning.
>>>>
>>>>
>>>>
>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Do you see the same challenges I do? Are these the right problem
>>>>>>>> statements? I think with clear problem statements a solution will be easy
>>>>>>>> to see and agree on.
>>>>>>>>
>>>>>>>>
>>>>>>>> I'm not convinced that these are actual problems/challenges that
>>>>>>>> need to be addressed in the near term.
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>> Brian
>>>>>>>>
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Pulp-dev mailing list
>>>>>>>> Pulp-dev at redhat.com
>>>>>>>> https://www.redhat.com/mailman/listinfo/pulp-dev
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>>
>>
>
>
>
