[Pulp-dev] Fwd: Re: Changesets Challenges

Thu Apr 19 19:35:24 UTC 2018

Both of those tickets look good to me. I can clearly see how the proposed
changes will simplify the plugin writing experience.

Thank you both for putting together this plan.

On Thu, Apr 19, 2018 at 2:50 PM, Brian Bouterse <bbouters at redhat.com> wrote:

> Jeff and I met and we put together two pieces of work which would create a
> declarative interface for a plugin writer to use. This would be used in
> stead of the Changeset interface by plugin writers. Whether or not to
> continue including the ChangeSet in the plugin API is still being discussed.
>
> There seemed to be interest in offering an interface like this so on
> Monday we will put together a PR so that we can see what it looks like and
> how hard it would be to switch. Look at these stories in the hopes that we
> can groom them and put them on the sprint.
>
> * https://pulp.plan.io/issues/3570
> * https://pulp.plan.io/issues/3582
>
> Our plan is to start on ^ on Monday so if there are questions, ideas, or
> concerns let us know. Once we have something to share, we'll email back to
> this thread. Feel free to comment on the issues directly also.
>
> Thanks,
> Brian & Jeff
>
>
> On Mon, Apr 16, 2018 at 3:10 PM, Dennis Kliban <dkliban at redhat.com> wrote:
>
>> On Mon, Apr 16, 2018 at 2:13 PM, Dennis Kliban <dkliban at redhat.com>
>> wrote:
>>
>>> On Mon, Apr 16, 2018 at 12:21 PM, Jeff Ortel <jortel at redhat.com> wrote:
>>>
>>>> Thanks for the proposal, Brian.  I also commented on the issue.
>>>>
>>>> On 04/16/2018 09:41 AM, Brian Bouterse wrote:
>>>>
>>>> I wrote up a description of the opportunity I see here [0]. I put a
>>>> high level pro/con analysis below. I would like feedback on (a) if this
>>>> adequately addresses the problem statements, (b) if there are alternatives,
>>>> and (c) does this improve the plugin wrtier's experience enough to adopt
>>>> this?
>>>>
>>>> pros:
>>>> * significantly less plugin code to write. Compare the Thing example
>>>> code versus the current docs.
>>>>
>>>> +1
>>>>
>>>> * Higher performing with metadata downloading and parsing being
>>>> included in stream processing. This causes sync's for pulp_ansible to start
>>>> 6+ min earlier.
>>>>
>>>>
>>>> This could also be done currently with the ChangeSet as-is.
>>>>
>>>>
>>>> cons:
>>>> * Progress reporting doesn't know how many things it's processing (it's
>>>> a stream). So user's would see progress as "X things completed", not "X of
>>>> Y things completed". Y can't be known until just before the stream
>>>> processing completes otherwise it's not stream processing.
>>>>
>>>>
>>>> I'm not a fan of the SizedIterator either.
>>>> I contemplated this when designing the ChangeSet.  An alternative I
>>>> considered was to report progress like OSTree does.  It reports progress by
>>>> periodically updating the expected TOTAL.  It's better than nothing.
>>>>
>>>>
>>> What if we allow plugin writers to optionally provide a total number
>>> when instantiating the ChangeSet? I bet there will be cases where the
>>> number of items in the repository version will be known without having to
>>> fully parse all the metadata. In these cases the progress reporting could
>>> be more informative.
>>>
>>>
>>
>> Here is another idea for progress reporting for stream processing: have
>> ChangeSet create a separate progress report for downloads. The total could
>> by dynamically updated as downloads are scheduled. The complete count can
>> be updated after each successful download.
>>
>> Any limitations in progress reporting are outweighed by the efficiency
>> gained by having plugins always use stream processing. Just imagine not
>> having to wait for the RPM plugin to finish "processing metadata" to start
>> downloading content.
>>
>>
>>>
>>>>
>>>> [0]: https://pulp.plan.io/issues/3570
>>>>
>>>> Thanks!
>>>> Brian
>>>>
>>>>
>>>>
>>>> On Thu, Apr 12, 2018 at 7:12 PM, Jeff Ortel <jortel at redhat.com> wrote:
>>>>
>>>>>
>>>>>
>>>>> On 04/12/2018 04:00 PM, Brian Bouterse wrote:
>>>>>
>>>>>
>>>>> On Thu, Apr 12, 2018 at 11:53 AM, Jeff Ortel <jortel at redhat.com>
>>>>> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> On 04/12/2018 10:01 AM, Brian Bouterse wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Apr 11, 2018 at 6:07 PM, Jeff Ortel <jortel at redhat.com>
>>>>>> wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 04/11/2018 03:29 PM, Brian Bouterse wrote:
>>>>>>>
>>>>>>> I think we should look into this in the near-term. Changing an
>>>>>>> interface on an object used by all plugins will be significantly easier,
>>>>>>> earlier.
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Apr 11, 2018 at 12:25 PM, Jeff Ortel <jortel at redhat.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 04/11/2018 10:59 AM, Brian Bouterse wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Apr 10, 2018 at 10:43 AM, Jeff Ortel <jortel at redhat.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 04/06/2018 09:15 AM, Brian Bouterse wrote:
>>>>>>>>>
>>>>>>>>> Several plugins have started using the Changesets including
>>>>>>>>> pulp_ansible, pulp_python, pulp_file, and perhaps others. The Changesets
>>>>>>>>> provide several distinct points of value which are great, but there are two
>>>>>>>>> challenges I want to bring up. I want to focus only on the problem
>>>>>>>>> statements first.
>>>>>>>>>
>>>>>>>>> 1. There is redundant "differencing" code in all plugins. The
>>>>>>>>> Changeset interface requires the plugin writer to determine what units need
>>>>>>>>> to be added and those to be removed. This requires all plugin writers to
>>>>>>>>> write the same non-trivial differencing code over and over. For example,
>>>>>>>>> you can see the same non-trivial differencing code present in
>>>>>>>>> pulp_ansible
>>>>>>>>> <https://github.com/pulp/pulp_ansible/blob/d0eb9d125f9a6cdc82e2807bcad38749967a1245/pulp_ansible/app/tasks/synchronizing.py#L217-L306>,
>>>>>>>>> pulp_file
>>>>>>>>> <https://github.com/pulp/pulp_file/blob/30afa7cce667b57d8fe66d5fc1fe87fd77029210/pulp_file/app/tasks/synchronizing.py#L114-L193>,
>>>>>>>>> and pulp_python
>>>>>>>>> <https://github.com/pulp/pulp_python/blob/066d33990e64b5781c8419b96acaf2acf1982324/pulp_python/app/tasks/sync.py#L172-L223>.
>>>>>>>>> Line-wise, this "differencing" code makes up a large portion (maybe 50%) of
>>>>>>>>> the sync code itself in each plugin.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Ten lines of trivial set logic hardly seems like a big deal but
>>>>>>>>> any duplication is worth exploring.
>>>>>>>>>
>>>>>>>> It's more than ten lines. Take pulp_ansible for example. By my
>>>>>>>> count (the linked to section) it's 89 lines, which out of 306 lines of
>>>>>>>> plugin code for sync is 29% of extra redundant code. The other plugins have
>>>>>>>> similar numbers. So with those numbers in mind, what do you think?
>>>>>>>>
>>>>>>>>
>>>>>>>> I was counting the lines (w/o comments) in find_delta() based on
>>>>>>>> the linked code.  Which functions are you counting?
>>>>>>>>
>>>>>>>
>>>>>>> I was counting the find_delta, build_additions, and build_removals
>>>>>>> methods. Regardless of how the lines are counted, that differencing code is
>>>>>>> the duplication I'm talking about. There isn't a way to use the changesets
>>>>>>> without duplicating that differencing code in a plugin.
>>>>>>>
>>>>>>>
>>>>>>> The differencing code is limited to find_delta() and perhaps
>>>>>>> build_removals().  Agreed, the line count is less useful than specifically
>>>>>>> identifying duplicate code.  Outside of find_delta(), I see similar code
>>>>>>> (in part because it got copied from file plugin) but not seeing actual
>>>>>>> duplication.  Can you be more specific?
>>>>>>>
>>>>>>
>>>>>> Very similar code or identical code, I think it begs the question why
>>>>>> are we having plugin writer's do this at all? What value are they creating
>>>>>> with it? I don't have a reasonable answer to that question, so the
>>>>>> requirement for plugin writer's to write that code brings me back to the
>>>>>> problem statement: "plugin writers have redundant differencing code when
>>>>>> using Changesets". More info on why it is valuable for the plugin writer to
>>>>>> do the differencing code versus the Changesets would be helpful.
>>>>>>
>>>>>>
>>>>>> The ChangeSet abstraction (and API) is based on following division of
>>>>>> responsibility:
>>>>>>
>>>>>> The plugin  (with an understanding of the remote and its content):
>>>>>>   - Download metadata.
>>>>>>   - Parse metadata
>>>>>>   - Based on the metadata:
>>>>>>     - determine content to be added to the repository.
>>>>>>       - define how artifacts are downloaded.
>>>>>>       - construct content
>>>>>>     - determine content to be removed to the repository.
>>>>>>
>>>>>> Core (without understand of specific remote or its content):
>>>>>>   - Provide low level API for plugin to affect the changes it has
>>>>>> determined need to be made to the repository.  This is downloaders, models
>>>>>> etc.
>>>>>>   - Provide high(er) level API for plugin to affect the changes it
>>>>>> has determined need to be made to the repository.  This is the ChangeSet.
>>>>>>
>>>>>> Are you proposing that this is not the correct division?
>>>>>>
>>>>>
>>>>> Yes I believe these problem statements suggest we should adjust the
>>>>> plugin writer's responsibilities when interacting with the Changesets in
>>>>> two specific ways. It's not exactly the language you used, but I believe
>>>>> the following two responsibilities could be moved into the Changesets
>>>>> entirely:
>>>>>
>>>>> - determining if any given Artifact or Content unit is already present
>>>>> in Pulp (aka computing what needs tobe added)
>>>>>
>>>>>
>>>>> Did you mean *added* to the repository or *created* in pulp.
>>>>> Currently, the plugin determines the content that needs to be added to the
>>>>> repository.  This is modeled using a PendingContent which fully defines the
>>>>> Content (unit) and its PendingArtifact(s) which are included in the
>>>>> *additions*. The ChangeSet does determine whether or not any
>>>>> artifacts need to be downloaded (and downloads them based on policy) and
>>>>> determines which Content needs to be *created* vs simply added to the
>>>>> repository.  The plugin blindly assumes that none of the *pending*
>>>>> content has yet been created pulp.  This accomplishes 2 things.  1) reduces
>>>>> complexity and decision making by the plugin.  2) provides the ChangeSet
>>>>> with all the information needed to *create* and *download* as
>>>>> needed.  The *additions* represents what the plugin wants to be added
>>>>> to the repository to synchronize it with the remote repository.
>>>>>
>>>>> - determining which content units need to be removed (aka computing
>>>>> the removals)
>>>>>
>>>>>
>>>>> I don't see how the ChangeSet has enough information to do this.  The
>>>>> plugin can (most likely will) make the decision about what to remove based
>>>>> on remote metadata, policy and configuration.
>>>>>
>>>>>
>>>>> ^ goals are a restating of the problem statement that plugin writers
>>>>> are asked to do differencing calculations when the Changesets could provide
>>>>> that to the plugin writer instead.
>>>>>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> So a shorter, simpler problem statement is: "to use the changesets
>>>>>>> plugin writers have to do extra work to compute additions and removals
>>>>>>> parameters".
>>>>>>>
>>>>>>>
>>>>>>> This statement ^ is better but still too vague to actually solve.
>>>>>>> Can we elaborate on specifically what "to do extra work" means?
>>>>>>>
>>>>>>
>>>>>> Sure. Removing that vague language is one way to resolve its
>>>>>> vagueness. Here's a revised problem statement: "to use the
>>>>>> changesets plugin writers have to compute additions and removals
>>>>>> parameters". This problem statement would be resolved by a solution that
>>>>>> causes the plugin writer to never have to produce these parameters and be
>>>>>> replaced by an interface that would require less effort from a plugin
>>>>>> writer.
>>>>>>
>>>>>>
>>>>>> I think it's the plugin's responsibility to determine the
>>>>>> difference.  Aside from that: without an understanding of the metadata and
>>>>>> content type, how could the ChangeSet do this?  What might that looks like?
>>>>>>
>>>>>
>>>>> If I'm understanding this correctly, the Changesets already do this
>>>>> for additions right? Help check my understanding. If a plugin writer
>>>>> delivers PendingContent and PendingArtifacts to the Changesets as
>>>>> 'additions', the Changesets will recognize them as already downloaded and
>>>>> not download them right? If this is the case, what is the benefit of having
>>>>> plugin writers also try to figure out if things should be downloaded?
>>>>>
>>>>>
>>>>> As you pointed out, the plugin writer does not need to figure out what
>>>>> needs to be downloaded.
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> 2. Plugins can't do end-to-end stream processing. The Changesets
>>>>>>>>> themselves do stream processing, but when you call into
>>>>>>>>> changeset.apply_and_drain() you have to have fully parsed the metadata
>>>>>>>>> already. Currently when fetching all metadata from Galaxy, pulp_ansible
>>>>>>>>> takes about 380 seconds (6+ min). This means that the actual Changeset
>>>>>>>>> content downloading starts 380 seconds later than it could. At the heart of
>>>>>>>>> the problem, the fetching+parsing of the metadata is not part of the stream
>>>>>>>>> processing.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> The additions/removals can be any interable (like generator) and
>>>>>>>>> by using ChangeSet.apply() and iterating the returned object, the pluign
>>>>>>>>> can "turn the crank" while downloading and processing the metadata.  The
>>>>>>>>> ChangeSet.apply_and_drain() is just a convenience method.  I don't see how
>>>>>>>>> this is a limitation of the ChangeSet.
>>>>>>>>>
>>>>>>>>
>>>>>>>> That is new info for me (and maybe everyone). OK so Changesets have
>>>>>>>> two interfaces. apply() and apply_and_drain(). Why do we have two
>>>>>>>> interfaces when apply() can support all existing use cases (that I know of)
>>>>>>>> and do end-to-end stream processing but apply_and_drain() cannot? I see all
>>>>>>>> of our examples (and all of our new plugins) using apply_and_drain().
>>>>>>>>
>>>>>>>>
>>>>>>>> The ChangeSet.apply() was how I designed (and documented) it.  Not
>>>>>>>> sure when/who added the apply_and_drain().  +1 for removing it.
>>>>>>>>
>>>>>>>
>>>>>>> I read through the changeset docs. I think this stream processing
>>>>>>> thing is still a problem but perhaps in how we're presenting the Changeset
>>>>>>> with it's arguments. I don't think apply() versus apply_and_drain() are at
>>>>>>> all related. Regardless of if you are using apply() or apply_and_drain(),
>>>>>>> the Changeset requires an 'additions' and 'removals' arguments. This sends
>>>>>>> a clear message to the plugin writer that they need to compute additions
>>>>>>> and removals. They will fetch the metadata to compute these which is mostly
>>>>>>> how the changeset documentation reads. To know that they could present a
>>>>>>> generator that would correctly allow the metdata from inside the Changeset
>>>>>>> is I feel as non-obvious. I want the high-performing implementation to be
>>>>>>> the obvious one.
>>>>>>>
>>>>>>> So what about a problem statement like this: "Changesets are
>>>>>>> presented such that when you call into them you should already have fetched
>>>>>>> the metadata"?
>>>>>>>
>>>>>>>
>>>>>>> I'm not sure what is meant by "presented".  If this means that we
>>>>>>> should provide an example of how the ChangeSet can be used by plugins (with
>>>>>>> large metadata) in such a way that does not require downloading all the
>>>>>>> metadata first - that sounds like a good idea.
>>>>>>>
>>>>>>
>>>>>> Cool so this is transitioning to ideas for resolution. The solution
>>>>>> to add documentation on how to do this with the existing interface is one
>>>>>> option. My concern with adding additional docs on how to use the current
>>>>>> interface better is that if users choose to follow the existing docs then
>>>>>> they will have the stream processing problem once again. To me, this
>>>>>> suggests that this new example should actually replace the existing
>>>>>> documentation.
>>>>>>
>>>>>>
>>>>>> Seems like both example would be useful.  I'm not convinced that all
>>>>>> plugins would benefit from this.  For example: the File plugin manifest is
>>>>>> small and would likely not benefit from the extra complexity.  For
>>>>>> complicated plugins (like RPM), can differencing decision be made before
>>>>>> analyzing the entire metadata (eg: primary.xml)?  Also, it's not clear to
>>>>>> me how this would work using the Downloader.  Are you suggesting that the
>>>>>> plugin would parse/process metadata files while they're being downloaded?
>>>>>> Perhaps a better understanding of the flow to be supported would help me
>>>>>> understand this.
>>>>>>
>>>>>
>>>>> Yes I am suggesting just that: that the Changesets could facilitate
>>>>> parse/processing metadata files while actual content named in those files
>>>>> is also being downloaded. I have a straightforward idea on how to achieve
>>>>> this. It's short and easy enough to write up (no code), but I want to make
>>>>> sure I'm not moving beyond the problem statement without others. Is there
>>>>> more we want to do on these problem statements, or would answering a bit
>>>>> about one way it could work be helpful?
>>>>>
>>>>>
>>>>> The *additions* can be (and usually is) a generator.  The generator
>>>>> can yield based on metadata as it is downloaded and digested.  In this way,
>>>>> the ChangeSet already facilitates this.
>>>>>
>>>>>
>>>>> Just to state my expectations: Moving beyond the problem statement I
>>>>> don't consider to be a commitment to solve it; just an agreement on what
>>>>> we're solving as we discuss various resolutions. Problem statements can
>>>>> also always be revisited. Either way forward is fine w/ me, just let me
>>>>> know how we should continue
>>>>>
>>>>>
>>>>> So far, I'm not convinced that any specific problems/deficiencies have
>>>>> been identified.  That said, you seem to have a different abstraction in
>>>>> mind. I would be interested in reviewing it and how it would be used by
>>>>> plugin writers.  It may help illustrate the gains that you are envisioning.
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Do you see the same challenges I do? Are these the right problem
>>>>>>>>> statements? I think with clear problem statements a solution will be easy
>>>>>>>>> to see and agree on.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I'm not convinced that these are actual problems/challenges that
>>>>>>>>> need to be addressed in the near term.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks!
>>>>>>>>> Brian
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> Pulp-dev mailing listPulp-dev at redhat.comhttps://www.redhat.com/mailman/listinfo/pulp-dev
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> Pulp-dev mailing list
>>>>>>>>> Pulp-dev at redhat.com
>>>>>>>>> https://www.redhat.com/mailman/listinfo/pulp-dev
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> Pulp-dev mailing list
>>>>>> Pulp-dev at redhat.com
>>>>>> https://www.redhat.com/mailman/listinfo/pulp-dev
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Pulp-dev mailing list
>>>>> Pulp-dev at redhat.com
>>>>> https://www.redhat.com/mailman/listinfo/pulp-dev
>>>>>
>>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> Pulp-dev mailing list
>>>> Pulp-dev at redhat.com
>>>> https://www.redhat.com/mailman/listinfo/pulp-dev
>>>>
>>>>
>>>
>>
>> _______________________________________________
>> Pulp-dev mailing list
>> Pulp-dev at redhat.com
>> https://www.redhat.com/mailman/listinfo/pulp-dev
>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/pulp-dev/attachments/20180419/81cfd7fc/attachment.htm>