[Pulp-dev] Fwd: Re: Changesets Challenges

Jeff Ortel jortel at redhat.com
Mon Apr 16 16:21:27 UTC 2018

Thanks for the proposal, Brian.  I also commented on the issue.

On 04/16/2018 09:41 AM, Brian Bouterse wrote:
> I wrote up a description of the opportunity I see here [0]. I put a 
> high level pro/con analysis below. I would like feedback on (a) if 
> this adequately addresses the problem statements, (b) if there are 
> alternatives, and (c) does this improve the plugin wrtier's experience 
> enough to adopt this?
> pros:
> * significantly less plugin code to write. Compare the Thing example 
> code versus the current docs.

> * Higher performing with metadata downloading and parsing being 
> included in stream processing. This causes sync's for pulp_ansible to 
> start 6+ min earlier.

This could also be done currently with the ChangeSet as-is.

> cons:
> * Progress reporting doesn't know how many things it's processing 
> (it's a stream). So user's would see progress as "X things completed", 
> not "X of Y things completed". Y can't be known until just before the 
> stream processing completes otherwise it's not stream processing.

I'm not a fan of the SizedIterator either.
I contemplated this when designing the ChangeSet.  An alternative I 
considered was to report progress like OSTree does.  It reports progress 
by periodically updating the expected TOTAL.  It's better than nothing.

> [0]: https://pulp.plan.io/issues/3570
> Thanks!
> Brian
> On Thu, Apr 12, 2018 at 7:12 PM, Jeff Ortel <jortel at redhat.com 
> <mailto:jortel at redhat.com>> wrote:
>     On 04/12/2018 04:00 PM, Brian Bouterse wrote:
>>     On Thu, Apr 12, 2018 at 11:53 AM, Jeff Ortel <jortel at redhat.com
>>     <mailto:jortel at redhat.com>> wrote:
>>         On 04/12/2018 10:01 AM, Brian Bouterse wrote:
>>>         On Wed, Apr 11, 2018 at 6:07 PM, Jeff Ortel
>>>         <jortel at redhat.com <mailto:jortel at redhat.com>> wrote:
>>>             On 04/11/2018 03:29 PM, Brian Bouterse wrote:
>>>>             I think we should look into this in the near-term.
>>>>             Changing an interface on an object used by all plugins
>>>>             will be significantly easier, earlier.
>>>>             On Wed, Apr 11, 2018 at 12:25 PM, Jeff Ortel
>>>>             <jortel at redhat.com <mailto:jortel at redhat.com>> wrote:
>>>>                 On 04/11/2018 10:59 AM, Brian Bouterse wrote:
>>>>>                 On Tue, Apr 10, 2018 at 10:43 AM, Jeff Ortel
>>>>>                 <jortel at redhat.com <mailto:jortel at redhat.com>> wrote:
>>>>>                     On 04/06/2018 09:15 AM, Brian Bouterse wrote:
>>>>>>                     Several plugins have started using the
>>>>>>                     Changesets including pulp_ansible,
>>>>>>                     pulp_python, pulp_file, and perhaps others.
>>>>>>                     The Changesets provide several distinct
>>>>>>                     points of value which are great, but there
>>>>>>                     are two challenges I want to bring up. I want
>>>>>>                     to focus only on the problem statements first.
>>>>>>                     1. There is redundant "differencing" code in
>>>>>>                     all plugins. The Changeset interface requires
>>>>>>                     the plugin writer to determine what units
>>>>>>                     need to be added and those to be removed.
>>>>>>                     This requires all plugin writers to write the
>>>>>>                     same non-trivial differencing code over and
>>>>>>                     over. For example, you can see the same
>>>>>>                     non-trivial differencing code present in
>>>>>>                     pulp_ansible
>>>>>>                     <https://github.com/pulp/pulp_ansible/blob/d0eb9d125f9a6cdc82e2807bcad38749967a1245/pulp_ansible/app/tasks/synchronizing.py#L217-L306>,
>>>>>>                     pulp_file
>>>>>>                     <https://github.com/pulp/pulp_file/blob/30afa7cce667b57d8fe66d5fc1fe87fd77029210/pulp_file/app/tasks/synchronizing.py#L114-L193>,
>>>>>>                     and pulp_python
>>>>>>                     <https://github.com/pulp/pulp_python/blob/066d33990e64b5781c8419b96acaf2acf1982324/pulp_python/app/tasks/sync.py#L172-L223>.
>>>>>>                     Line-wise, this "differencing" code makes up
>>>>>>                     a large portion (maybe 50%) of the sync code
>>>>>>                     itself in each plugin.
>>>>>                     Ten lines of trivial set logic hardly seems
>>>>>                     like a big deal but any duplication is worth
>>>>>                     exploring.
>>>>>                 It's more than ten lines. Take pulp_ansible for
>>>>>                 example. By my count (the linked to section) it's
>>>>>                 89 lines, which out of 306 lines of plugin code
>>>>>                 for sync is 29% of extra redundant code. The other
>>>>>                 plugins have similar numbers. So with those
>>>>>                 numbers in mind, what do you think?
>>>>                 I was counting the lines (w/o comments) in
>>>>                 find_delta() based on the linked code. Which
>>>>                 functions are you counting?
>>>>             I was counting the find_delta, build_additions, and
>>>>             build_removals methods. Regardless of how the lines are
>>>>             counted, that differencing code is the duplication I'm
>>>>             talking about. There isn't a way to use the changesets
>>>>             without duplicating that differencing code in a plugin.
>>>             The differencing code is limited to find_delta() and
>>>             perhaps build_removals().  Agreed, the line count is
>>>             less useful than specifically identifying duplicate
>>>             code. Outside of find_delta(), I see similar code (in
>>>             part because it got copied from file plugin) but not
>>>             seeing actual duplication.  Can you be more specific?
>>>         Very similar code or identical code, I think it begs the
>>>         question why are we having plugin writer's do this at all?
>>>         What value are they creating with it? I don't have a
>>>         reasonable answer to that question, so the requirement for
>>>         plugin writer's to write that code brings me back to the
>>>         problem statement: "plugin writers have redundant
>>>         differencing code when using Changesets". More info on why
>>>         it is valuable for the plugin writer to do the differencing
>>>         code versus the Changesets would be helpful.
>>         The ChangeSet abstraction (and API) is based on following
>>         division of responsibility:
>>         The plugin  (with an understanding of the remote and its
>>         content):
>>           - Download metadata.
>>           - Parse metadata
>>           - Based on the metadata:
>>             - determine content to be added to the repository.
>>               - define how artifacts are downloaded.
>>               - construct content
>>             - determine content to be removed to the repository.
>>         Core (without understand of specific remote or its content):
>>           - Provide low level API for plugin to affect the changes it
>>         has determined need to be made to the repository.  This is
>>         downloaders, models etc.
>>           - Provide high(er) level API for plugin to affect the
>>         changes it has determined need to be made to the repository. 
>>         This is the ChangeSet.
>>         Are you proposing that this is not the correct division?
>>     Yes I believe these problem statements suggest we should adjust
>>     the plugin writer's responsibilities when interacting with the
>>     Changesets in two specific ways. It's not exactly the language
>>     you used, but I believe the following two responsibilities could
>>     be moved into the Changesets entirely:
>>     - determining if any given Artifact or Content unit is already
>>     present in Pulp (aka computing what needs tobe added)
>     Did you mean /added/ to the repository or /created/ in pulp. 
>     Currently, the plugin determines the content that needs to be
>     added to the repository.  This is modeled using a PendingContent
>     which fully defines the Content (unit) and its PendingArtifact(s)
>     which are included in the /additions/. The ChangeSet does
>     determine whether or not any artifacts need to be downloaded (and
>     downloads them based on policy) and determines which Content needs
>     to be /created/ vs simply added to the repository.  The plugin
>     blindly assumes that none of the /pending/ content has yet been
>     created pulp.  This accomplishes 2 things.  1) reduces complexity
>     and decision making by the plugin.  2) provides the ChangeSet with
>     all the information needed to /create/ and /download/ as needed. 
>     The /additions/ represents what the plugin wants to be added to
>     the repository to synchronize it with the remote repository.
>>     - determining which content units need to be removed (aka
>>     computing the removals)
>     I don't see how the ChangeSet has enough information to do this. 
>     The plugin can (most likely will) make the decision about what to
>     remove based on remote metadata, policy and configuration.
>>     ^ goals are a restating of the problem statement that plugin
>>     writers are asked to do differencing calculations when the
>>     Changesets could provide that to the plugin writer instead.
>>>>             So a shorter, simpler problem statement is: "to use the
>>>>             changesets plugin writers have to do extra work to
>>>>             compute additions and removals parameters".
>>>             This statement ^ is better but still too vague to
>>>             actually solve.  Can we elaborate on specifically what
>>>             "to do extra work" means?
>>>         Sure. Removing that vague language is one way to resolve its
>>>         vagueness. Here's a revised problem statement: "to use the
>>>         changesets plugin writers have to compute additions and
>>>         removals parameters". This problem statement would be
>>>         resolved by a solution that causes the plugin writer to
>>>         never have to produce these parameters and be replaced by an
>>>         interface that would require less effort from a plugin writer.
>>         I think it's the plugin's responsibility to determine the
>>         difference. Aside from that: without an understanding of the
>>         metadata and content type, how could the ChangeSet do this? 
>>         What might that looks like?
>>     If I'm understanding this correctly, the Changesets already do
>>     this for additions right? Help check my understanding. If a
>>     plugin writer delivers PendingContent and PendingArtifacts to the
>>     Changesets as 'additions', the Changesets will recognize them as
>>     already downloaded and not download them right? If this is the
>>     case, what is the benefit of having plugin writers also try to
>>     figure out if things should be downloaded?
>     As you pointed out, the plugin writer does not need to figure out
>     what needs to be downloaded.
>>>>>>                     2. Plugins can't do end-to-end stream
>>>>>>                     processing. The Changesets themselves do
>>>>>>                     stream processing, but when you call into
>>>>>>                     changeset.apply_and_drain() you have to have
>>>>>>                     fully parsed the metadata already. Currently
>>>>>>                     when fetching all metadata from Galaxy,
>>>>>>                     pulp_ansible takes about 380 seconds (6+
>>>>>>                     min). This means that the actual Changeset
>>>>>>                     content downloading starts 380 seconds later
>>>>>>                     than it could. At the heart of the problem,
>>>>>>                     the fetching+parsing of the metadata is not
>>>>>>                     part of the stream processing.
>>>>>                     The additions/removals can be any interable
>>>>>                     (like generator) and by using
>>>>>                     ChangeSet.apply() and iterating the returned
>>>>>                     object, the pluign can "turn the crank" while
>>>>>                     downloading and processing the metadata. The
>>>>>                     ChangeSet.apply_and_drain() is just a
>>>>>                     convenience method.  I don't see how this is a
>>>>>                     limitation of the ChangeSet.
>>>>>                 That is new info for me (and maybe everyone). OK
>>>>>                 so Changesets have two interfaces. apply() and
>>>>>                 apply_and_drain(). Why do we have two interfaces
>>>>>                 when apply() can support all existing use cases
>>>>>                 (that I know of) and do end-to-end stream
>>>>>                 processing but apply_and_drain() cannot? I see all
>>>>>                 of our examples (and all of our new plugins) using
>>>>>                 apply_and_drain().
>>>>                 The ChangeSet.apply() was how I designed (and
>>>>                 documented) it.  Not sure when/who added the
>>>>                 apply_and_drain(). +1 for removing it.
>>>>             I read through the changeset docs. I think this stream
>>>>             processing thing is still a problem but perhaps in how
>>>>             we're presenting the Changeset with it's arguments. I
>>>>             don't think apply() versus apply_and_drain() are at all
>>>>             related. Regardless of if you are using apply() or
>>>>             apply_and_drain(), the Changeset requires an
>>>>             'additions' and 'removals' arguments. This sends a
>>>>             clear message to the plugin writer that they need to
>>>>             compute additions and removals. They will fetch the
>>>>             metadata to compute these which is mostly how the
>>>>             changeset documentation reads. To know that they could
>>>>             present a generator that would correctly allow the
>>>>             metdata from inside the Changeset is I feel as
>>>>             non-obvious. I want the high-performing implementation
>>>>             to be the obvious one.
>>>>             So what about a problem statement like this:
>>>>             "Changesets are presented such that when you call into
>>>>             them you should already have fetched the metadata"?
>>>             I'm not sure what is meant by "presented".  If this
>>>             means that we should provide an example of how the
>>>             ChangeSet can be used by plugins (with large metadata)
>>>             in such a way that does not require downloading all the
>>>             metadata first - that sounds like a good idea.
>>>         Cool so this is transitioning to ideas for resolution. The
>>>         solution to add documentation on how to do this with the
>>>         existing interface is one option. My concern with adding
>>>         additional docs on how to use the current interface better
>>>         is that if users choose to follow the existing docs then
>>>         they will have the stream processing problem once again. To
>>>         me, this suggests that this new example should actually
>>>         replace the existing documentation.
>>         Seems like both example would be useful.  I'm not convinced
>>         that all plugins would benefit from this.  For example: the
>>         File plugin manifest is small and would likely not benefit
>>         from the extra complexity.  For complicated plugins (like
>>         RPM), can differencing decision be made before analyzing the
>>         entire metadata (eg: primary.xml)?  Also, it's not clear to
>>         me how this would work using the Downloader. Are you
>>         suggesting that the plugin would parse/process metadata files
>>         while they're being downloaded?  Perhaps a better
>>         understanding of the flow to be supported would help me
>>         understand this.
>>     Yes I am suggesting just that: that the Changesets could
>>     facilitate parse/processing metadata files while actual content
>>     named in those files is also being downloaded. I have a
>>     straightforward idea on how to achieve this. It's short and easy
>>     enough to write up (no code), but I want to make sure I'm not
>>     moving beyond the problem statement without others. Is there more
>>     we want to do on these problem statements, or would answering a
>>     bit about one way it could work be helpful?
>     The /additions/ can be (and usually is) a generator.  The
>     generator can yield based on metadata as it is downloaded and
>     digested.  In this way, the ChangeSet already facilitates this.
>>     Just to state my expectations: Moving beyond the problem
>>     statement I don't consider to be a commitment to solve it; just
>>     an agreement on what we're solving as we discuss various
>>     resolutions. Problem statements can also always be revisited.
>>     Either way forward is fine w/ me, just let me know how we should
>>     continue
>     So far, I'm not convinced that any specific problems/deficiencies
>     have been identified.  That said, you seem to have a different
>     abstraction in mind. I would be interested in reviewing it and how
>     it would be used by plugin writers.  It may help illustrate the
>     gains that you are envisioning.
>>>>>>                     Do you see the same challenges I do? Are
>>>>>>                     these the right problem statements? I think
>>>>>>                     with clear problem statements a solution will
>>>>>>                     be easy to see and agree on.
>>>>>                     I'm not convinced that these are actual
>>>>>                     problems/challenges that need to be addressed
>>>>>                     in the near term.
>>>>>>                     Thanks!
>>>>>>                     Brian
>>>>>>                     _______________________________________________
>>>>>>                     Pulp-dev mailing list
>>>>>>                     Pulp-dev at redhat.com <mailto:Pulp-dev at redhat.com>
>>>>>>                     https://www.redhat.com/mailman/listinfo/pulp-dev
>>>>>>                     <https://www.redhat.com/mailman/listinfo/pulp-dev>
>>>>>                     _______________________________________________
>>>>>                     Pulp-dev mailing list
>>>>>                     Pulp-dev at redhat.com <mailto:Pulp-dev at redhat.com>
>>>>>                     https://www.redhat.com/mailman/listinfo/pulp-dev
>>>>>                     <https://www.redhat.com/mailman/listinfo/pulp-dev>
>>         _______________________________________________
>>         Pulp-dev mailing list
>>         Pulp-dev at redhat.com <mailto:Pulp-dev at redhat.com>
>>         https://www.redhat.com/mailman/listinfo/pulp-dev
>>         <https://www.redhat.com/mailman/listinfo/pulp-dev>
>     _______________________________________________
>     Pulp-dev mailing list
>     Pulp-dev at redhat.com <mailto:Pulp-dev at redhat.com>
>     https://www.redhat.com/mailman/listinfo/pulp-dev
>     <https://www.redhat.com/mailman/listinfo/pulp-dev>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/pulp-dev/attachments/20180416/d31508ca/attachment.htm>

More information about the Pulp-dev mailing list