[Pulp-dev] Fwd: Re: Changesets Challenges

Jeff Ortel jortel at redhat.com
Thu Apr 12 23:12:23 UTC 2018

On 04/12/2018 04:00 PM, Brian Bouterse wrote:
> On Thu, Apr 12, 2018 at 11:53 AM, Jeff Ortel <jortel at redhat.com> wrote:
>     On 04/12/2018 10:01 AM, Brian Bouterse wrote:
>>     On Wed, Apr 11, 2018 at 6:07 PM, Jeff Ortel <jortel at redhat.com> wrote:
>>         On 04/11/2018 03:29 PM, Brian Bouterse wrote:
>>>         I think we should look into this in the near-term. Changing
>>>         an interface on an object used by all plugins will be
>>>         significantly easier, earlier.
>>>         On Wed, Apr 11, 2018 at 12:25 PM, Jeff Ortel <jortel at redhat.com> wrote:
>>>             On 04/11/2018 10:59 AM, Brian Bouterse wrote:
>>>>             On Tue, Apr 10, 2018 at 10:43 AM, Jeff Ortel <jortel at redhat.com> wrote:
>>>>                 On 04/06/2018 09:15 AM, Brian Bouterse wrote:
>>>>>                 Several plugins have started using the Changesets
>>>>>                 including pulp_ansible, pulp_python, pulp_file,
>>>>>                 and perhaps others. The Changesets provide several
>>>>>                 distinct points of value which are great, but
>>>>>                 there are two challenges I want to bring up. I
>>>>>                 want to focus only on the problem statements first.
>>>>>                 1. There is redundant "differencing" code in all
>>>>>                 plugins. The Changeset interface requires the
>>>>>                 plugin writer to determine which units need to be
>>>>>                 added and which need to be removed. This forces
>>>>>                 every plugin writer to write the same non-trivial
>>>>>                 differencing code over and over. For example,
>>>>>                 essentially the same logic appears in pulp_ansible
>>>>>                 <https://github.com/pulp/pulp_ansible/blob/d0eb9d125f9a6cdc82e2807bcad38749967a1245/pulp_ansible/app/tasks/synchronizing.py#L217-L306>,
>>>>>                 pulp_file
>>>>>                 <https://github.com/pulp/pulp_file/blob/30afa7cce667b57d8fe66d5fc1fe87fd77029210/pulp_file/app/tasks/synchronizing.py#L114-L193>,
>>>>>                 and pulp_python
>>>>>                 <https://github.com/pulp/pulp_python/blob/066d33990e64b5781c8419b96acaf2acf1982324/pulp_python/app/tasks/sync.py#L172-L223>.
>>>>>                 Line-wise, this differencing code makes up a
>>>>>                 large portion (maybe 50%) of the sync code in
>>>>>                 each plugin.
>>>>                 Ten lines of trivial set logic hardly seems like a
>>>>                 big deal but any duplication is worth exploring.
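>>>>                 The core of find_delta() boils down to set
>>>>                 differences, roughly like this (a simplified
>>>>                 sketch of the linked code; names approximate):
>>>>
>>>>                     def find_delta(remote_keys, local_keys, mirror=True):
>>>>                         # keys are natural-key tuples parsed from the
>>>>                         # remote metadata and the local repository
>>>>                         additions = remote_keys - local_keys
>>>>                         removals = (local_keys - remote_keys) if mirror else set()
>>>>                         return additions, removals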
>>>>             It's more than ten lines. Take pulp_ansible as an
>>>>             example. By my count (the linked-to section), it's 89
>>>>             lines, which out of 306 lines of plugin sync code
>>>>             means 29% of it is redundant. The other plugins have
>>>>             similar numbers. With those numbers in mind, what do
>>>>             you think?
>>>             I was counting the lines (w/o comments) in find_delta()
>>>             based on the linked code. Which functions are you counting?
>>>         I was counting the find_delta, build_additions, and
>>>         build_removals methods. Regardless of how the lines are
>>>         counted, that differencing code is the duplication I'm
>>>         talking about. There isn't a way to use the changesets
>>>         without duplicating that differencing code in a plugin.
>>         The differencing code is limited to find_delta() and perhaps
>>         build_removals().  Agreed, the line count is less useful than
>>         specifically identifying duplicate code.  Outside of
>>         find_delta(), I see similar code (in part because it was
>>         copied from the file plugin) but I'm not seeing actual
>>         duplication.  Can you be more specific?
>>     Whether the code is very similar or identical, it raises the
>>     question: why are we having plugin writers do this at all? What
>>     value are they creating with it? I don't have a reasonable
>>     answer to that question, so the requirement for plugin writers
>>     to write that code brings me back to the problem statement:
>>     "plugin writers have redundant differencing code when using
>>     Changesets". More info on why it is valuable for the plugin
>>     writer to do the differencing rather than the Changesets would
>>     be helpful.
>     The ChangeSet abstraction (and API) is based on the following
>     division of responsibility:
>     The plugin (with an understanding of the remote and its content):
>       - Download metadata.
>       - Parse metadata.
>       - Based on the metadata:
>         - determine content to be added to the repository.
>           - define how artifacts are downloaded.
>           - construct content.
>         - determine content to be removed from the repository.
>     Core (without an understanding of the specific remote or its
>     content):
>       - Provide a low-level API for the plugin to effect the changes
>     it has determined need to be made to the repository.  This is
>     downloaders, models, etc.
>       - Provide a high(er)-level API for the plugin to effect the
>     changes it has determined need to be made to the repository.
>     This is the ChangeSet.
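>     A plugin's sync under this division looks roughly like the
>     following sketch (constructor and method names approximate, not
>     an exact API reference):
>
>         # Plugin side: parse metadata and decide what changes to make.
>         additions = build_additions(metadata)   # yields PendingContent
>         removals = build_removals(metadata)     # yields Content to remove
>
>         # Core side: the ChangeSet downloads artifacts and applies
>         # the changes the plugin asked for.
>         changeset = ChangeSet(importer, additions=additions,
>                               removals=removals)
>         for report in changeset.apply():
>             pass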
>     Are you proposing that this is not the correct division?
> Yes I believe these problem statements suggest we should adjust the 
> plugin writer's responsibilities when interacting with the Changesets 
> in two specific ways. It's not exactly the language you used, but I 
> believe the following two responsibilities could be moved into the 
> Changesets entirely:
> - determining if any given Artifact or Content unit is already present 
> in Pulp (aka computing what needs to be added)

Did you mean /added/ to the repository or /created/ in pulp?  Currently, 
the plugin determines the content that needs to be added to the 
repository.  This is modeled using a PendingContent, which fully defines 
the Content (unit) and its PendingArtifact(s), and which is included in 
the /additions/.  The ChangeSet does determine whether or not any 
artifacts need to be downloaded (and downloads them based on policy) and 
determines which Content needs to be /created/ vs simply added to the 
repository.  The plugin blindly assumes that none of the /pending/ 
content has yet been created in pulp.  This accomplishes two things: 
1) it reduces complexity and decision making in the plugin, and 2) it 
provides the ChangeSet with all the information needed to /create/ and 
/download/ as needed.  The /additions/ represent what the plugin wants 
added to the repository to synchronize it with the remote repository.
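A plugin's additions generator looks roughly like this (a sketch 
adapted from the file plugin; class and field names approximate):

    from urllib.parse import urljoin

    def build_additions(remote, manifest):
        # Yield a PendingContent for every entry in the remote
        # metadata, assuming nothing exists in pulp yet.  The
        # ChangeSet decides what actually needs to be created
        # or downloaded.
        for entry in manifest.read():
            content = FileContent(relative_path=entry.path, digest=entry.digest)
            artifact = Artifact(size=entry.size, sha256=entry.digest)
            url = urljoin(remote.url, entry.path)
            yield PendingContent(
                content,
                artifacts={PendingArtifact(artifact, url, entry.path)})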

> - determining which content units need to be removed (aka computing 
> the removals)

I don't see how the ChangeSet has enough information to do this.  The 
plugin can (and most likely will) make the decision about what to 
remove based on remote metadata, policy, and configuration.
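
For example, a mirror-mode plugin might compute removals like this 
(hypothetical sketch; repository_content() stands in for whatever 
query the plugin uses):

    def build_removals(remote_keys, repository):
        # Yield content in the repository that the remote metadata no
        # longer mentions.  Only the plugin knows the natural key and
        # whether its policy calls for removal at all.
        for content in repository_content(repository):
            if content.natural_key() not in remote_keys:
                yield content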

> ^ These goals restate the problem statement: plugin writers are asked 
> to do differencing calculations that the Changesets could provide for 
> them instead.
>>>         So a shorter, simpler problem statement is: "to use the
>>>         changesets plugin writers have to do extra work to compute
>>>         additions and removals parameters".
>>         This statement ^ is better but still too vague to actually
>>         solve.  Can we elaborate on specifically what "to do extra
>>         work" means?
>>     Sure. Here's a revised problem statement with the vague language
>>     removed: "to use the changesets plugin writers have to compute
>>     additions and removals parameters". This problem statement would
>>     be resolved by a solution in which the plugin writer never has
>>     to produce these parameters, replaced by an interface that
>>     requires less effort from a plugin writer.
>     I think it's the plugin's responsibility to determine the
>     difference.  Aside from that: without an understanding of the
>     metadata and content type, how could the ChangeSet do this?  What
>     might that look like?
> If I'm understanding this correctly, the Changesets already do this 
> for additions, right? Help check my understanding. If a plugin writer 
> delivers PendingContent and PendingArtifacts to the Changesets as 
> 'additions', the Changesets will recognize the ones that are already 
> downloaded and not download them again, right? If this is the case, 
> what is the benefit of having plugin writers also try to figure out 
> if things should be downloaded?

As you pointed out, the plugin writer does not need to figure out what 
needs to be downloaded.

>>>>>                 2. Plugins can't do end-to-end stream processing.
>>>>>                 The Changesets themselves do stream processing,
>>>>>                 but when you call into changeset.apply_and_drain()
>>>>>                 you have to have fully parsed the metadata
>>>>>                 already. Currently when fetching all metadata from
>>>>>                 Galaxy, pulp_ansible takes about 380 seconds (6+
>>>>>                 min). This means that the actual Changeset content
>>>>>                 downloading starts 380 seconds later than it
>>>>>                 could. At the heart of the problem, the
>>>>>                 fetching+parsing of the metadata is not part of
>>>>>                 the stream processing.
>>>>                 The additions/removals can be any iterable (like a
>>>>                 generator), and by using ChangeSet.apply() and
>>>>                 iterating the returned object, the plugin can "turn
>>>>                 the crank" while downloading and processing the
>>>>                 metadata. The ChangeSet.apply_and_drain() is just a
>>>>                 convenience method.  I don't see how this is a
>>>>                 limitation of the ChangeSet.
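>>>>                 Roughly (a sketch; the additions generator yields
>>>>                 as metadata is parsed):
>>>>
>>>>                     # apply() returns an iterable the plugin
>>>>                     # drives itself:
>>>>                     for report in changeset.apply():
>>>>                         # each iteration pulls more items through
>>>>                         # the additions generator and drives the
>>>>                         # downloads
>>>>                         pass
>>>>
>>>>                     # apply_and_drain() merely exhausts that
>>>>                     # iterable for you.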
>>>>             That is new info for me (and maybe everyone). OK, so
>>>>             Changesets have two interfaces: apply() and
>>>>             apply_and_drain(). Why do we have two interfaces when
>>>>             apply() can support all existing use cases (that I know
>>>>             of) and do end-to-end stream processing but
>>>>             apply_and_drain() cannot? I see all of our examples
>>>>             (and all of our new plugins) using apply_and_drain().
>>>             The ChangeSet.apply() was how I designed (and
>>>             documented) it.  Not sure when/who added the
>>>             apply_and_drain().  +1 for removing it.
>>>         I read through the changeset docs. I think this stream
>>>         processing issue is still a problem, but perhaps it lies in
>>>         how we present the Changeset and its arguments. I don't
>>>         think apply() versus apply_and_drain() is related at all.
>>>         Regardless of whether you are using apply() or
>>>         apply_and_drain(), the Changeset requires 'additions' and
>>>         'removals' arguments. This sends a clear message to the
>>>         plugin writer that they need to compute additions and
>>>         removals. They will fetch the metadata to compute these,
>>>         which is mostly how the changeset documentation reads. That
>>>         they could instead present a generator that fetches and
>>>         parses the metadata from inside the Changeset is, I feel,
>>>         non-obvious. I want the high-performing implementation to
>>>         be the obvious one.
>>>         So what about a problem statement like this: "Changesets are
>>>         presented such that when you call into them you should
>>>         already have fetched the metadata"?
>>         I'm not sure what is meant by "presented".  If this means
>>         that we should provide an example of how the ChangeSet can be
>>         used by plugins (with large metadata) in such a way that does
>>         not require downloading all the metadata first - that sounds
>>         like a good idea.
>>     Cool so this is transitioning to ideas for resolution. The
>>     solution to add documentation on how to do this with the existing
>>     interface is one option. My concern with adding additional docs
>>     on how to use the current interface better is that if users
>>     choose to follow the existing docs then they will have the stream
>>     processing problem once again. To me, this suggests that this new
>>     example should actually replace the existing documentation.
>     Seems like both examples would be useful.  I'm not convinced that
>     all plugins would benefit from this. For example: the File plugin
>     manifest is small and would likely not benefit from the extra
>     complexity.  For complicated plugins (like RPM), can differencing
>     decisions be made before analyzing the entire metadata (e.g.,
>     primary.xml)?  Also, it's not clear to me how this would work
>     using the Downloader.  Are you suggesting that the plugin would
>     parse/process metadata files while they're being downloaded?
>     Perhaps a better understanding of the flow to be supported would
>     help me understand this.
> Yes, I am suggesting just that: the Changesets could facilitate 
> parsing/processing metadata files while the actual content named in 
> those files is also being downloaded. I have a straightforward idea 
> for how to achieve this. It's short and easy enough to write up (no 
> code), but I don't want to move beyond the problem statement without 
> agreement from others. Is there more we want to do on these problem 
> statements, or would it be helpful to sketch one way this could work?

The /additions/ can be (and usually is) a generator.  The generator can 
yield based on metadata as it is downloaded and digested.  In this way, 
the ChangeSet already facilitates this.
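
Concretely, something like this (a hypothetical sketch; 
fetch_metadata_pages() and parse_entries() stand in for whatever 
incremental fetching and parsing the plugin does):

    def additions(remote):
        # Yield PendingContent as metadata is fetched and parsed, so
        # content downloading can begin before parsing has finished.
        for page in fetch_metadata_pages(remote):
            for entry in parse_entries(page):
                yield pending_content_for(entry)  # as in earlier sketch

    changeset = ChangeSet(importer, additions=additions(remote),
                          removals=[])
    for report in changeset.apply():
        pass  # downloads proceed while metadata is still arriving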

> Just to state my expectations: I don't consider moving beyond the 
> problem statement to be a commitment to solve it; just an agreement 
> on what we're solving as we discuss various resolutions. Problem 
> statements can also always be revisited. Either way forward is fine 
> with me; just let me know how we should continue.

So far, I'm not convinced that any specific problems/deficiencies have 
been identified.  That said, you seem to have a different abstraction in 
mind. I would be interested in reviewing it and how it would be used by 
plugin writers.  It may help illustrate the gains that you are envisioning.

>>>>>                 Do you see the same challenges I do? Are these the
>>>>>                 right problem statements? I think with clear
>>>>>                 problem statements a solution will be easy to see
>>>>>                 and agree on.
>>>>                 I'm not convinced that these are actual
>>>>                 problems/challenges that need to be addressed in
>>>>                 the near term.
>>>>>                 Thanks!
>>>>>                 Brian