[Pulp-dev] Fwd: Re: Changesets Challenges

Wed Apr 11 22:07:53 UTC 2018

On 04/11/2018 03:29 PM, Brian Bouterse wrote:
> I think we should look into this in the near-term. Changing an 
> interface on an object used by all plugins will be significantly 
> easier, earlier.
>
>
> On Wed, Apr 11, 2018 at 12:25 PM, Jeff Ortel <jortel at redhat.com 
> <mailto:jortel at redhat.com>> wrote:
>
>
>
>     On 04/11/2018 10:59 AM, Brian Bouterse wrote:
>>
>>
>>     On Tue, Apr 10, 2018 at 10:43 AM, Jeff Ortel <jortel at redhat.com
>>     <mailto:jortel at redhat.com>> wrote:
>>
>>
>>
>>
>>
>>         	
>>
>>         	
>>
>>         	
>>
>>         	
>>
>>
>>         On 04/06/2018 09:15 AM, Brian Bouterse wrote:
>>>         Several plugins have started using the Changesets including
>>>         pulp_ansible, pulp_python, pulp_file, and perhaps others.
>>>         The Changesets provide several distinct points of value
>>>         which are great, but there are two challenges I want to
>>>         bring up. I want to focus only on the problem statements first.
>>>
>>>         1. There is redundant "differencing" code in all plugins.
>>>         The Changeset interface requires the plugin writer to
>>>         determine what units need to be added and those to be
>>>         removed. This requires all plugin writers to write the same
>>>         non-trivial differencing code over and over. For example,
>>>         you can see the same non-trivial differencing code present
>>>         in pulp_ansible
>>>         <https://github.com/pulp/pulp_ansible/blob/d0eb9d125f9a6cdc82e2807bcad38749967a1245/pulp_ansible/app/tasks/synchronizing.py#L217-L306>,
>>>         pulp_file
>>>         <https://github.com/pulp/pulp_file/blob/30afa7cce667b57d8fe66d5fc1fe87fd77029210/pulp_file/app/tasks/synchronizing.py#L114-L193>,
>>>         and pulp_python
>>>         <https://github.com/pulp/pulp_python/blob/066d33990e64b5781c8419b96acaf2acf1982324/pulp_python/app/tasks/sync.py#L172-L223>.
>>>         Line-wise, this "differencing" code makes up a large portion
>>>         (maybe 50%) of the sync code itself in each plugin.
>>
>>         Ten lines of trivial set logic hardly seems like a big deal
>>         but any duplication is worth exploring.
>>
>>     It's more than ten lines. Take pulp_ansible for example. By my
>>     count (the linked to section) it's 89 lines, which out of 306
>>     lines of plugin code for sync is 29% of extra redundant code. The
>>     other plugins have similar numbers. So with those numbers in
>>     mind, what do you think?
>
>     I was counting the lines (w/o comments) in find_delta() based on
>     the linked code.  Which functions are you counting?
>
>
> I was counting the find_delta, build_additions, and build_removals 
> methods. Regardless of how the lines are counted, that differencing 
> code is the duplication I'm talking about. There isn't a way to use 
> the changesets without duplicating that differencing code in a plugin.

The differencing code is limited to find_delta() and perhaps 
build_removals().  Agreed, the line count is less useful than 
specifically identifying duplicate code.  Outside of find_delta(), I see 
similar code (in part because it got copied from file plugin) but not 
seeing actual duplication.  Can you be more specific?

>
> So a shorter, simpler problem statement is: "to use the changesets 
> plugin writers have to do extra work to compute additions and removals 
> parameters".

This statement ^ is better but still too vague to actually solve. Can we 
elaborate on specifically what "to do extra work" means?

>
>>
>>>
>>>         2. Plugins can't do end-to-end stream processing. The
>>>         Changesets themselves do stream processing, but when you
>>>         call into changeset.apply_and_drain() you have to have fully
>>>         parsed the metadata already. Currently when fetching all
>>>         metadata from Galaxy, pulp_ansible takes about 380 seconds
>>>         (6+ min). This means that the actual Changeset content
>>>         downloading starts 380 seconds later than it could. At the
>>>         heart of the problem, the fetching+parsing of the metadata
>>>         is not part of the stream processing.
>>
>>         The additions/removals can be any interable (like generator)
>>         and by using ChangeSet.apply() and iterating the returned
>>         object, the pluign can "turn the crank" while downloading and
>>         processing the metadata. The ChangeSet.apply_and_drain() is
>>         just a convenience method.  I don't see how this is a
>>         limitation of the ChangeSet.
>>
>>
>>     That is new info for me (and maybe everyone). OK so Changesets
>>     have two interfaces. apply() and apply_and_drain(). Why do we
>>     have two interfaces when apply() can support all existing use
>>     cases (that I know of) and do end-to-end stream processing but
>>     apply_and_drain() cannot? I see all of our examples (and all of
>>     our new plugins) using apply_and_drain().
>
>     The ChangeSet.apply() was how I designed (and documented) it.  Not
>     sure when/who added the apply_and_drain().  +1 for removing it.
>
>
> I read through the changeset docs. I think this stream processing 
> thing is still a problem but perhaps in how we're presenting the 
> Changeset with it's arguments. I don't think apply() versus 
> apply_and_drain() are at all related. Regardless of if you are using 
> apply() or apply_and_drain(), the Changeset requires an 'additions' 
> and 'removals' arguments. This sends a clear message to the plugin 
> writer that they need to compute additions and removals. They will 
> fetch the metadata to compute these which is mostly how the changeset 
> documentation reads. To know that they could present a generator that 
> would correctly allow the metdata from inside the Changeset is I feel 
> as non-obvious. I want the high-performing implementation to be the 
> obvious one.
>
> So what about a problem statement like this: "Changesets are presented 
> such that when you call into them you should already have fetched the 
> metadata"?

I'm not sure what is meant by "presented".  If this means that we should 
provide an example of how the ChangeSet can be used by plugins (with 
large metadata) in such a way that does not require downloading all the 
metadata first - that sounds like a good idea.

>
>
>>
>>>
>>>         Do you see the same challenges I do? Are these the right
>>>         problem statements? I think with clear problem statements a
>>>         solution will be easy to see and agree on.
>>
>>         I'm not convinced that these are actual problems/challenges
>>         that need to be addressed in the near term.
>>
>>>
>>>         Thanks!
>>>         Brian
>>>
>>>
>>>         _______________________________________________
>>>         Pulp-dev mailing list
>>>         Pulp-dev at redhat.com <mailto:Pulp-dev at redhat.com>
>>>         https://www.redhat.com/mailman/listinfo/pulp-dev
>>>         <https://www.redhat.com/mailman/listinfo/pulp-dev>
>>
>>
>>         _______________________________________________
>>         Pulp-dev mailing list
>>         Pulp-dev at redhat.com <mailto:Pulp-dev at redhat.com>
>>         https://www.redhat.com/mailman/listinfo/pulp-dev
>>         <https://www.redhat.com/mailman/listinfo/pulp-dev>
>>
>>
>
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/pulp-dev/attachments/20180411/69b76c12/attachment.htm>