[Pulp-dev] Fwd: Re: Changesets Challenges

Brian Bouterse bbouters at redhat.com
Wed Apr 11 20:29:23 UTC 2018


I think we should look into this in the near-term. Changing an interface on
an object used by all plugins will be significantly easier, earlier.


On Wed, Apr 11, 2018 at 12:25 PM, Jeff Ortel <jortel at redhat.com> wrote:

>
>
> On 04/11/2018 10:59 AM, Brian Bouterse wrote:
>
>
>
> On Tue, Apr 10, 2018 at 10:43 AM, Jeff Ortel <jortel at redhat.com> wrote:
>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On 04/06/2018 09:15 AM, Brian Bouterse wrote:
>>
>> Several plugins have started using the Changesets including pulp_ansible,
>> pulp_python, pulp_file, and perhaps others. The Changesets provide several
>> distinct points of value which are great, but there are two challenges I
>> want to bring up. I want to focus only on the problem statements first.
>>
>> 1. There is redundant "differencing" code in all plugins. The Changeset
>> interface requires the plugin writer to determine what units need to be
>> added and those to be removed. This requires all plugin writers to write
>> the same non-trivial differencing code over and over. For example, you can
>> see the same non-trivial differencing code present in pulp_ansible
>> <https://github.com/pulp/pulp_ansible/blob/d0eb9d125f9a6cdc82e2807bcad38749967a1245/pulp_ansible/app/tasks/synchronizing.py#L217-L306>,
>> pulp_file
>> <https://github.com/pulp/pulp_file/blob/30afa7cce667b57d8fe66d5fc1fe87fd77029210/pulp_file/app/tasks/synchronizing.py#L114-L193>,
>> and pulp_python
>> <https://github.com/pulp/pulp_python/blob/066d33990e64b5781c8419b96acaf2acf1982324/pulp_python/app/tasks/sync.py#L172-L223>.
>> Line-wise, this "differencing" code makes up a large portion (maybe 50%) of
>> the sync code itself in each plugin.
>>
>>
>> Ten lines of trivial set logic hardly seems like a big deal but any
>> duplication is worth exploring.
>>
> It's more than ten lines. Take pulp_ansible for example. By my count (the
> linked to section) it's 89 lines, which out of 306 lines of plugin code for
> sync is 29% of extra redundant code. The other plugins have similar
> numbers. So with those numbers in mind, what do you think?
>
>
> I was counting the lines (w/o comments) in find_delta() based on the
> linked code.  Which functions are you counting?
>

I was counting the find_delta, build_additions, and build_removals methods.
Regardless of how the lines are counted, that differencing code is the
duplication I'm talking about. There isn't a way to use the changesets
without duplicating that differencing code in a plugin.

So a shorter, simpler problem statement is: "to use the changesets plugin
writers have to do extra work to compute additions and removals parameters".


>
>
>>
>> 2. Plugins can't do end-to-end stream processing. The Changesets
>> themselves do stream processing, but when you call into
>> changeset.apply_and_drain() you have to have fully parsed the metadata
>> already. Currently when fetching all metadata from Galaxy, pulp_ansible
>> takes about 380 seconds (6+ min). This means that the actual Changeset
>> content downloading starts 380 seconds later than it could. At the heart of
>> the problem, the fetching+parsing of the metadata is not part of the stream
>> processing.
>>
>>
>> The additions/removals can be any interable (like generator) and by using
>> ChangeSet.apply() and iterating the returned object, the pluign can "turn
>> the crank" while downloading and processing the metadata.  The
>> ChangeSet.apply_and_drain() is just a convenience method.  I don't see how
>> this is a limitation of the ChangeSet.
>>
>
> That is new info for me (and maybe everyone). OK so Changesets have two
> interfaces. apply() and apply_and_drain(). Why do we have two interfaces
> when apply() can support all existing use cases (that I know of) and do
> end-to-end stream processing but apply_and_drain() cannot? I see all of our
> examples (and all of our new plugins) using apply_and_drain().
>
>
> The ChangeSet.apply() was how I designed (and documented) it.  Not sure
> when/who added the apply_and_drain().  +1 for removing it.
>

I read through the changeset docs. I think this stream processing thing is
still a problem but perhaps in how we're presenting the Changeset with it's
arguments. I don't think apply() versus apply_and_drain() are at all
related. Regardless of if you are using apply() or apply_and_drain(), the
Changeset requires an 'additions' and 'removals' arguments. This sends a
clear message to the plugin writer that they need to compute additions and
removals. They will fetch the metadata to compute these which is mostly how
the changeset documentation reads. To know that they could present a
generator that would correctly allow the metdata from inside the Changeset
is I feel as non-obvious. I want the high-performing implementation to be
the obvious one.

So what about a problem statement like this: "Changesets are presented such
that when you call into them you should already have fetched the metadata"?


>
>
>
>>
>>
>> Do you see the same challenges I do? Are these the right problem
>> statements? I think with clear problem statements a solution will be easy
>> to see and agree on.
>>
>>
>> I'm not convinced that these are actual problems/challenges that need to
>> be addressed in the near term.
>>
>>
>> Thanks!
>> Brian
>>
>>
>> _______________________________________________
>> Pulp-dev mailing listPulp-dev at redhat.comhttps://www.redhat.com/mailman/listinfo/pulp-dev
>>
>>
>>
>> _______________________________________________
>> Pulp-dev mailing list
>> Pulp-dev at redhat.com
>> https://www.redhat.com/mailman/listinfo/pulp-dev
>>
>>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/pulp-dev/attachments/20180411/5bf8decf/attachment.htm>


More information about the Pulp-dev mailing list