[Pulp-dev] Fwd: Re: Changesets Challenges
Jeff Ortel
jortel at redhat.com
Wed Apr 11 22:07:53 UTC 2018
On 04/11/2018 03:29 PM, Brian Bouterse wrote:
> I think we should look into this in the near-term. Changing an
> interface on an object used by all plugins will be significantly
> easier, earlier.
>
>
> On Wed, Apr 11, 2018 at 12:25 PM, Jeff Ortel <jortel at redhat.com
> <mailto:jortel at redhat.com>> wrote:
>
>
>
> On 04/11/2018 10:59 AM, Brian Bouterse wrote:
>>
>>
>> On Tue, Apr 10, 2018 at 10:43 AM, Jeff Ortel <jortel at redhat.com
>> <mailto:jortel at redhat.com>> wrote:
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On 04/06/2018 09:15 AM, Brian Bouterse wrote:
>>> Several plugins have started using the Changesets including
>>> pulp_ansible, pulp_python, pulp_file, and perhaps others.
>>> The Changesets provide several distinct points of value
>>> which are great, but there are two challenges I want to
>>> bring up. I want to focus only on the problem statements first.
>>>
>>> 1. There is redundant "differencing" code in all plugins.
>>> The Changeset interface requires the plugin writer to
>>> determine what units need to be added and those to be
>>> removed. This requires all plugin writers to write the same
>>> non-trivial differencing code over and over. For example,
>>> you can see the same non-trivial differencing code present
>>> in pulp_ansible
>>> <https://github.com/pulp/pulp_ansible/blob/d0eb9d125f9a6cdc82e2807bcad38749967a1245/pulp_ansible/app/tasks/synchronizing.py#L217-L306>,
>>> pulp_file
>>> <https://github.com/pulp/pulp_file/blob/30afa7cce667b57d8fe66d5fc1fe87fd77029210/pulp_file/app/tasks/synchronizing.py#L114-L193>,
>>> and pulp_python
>>> <https://github.com/pulp/pulp_python/blob/066d33990e64b5781c8419b96acaf2acf1982324/pulp_python/app/tasks/sync.py#L172-L223>.
>>> Line-wise, this "differencing" code makes up a large portion
>>> (maybe 50%) of the sync code itself in each plugin.
>>
>> Ten lines of trivial set logic hardly seems like a big deal
>> but any duplication is worth exploring.
>>
>> It's more than ten lines. Take pulp_ansible for example. By my
>> count (the linked to section) it's 89 lines, which out of 306
>> lines of plugin code for sync is 29% of extra redundant code. The
>> other plugins have similar numbers. So with those numbers in
>> mind, what do you think?
>
> I was counting the lines (w/o comments) in find_delta() based on
> the linked code. Which functions are you counting?
>
>
> I was counting the find_delta, build_additions, and build_removals
> methods. Regardless of how the lines are counted, that differencing
> code is the duplication I'm talking about. There isn't a way to use
> the changesets without duplicating that differencing code in a plugin.
The differencing code is limited to find_delta() and perhaps
build_removals(). Agreed, the line count is less useful than
specifically identifying duplicate code. Outside of find_delta(), I see
similar code (in part because it got copied from file plugin) but not
seeing actual duplication. Can you be more specific?
>
> So a shorter, simpler problem statement is: "to use the changesets
> plugin writers have to do extra work to compute additions and removals
> parameters".
This statement ^ is better but still too vague to actually solve. Can we
elaborate on specifically what "to do extra work" means?
>
>>
>>>
>>> 2. Plugins can't do end-to-end stream processing. The
>>> Changesets themselves do stream processing, but when you
>>> call into changeset.apply_and_drain() you have to have fully
>>> parsed the metadata already. Currently when fetching all
>>> metadata from Galaxy, pulp_ansible takes about 380 seconds
>>> (6+ min). This means that the actual Changeset content
>>> downloading starts 380 seconds later than it could. At the
>>> heart of the problem, the fetching+parsing of the metadata
>>> is not part of the stream processing.
>>
>> The additions/removals can be any interable (like generator)
>> and by using ChangeSet.apply() and iterating the returned
>> object, the pluign can "turn the crank" while downloading and
>> processing the metadata. The ChangeSet.apply_and_drain() is
>> just a convenience method. I don't see how this is a
>> limitation of the ChangeSet.
>>
>>
>> That is new info for me (and maybe everyone). OK so Changesets
>> have two interfaces. apply() and apply_and_drain(). Why do we
>> have two interfaces when apply() can support all existing use
>> cases (that I know of) and do end-to-end stream processing but
>> apply_and_drain() cannot? I see all of our examples (and all of
>> our new plugins) using apply_and_drain().
>
> The ChangeSet.apply() was how I designed (and documented) it. Not
> sure when/who added the apply_and_drain(). +1 for removing it.
>
>
> I read through the changeset docs. I think this stream processing
> thing is still a problem but perhaps in how we're presenting the
> Changeset with it's arguments. I don't think apply() versus
> apply_and_drain() are at all related. Regardless of if you are using
> apply() or apply_and_drain(), the Changeset requires an 'additions'
> and 'removals' arguments. This sends a clear message to the plugin
> writer that they need to compute additions and removals. They will
> fetch the metadata to compute these which is mostly how the changeset
> documentation reads. To know that they could present a generator that
> would correctly allow the metdata from inside the Changeset is I feel
> as non-obvious. I want the high-performing implementation to be the
> obvious one.
>
> So what about a problem statement like this: "Changesets are presented
> such that when you call into them you should already have fetched the
> metadata"?
I'm not sure what is meant by "presented". If this means that we should
provide an example of how the ChangeSet can be used by plugins (with
large metadata) in such a way that does not require downloading all the
metadata first - that sounds like a good idea.
>
>
>>
>>>
>>> Do you see the same challenges I do? Are these the right
>>> problem statements? I think with clear problem statements a
>>> solution will be easy to see and agree on.
>>
>> I'm not convinced that these are actual problems/challenges
>> that need to be addressed in the near term.
>>
>>>
>>> Thanks!
>>> Brian
>>>
>>>
>>> _______________________________________________
>>> Pulp-dev mailing list
>>> Pulp-dev at redhat.com <mailto:Pulp-dev at redhat.com>
>>> https://www.redhat.com/mailman/listinfo/pulp-dev
>>> <https://www.redhat.com/mailman/listinfo/pulp-dev>
>>
>>
>> _______________________________________________
>> Pulp-dev mailing list
>> Pulp-dev at redhat.com <mailto:Pulp-dev at redhat.com>
>> https://www.redhat.com/mailman/listinfo/pulp-dev
>> <https://www.redhat.com/mailman/listinfo/pulp-dev>
>>
>>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/pulp-dev/attachments/20180411/69b76c12/attachment.htm>
More information about the Pulp-dev
mailing list