[Pulp-dev] Fwd: Re: Changesets Challenges
Jeff Ortel
jortel at redhat.com
Mon Apr 16 16:21:27 UTC 2018
Thanks for the proposal, Brian. I also commented on the issue.
On 04/16/2018 09:41 AM, Brian Bouterse wrote:
> I wrote up a description of the opportunity I see here [0]. I put a
> high level pro/con analysis below. I would like feedback on (a) if
> this adequately addresses the problem statements, (b) if there are
> alternatives, and (c) does this improve the plugin wrtier's experience
> enough to adopt this?
>
> pros:
> * significantly less plugin code to write. Compare the Thing example
> code versus the current docs.
+1
> * Higher performing with metadata downloading and parsing being
> included in stream processing. This causes sync's for pulp_ansible to
> start 6+ min earlier.
This could also be done currently with the ChangeSet as-is.
>
> cons:
> * Progress reporting doesn't know how many things it's processing
> (it's a stream). So user's would see progress as "X things completed",
> not "X of Y things completed". Y can't be known until just before the
> stream processing completes otherwise it's not stream processing.
I'm not a fan of the SizedIterator either.
I contemplated this when designing the ChangeSet. An alternative I
considered was to report progress like OSTree does. It reports progress
by periodically updating the expected TOTAL. It's better than nothing.
>
> [0]: https://pulp.plan.io/issues/3570
>
> Thanks!
> Brian
>
>
>
> On Thu, Apr 12, 2018 at 7:12 PM, Jeff Ortel <jortel at redhat.com
> <mailto:jortel at redhat.com>> wrote:
>
>
>
> On 04/12/2018 04:00 PM, Brian Bouterse wrote:
>>
>> On Thu, Apr 12, 2018 at 11:53 AM, Jeff Ortel <jortel at redhat.com
>> <mailto:jortel at redhat.com>> wrote:
>>
>>
>>
>> On 04/12/2018 10:01 AM, Brian Bouterse wrote:
>>>
>>>
>>> On Wed, Apr 11, 2018 at 6:07 PM, Jeff Ortel
>>> <jortel at redhat.com <mailto:jortel at redhat.com>> wrote:
>>>
>>>
>>>
>>> On 04/11/2018 03:29 PM, Brian Bouterse wrote:
>>>> I think we should look into this in the near-term.
>>>> Changing an interface on an object used by all plugins
>>>> will be significantly easier, earlier.
>>>>
>>>>
>>>> On Wed, Apr 11, 2018 at 12:25 PM, Jeff Ortel
>>>> <jortel at redhat.com <mailto:jortel at redhat.com>> wrote:
>>>>
>>>>
>>>>
>>>> On 04/11/2018 10:59 AM, Brian Bouterse wrote:
>>>>>
>>>>>
>>>>> On Tue, Apr 10, 2018 at 10:43 AM, Jeff Ortel
>>>>> <jortel at redhat.com <mailto:jortel at redhat.com>> wrote:
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 04/06/2018 09:15 AM, Brian Bouterse wrote:
>>>>>> Several plugins have started using the
>>>>>> Changesets including pulp_ansible,
>>>>>> pulp_python, pulp_file, and perhaps others.
>>>>>> The Changesets provide several distinct
>>>>>> points of value which are great, but there
>>>>>> are two challenges I want to bring up. I want
>>>>>> to focus only on the problem statements first.
>>>>>>
>>>>>> 1. There is redundant "differencing" code in
>>>>>> all plugins. The Changeset interface requires
>>>>>> the plugin writer to determine what units
>>>>>> need to be added and those to be removed.
>>>>>> This requires all plugin writers to write the
>>>>>> same non-trivial differencing code over and
>>>>>> over. For example, you can see the same
>>>>>> non-trivial differencing code present in
>>>>>> pulp_ansible
>>>>>> <https://github.com/pulp/pulp_ansible/blob/d0eb9d125f9a6cdc82e2807bcad38749967a1245/pulp_ansible/app/tasks/synchronizing.py#L217-L306>,
>>>>>> pulp_file
>>>>>> <https://github.com/pulp/pulp_file/blob/30afa7cce667b57d8fe66d5fc1fe87fd77029210/pulp_file/app/tasks/synchronizing.py#L114-L193>,
>>>>>> and pulp_python
>>>>>> <https://github.com/pulp/pulp_python/blob/066d33990e64b5781c8419b96acaf2acf1982324/pulp_python/app/tasks/sync.py#L172-L223>.
>>>>>> Line-wise, this "differencing" code makes up
>>>>>> a large portion (maybe 50%) of the sync code
>>>>>> itself in each plugin.
>>>>>
>>>>> Ten lines of trivial set logic hardly seems
>>>>> like a big deal but any duplication is worth
>>>>> exploring.
>>>>>
>>>>> It's more than ten lines. Take pulp_ansible for
>>>>> example. By my count (the linked to section) it's
>>>>> 89 lines, which out of 306 lines of plugin code
>>>>> for sync is 29% of extra redundant code. The other
>>>>> plugins have similar numbers. So with those
>>>>> numbers in mind, what do you think?
>>>>
>>>> I was counting the lines (w/o comments) in
>>>> find_delta() based on the linked code. Which
>>>> functions are you counting?
>>>>
>>>>
>>>> I was counting the find_delta, build_additions, and
>>>> build_removals methods. Regardless of how the lines are
>>>> counted, that differencing code is the duplication I'm
>>>> talking about. There isn't a way to use the changesets
>>>> without duplicating that differencing code in a plugin.
>>>
>>> The differencing code is limited to find_delta() and
>>> perhaps build_removals(). Agreed, the line count is
>>> less useful than specifically identifying duplicate
>>> code. Outside of find_delta(), I see similar code (in
>>> part because it got copied from file plugin) but not
>>> seeing actual duplication. Can you be more specific?
>>>
>>>
>>> Very similar code or identical code, I think it begs the
>>> question why are we having plugin writer's do this at all?
>>> What value are they creating with it? I don't have a
>>> reasonable answer to that question, so the requirement for
>>> plugin writer's to write that code brings me back to the
>>> problem statement: "plugin writers have redundant
>>> differencing code when using Changesets". More info on why
>>> it is valuable for the plugin writer to do the differencing
>>> code versus the Changesets would be helpful.
>>
>> The ChangeSet abstraction (and API) is based on following
>> division of responsibility:
>>
>> The plugin (with an understanding of the remote and its
>> content):
>> - Download metadata.
>> - Parse metadata
>> - Based on the metadata:
>> - determine content to be added to the repository.
>> - define how artifacts are downloaded.
>> - construct content
>> - determine content to be removed to the repository.
>>
>> Core (without understand of specific remote or its content):
>> - Provide low level API for plugin to affect the changes it
>> has determined need to be made to the repository. This is
>> downloaders, models etc.
>> - Provide high(er) level API for plugin to affect the
>> changes it has determined need to be made to the repository.
>> This is the ChangeSet.
>>
>> Are you proposing that this is not the correct division?
>>
>>
>> Yes I believe these problem statements suggest we should adjust
>> the plugin writer's responsibilities when interacting with the
>> Changesets in two specific ways. It's not exactly the language
>> you used, but I believe the following two responsibilities could
>> be moved into the Changesets entirely:
>>
>> - determining if any given Artifact or Content unit is already
>> present in Pulp (aka computing what needs tobe added)
>
> Did you mean /added/ to the repository or /created/ in pulp.
> Currently, the plugin determines the content that needs to be
> added to the repository. This is modeled using a PendingContent
> which fully defines the Content (unit) and its PendingArtifact(s)
> which are included in the /additions/. The ChangeSet does
> determine whether or not any artifacts need to be downloaded (and
> downloads them based on policy) and determines which Content needs
> to be /created/ vs simply added to the repository. The plugin
> blindly assumes that none of the /pending/ content has yet been
> created pulp. This accomplishes 2 things. 1) reduces complexity
> and decision making by the plugin. 2) provides the ChangeSet with
> all the information needed to /create/ and /download/ as needed.
> The /additions/ represents what the plugin wants to be added to
> the repository to synchronize it with the remote repository.
>
>> - determining which content units need to be removed (aka
>> computing the removals)
>
> I don't see how the ChangeSet has enough information to do this.
> The plugin can (most likely will) make the decision about what to
> remove based on remote metadata, policy and configuration.
>
>>
>> ^ goals are a restating of the problem statement that plugin
>> writers are asked to do differencing calculations when the
>> Changesets could provide that to the plugin writer instead.
>>
>>>
>>>>
>>>> So a shorter, simpler problem statement is: "to use the
>>>> changesets plugin writers have to do extra work to
>>>> compute additions and removals parameters".
>>>
>>> This statement ^ is better but still too vague to
>>> actually solve. Can we elaborate on specifically what
>>> "to do extra work" means?
>>>
>>>
>>> Sure. Removing that vague language is one way to resolve its
>>> vagueness. Here's a revised problem statement: "to use the
>>> changesets plugin writers have to compute additions and
>>> removals parameters". This problem statement would be
>>> resolved by a solution that causes the plugin writer to
>>> never have to produce these parameters and be replaced by an
>>> interface that would require less effort from a plugin writer.
>>
>> I think it's the plugin's responsibility to determine the
>> difference. Aside from that: without an understanding of the
>> metadata and content type, how could the ChangeSet do this?
>> What might that looks like?
>>
>>
>> If I'm understanding this correctly, the Changesets already do
>> this for additions right? Help check my understanding. If a
>> plugin writer delivers PendingContent and PendingArtifacts to the
>> Changesets as 'additions', the Changesets will recognize them as
>> already downloaded and not download them right? If this is the
>> case, what is the benefit of having plugin writers also try to
>> figure out if things should be downloaded?
>
> As you pointed out, the plugin writer does not need to figure out
> what needs to be downloaded.
>
>>
>>
>>>
>>>
>>>>
>>>>>
>>>>>>
>>>>>> 2. Plugins can't do end-to-end stream
>>>>>> processing. The Changesets themselves do
>>>>>> stream processing, but when you call into
>>>>>> changeset.apply_and_drain() you have to have
>>>>>> fully parsed the metadata already. Currently
>>>>>> when fetching all metadata from Galaxy,
>>>>>> pulp_ansible takes about 380 seconds (6+
>>>>>> min). This means that the actual Changeset
>>>>>> content downloading starts 380 seconds later
>>>>>> than it could. At the heart of the problem,
>>>>>> the fetching+parsing of the metadata is not
>>>>>> part of the stream processing.
>>>>>
>>>>> The additions/removals can be any interable
>>>>> (like generator) and by using
>>>>> ChangeSet.apply() and iterating the returned
>>>>> object, the pluign can "turn the crank" while
>>>>> downloading and processing the metadata. The
>>>>> ChangeSet.apply_and_drain() is just a
>>>>> convenience method. I don't see how this is a
>>>>> limitation of the ChangeSet.
>>>>>
>>>>>
>>>>> That is new info for me (and maybe everyone). OK
>>>>> so Changesets have two interfaces. apply() and
>>>>> apply_and_drain(). Why do we have two interfaces
>>>>> when apply() can support all existing use cases
>>>>> (that I know of) and do end-to-end stream
>>>>> processing but apply_and_drain() cannot? I see all
>>>>> of our examples (and all of our new plugins) using
>>>>> apply_and_drain().
>>>>
>>>> The ChangeSet.apply() was how I designed (and
>>>> documented) it. Not sure when/who added the
>>>> apply_and_drain(). +1 for removing it.
>>>>
>>>>
>>>> I read through the changeset docs. I think this stream
>>>> processing thing is still a problem but perhaps in how
>>>> we're presenting the Changeset with it's arguments. I
>>>> don't think apply() versus apply_and_drain() are at all
>>>> related. Regardless of if you are using apply() or
>>>> apply_and_drain(), the Changeset requires an
>>>> 'additions' and 'removals' arguments. This sends a
>>>> clear message to the plugin writer that they need to
>>>> compute additions and removals. They will fetch the
>>>> metadata to compute these which is mostly how the
>>>> changeset documentation reads. To know that they could
>>>> present a generator that would correctly allow the
>>>> metdata from inside the Changeset is I feel as
>>>> non-obvious. I want the high-performing implementation
>>>> to be the obvious one.
>>>>
>>>> So what about a problem statement like this:
>>>> "Changesets are presented such that when you call into
>>>> them you should already have fetched the metadata"?
>>>
>>> I'm not sure what is meant by "presented". If this
>>> means that we should provide an example of how the
>>> ChangeSet can be used by plugins (with large metadata)
>>> in such a way that does not require downloading all the
>>> metadata first - that sounds like a good idea.
>>>
>>>
>>> Cool so this is transitioning to ideas for resolution. The
>>> solution to add documentation on how to do this with the
>>> existing interface is one option. My concern with adding
>>> additional docs on how to use the current interface better
>>> is that if users choose to follow the existing docs then
>>> they will have the stream processing problem once again. To
>>> me, this suggests that this new example should actually
>>> replace the existing documentation.
>>
>> Seems like both example would be useful. I'm not convinced
>> that all plugins would benefit from this. For example: the
>> File plugin manifest is small and would likely not benefit
>> from the extra complexity. For complicated plugins (like
>> RPM), can differencing decision be made before analyzing the
>> entire metadata (eg: primary.xml)? Also, it's not clear to
>> me how this would work using the Downloader. Are you
>> suggesting that the plugin would parse/process metadata files
>> while they're being downloaded? Perhaps a better
>> understanding of the flow to be supported would help me
>> understand this.
>>
>>
>> Yes I am suggesting just that: that the Changesets could
>> facilitate parse/processing metadata files while actual content
>> named in those files is also being downloaded. I have a
>> straightforward idea on how to achieve this. It's short and easy
>> enough to write up (no code), but I want to make sure I'm not
>> moving beyond the problem statement without others. Is there more
>> we want to do on these problem statements, or would answering a
>> bit about one way it could work be helpful?
>
> The /additions/ can be (and usually is) a generator. The
> generator can yield based on metadata as it is downloaded and
> digested. In this way, the ChangeSet already facilitates this.
>
>>
>> Just to state my expectations: Moving beyond the problem
>> statement I don't consider to be a commitment to solve it; just
>> an agreement on what we're solving as we discuss various
>> resolutions. Problem statements can also always be revisited.
>> Either way forward is fine w/ me, just let me know how we should
>> continue
>
> So far, I'm not convinced that any specific problems/deficiencies
> have been identified. That said, you seem to have a different
> abstraction in mind. I would be interested in reviewing it and how
> it would be used by plugin writers. It may help illustrate the
> gains that you are envisioning.
>
>>
>>>
>>>>
>>>>
>>>>>
>>>>>>
>>>>>> Do you see the same challenges I do? Are
>>>>>> these the right problem statements? I think
>>>>>> with clear problem statements a solution will
>>>>>> be easy to see and agree on.
>>>>>
>>>>> I'm not convinced that these are actual
>>>>> problems/challenges that need to be addressed
>>>>> in the near term.
>>>>>
>>>>>>
>>>>>> Thanks!
>>>>>> Brian
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> Pulp-dev mailing list
>>>>>> Pulp-dev at redhat.com <mailto:Pulp-dev at redhat.com>
>>>>>> https://www.redhat.com/mailman/listinfo/pulp-dev
>>>>>> <https://www.redhat.com/mailman/listinfo/pulp-dev>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Pulp-dev mailing list
>>>>> Pulp-dev at redhat.com <mailto:Pulp-dev at redhat.com>
>>>>> https://www.redhat.com/mailman/listinfo/pulp-dev
>>>>> <https://www.redhat.com/mailman/listinfo/pulp-dev>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>> _______________________________________________
>> Pulp-dev mailing list
>> Pulp-dev at redhat.com <mailto:Pulp-dev at redhat.com>
>> https://www.redhat.com/mailman/listinfo/pulp-dev
>> <https://www.redhat.com/mailman/listinfo/pulp-dev>
>>
>>
>
>
> _______________________________________________
> Pulp-dev mailing list
> Pulp-dev at redhat.com <mailto:Pulp-dev at redhat.com>
> https://www.redhat.com/mailman/listinfo/pulp-dev
> <https://www.redhat.com/mailman/listinfo/pulp-dev>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/pulp-dev/attachments/20180416/d31508ca/attachment.htm>
More information about the Pulp-dev
mailing list