[Pulp-dev] Lazy for Pulp3

Brian Bouterse bbouters at redhat.com
Wed Jun 13 17:05:35 UTC 2018


@ipanova, +1 to your names; I updated the epic.

FYI, I updated the epic in several ways to allow for the "cache_only"
option in the design.

I added a new task to also add "policy" to ContentUnit so the streamer can
know what to do: https://pulp.plan.io/issues/3763
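
To make the idea concrete, the field could look something like this (an
illustrative sketch only; the model, field, and choice names are my
assumptions, not the actual change tracked in the issue):

    # Hypothetical Django sketch; names are illustrative, not Pulp3 code.
    from django.db import models

    class ContentUnit(models.Model):
        POLICY_CHOICES = (
            ('immediate', 'download now while the sync task runs'),
            ('on_demand', 'download on first request and save'),
            ('cache_only', 'stream through the cache, never save'),
        )

        # The streamer reads this to decide whether to save the bits.
        policy = models.TextField(choices=POLICY_CHOICES, default='immediate')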

Other updates to allow for "cache_only":
https://pulp.plan.io/issues/3695#note-2
https://pulp.plan.io/issues/3699#note-3
https://pulp.plan.io/issues/3693



On Thu, Jun 7, 2018 at 5:10 AM, Ina Panova <ipanova at redhat.com> wrote:

> we could try to go with:
>
> policy=immediate  -> downloads now while the task runs (no lazy). Also the
> default if unspecified.
> policy=on_demand   -> All the steps in the diagram. Content that is
> downloaded is saved so that it's only ever downloaded once.
> policy=cache_only     -> All the steps in the diagram except step 14. If
> squid pushes the bits out of the cache, it will be re-downloaded to
> serve other clients requesting the same bits.
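>
> As a rough sketch of how a user might pick one of these on the REST API
> (the endpoint path, port, and credentials below are assumptions for
> illustration, not a settled interface):
>
>     import requests
>
>     # Hypothetical call creating a remote with a lazy policy.
>     response = requests.post(
>         'http://localhost:24817/pulp/api/v3/remotes/file/file/',
>         auth=('admin', 'password'),
>         json={
>             'name': 'lazy-example',
>             'url': 'https://fixtures.example.com/PULP_MANIFEST',
>             'policy': 'on_demand',  # or 'immediate' / 'cache_only'
>         },
>     )
>     response.raise_for_status()
>     print(response.json())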
>
>
>
> --------
> Regards,
>
> Ina Panova
> Software Engineer| Pulp| Red Hat Inc.
>
> "Do not go where the path may lead,
>  go instead where there is no path and leave a trail."
>
> On Fri, Jun 1, 2018 at 12:36 AM, Jeff Ortel <jortel at redhat.com> wrote:
>
>>
>>
>> On 05/31/2018 04:39 PM, Brian Bouterse wrote:
>>
>> I updated the epic (https://pulp.plan.io/issues/3693) to use this new
>> language.
>>
>> policy=immediate  -> downloads now while the task runs (no lazy). Also
>> the default if unspecified.
>> policy=cache-and-save   -> All the steps in the diagram. Content that is
>> downloaded is saved so that it's only ever downloaded once.
>> policy=cache     -> All the steps in the diagram except step 14. If squid
>> pushes the bits out of the cache, it will be re-downloaded to serve
>> other clients requesting the same bits.
>>
>>
>> These policy names strike me as an odd, non-intuitive mixture. I think we
>> need to brainstorm on policy names and/or additional attributes to best
>> capture this. I suggest the epic be updated to describe the "modes" or
>> use cases without the names for now. I'll try to follow up with other
>> suggestions.
>>
>>
>>
>> Also @milan, see inline for answers to your questions.
>>
>> On Wed, May 30, 2018 at 3:48 PM, Milan Kovacik <mkovacik at redhat.com>
>> wrote:
>>
>>> On Wed, May 30, 2018 at 4:50 PM, Brian Bouterse <bbouters at redhat.com>
>>> wrote:
>>> >
>>> >
>>> > On Wed, May 30, 2018 at 8:57 AM, Tom McKay <thomasmckay at redhat.com>
>>> wrote:
>>> >>
>>> >> I think there is a use case for "proxy only" like the one being
>>> >> described here.
>>> >> Several years ago there was a project called thumbslug[1] that was
>>> >> used in a version of katello instead of pulp. Its job was to check
>>> >> entitlements and then proxy content from a CDN. The same functionality
>>> >> could be implemented in pulp. (Perhaps it's even as simple as telling
>>> >> squid not to cache anything so the content would never make it from
>>> >> cache to pulp in current pulp-2.)
>>> >
>>> >
>>> > What would you call this policy?
>>> > policy=proxy?
>>> > policy=stream-dont-save?
>>> > policy=stream-no-save?
>>> >
>>> > Are the names 'on-demand' and 'immediate' clear enough? Are there
>>> > better names?
>>> >>
>>> >>
>>> >> Overall I'm +1 to the idea of an only-squid version, if others
>>> >> think it would be useful.
>>> >
>>> >
>>> > I understand describing this as an "only-squid" version, but for
>>> > clarity, the streamer would still be required because it is what
>>> > requests the bits with the correctly configured downloader (certs,
>>> > proxy, etc). The streamer streams the bits into squid, which provides
>>> > caching and client multiplexing.
>>>
>>> I have to admit it's only now that I'm re-reading
>>> https://docs.pulpproject.org/dev-guide/design/deferred-download.html#apache-reverse-proxy
>>> again because of the SSL termination. So the new plan is to use the
>>> streamer to terminate the SSL instead of the Apache reverse proxy?
>>>
>>
>> The plan for right now is to not use a reverse proxy and to have the
>> client's connection terminate at squid directly, either via http or https
>> depending on how squid is configured. The reverse proxy in pulp2's design
>> served to validate the signed URLs and rewrite them for squid. This first
>> implementation won't use signed URLs. I believe that means we don't need a
>> reverse proxy here yet.
>>
>>
>>> W/r/t the construction of the URL of an artifact, I thought it would be
>>> stored in the DB, so the Remote would create it during the sync.
>>>
>>
>> This is correct. The inbound URL from the client after the redirect will
>> still be a reference that the "Pulp content app" will resolve to a
>> RemoteArtifact. Then the streamer will use that RemoteArtifact data to
>> correctly build the downloader. That's the gist of it at least.
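>>
>> In rough pseudocode (hypothetical names, just to sketch that flow; not
>> the actual implementation):
>>
>>     from dataclasses import dataclass
>>
>>     @dataclass
>>     class RemoteArtifact:
>>         relative_path: str    # what the client requested
>>         url: str              # upstream location recorded at sync time
>>         ssl_client_cert: str  # connection settings from the Remote
>>         proxy_url: str
>>
>>     def resolve(path, remote_artifacts):
>>         # The content app maps the requested path to a RemoteArtifact.
>>         return next(ra for ra in remote_artifacts
>>                     if ra.relative_path == path)
>>
>>     def downloader_config(ra):
>>         # The streamer builds its downloader from the RemoteArtifact data.
>>         return {'url': ra.url, 'cert': ra.ssl_client_cert,
>>                 'proxy': ra.proxy_url}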
>>
>>
>>> >
>>> > To confirm my understanding, this "squid-only" policy would be the
>>> > same as on-demand except that it would *not* perform step 14 from the
>>> > diagram here (https://pulp.plan.io/issues/3693). Is that right?
>>> yup
>>> >
>>> >>
>>> >>
>>> >> [1] https://github.com/candlepin/thumbslug
>>> >>
>>> >> On Wed, May 30, 2018 at 8:34 AM, Milan Kovacik <mkovacik at redhat.com>
>>> >> wrote:
>>> >>>
>>> >>> On Tue, May 29, 2018 at 9:31 PM, Dennis Kliban <dkliban at redhat.com>
>>> >>> wrote:
>>> >>> > On Tue, May 29, 2018 at 11:42 AM, Milan Kovacik <
>>> mkovacik at redhat.com>
>>> >>> > wrote:
>>> >>> >>
>>> >>> >> On Tue, May 29, 2018 at 5:13 PM, Dennis Kliban <
>>> dkliban at redhat.com>
>>> >>> >> wrote:
>>> >>> >> > On Tue, May 29, 2018 at 10:41 AM, Milan Kovacik
>>> >>> >> > <mkovacik at redhat.com>
>>> >>> >> > wrote:
>>> >>> >> >>
>>> >>> >> >> Good point!
>>> >>> >> >> More the second; it might be a bit crazy to utilize Squid for
>>> >>> >> >> that, but first, let's answer the why ;)
>>> >>> >> >> So why does Pulp need to store the content here?
>>> >>> >> >> Why don't we point the users to Squid all the time (for the
>>> >>> >> >> lazy repos)?
>>> >>> >> >
>>> >>> >> >
>>> >>> >> > Pulp's Streamer needs to fetch and store the content because
>>> >>> >> > that's Pulp's primary responsibility.
>>> >>> >>
>>> >>> >> Maybe not so much the storing but rather the content view
>>> >>> >> management? I mean the partitioning into repositories, promoting.
>>> >>> >>
>>> >>> >
>>> >>> > Exactly this. We want Pulp users to be able to reuse content that
>>> >>> > was brought in using the 'on_demand' download policy in other
>>> >>> > repositories.
>>> >>> I see.
>>> >>>
>>> >>> >
>>> >>> >>
>>> >>> >> > If some of the content lived in Squid and some lived in Pulp,
>>> >>> >> > it would be difficult for the user to know what content is
>>> >>> >> > actually available in Pulp and what content needs to be fetched
>>> >>> >> > from a remote repository.
>>> >>> >>
>>> >>> >> I'd say the rule of thumb would be: lazy -> squid, regular ->
>>> pulp
>>> >>> >> so not that difficult.
>>> >>> >> Maybe Pulp could have a concept of Origin, where folks upload
>>> stuff to
>>> >>> >> a Pulp repo, vs. Proxy for its repo storage policy?
>>> >>> >>
>>> >>> >
>>> >>> > Squid removes things from the cache at some point. You can probably
>>> >>> > configure it to never remove anything from the cache, but then we
>>> >>> > would need to implement orphan cleanup that would work across two
>>> >>> > systems: pulp and squid.
>>> >>>
>>> >>> Actually "remote" units wouldn't need orphan cleaning from the disk;
>>> >>> just dropping them from the DB would suffice.
>>> >>>
>>> >>> >
>>> >>> > Answering that question would still be difficult. Not all content
>>> >>> > that is in the repository that was synced using the on_demand
>>> >>> > download policy will be in Squid - only the content that has been
>>> >>> > requested by clients. So it's still hard to know which of the
>>> >>> > content units have been downloaded and which have not been.
>>> >>>
>>> >>> But the beauty is exactly in that: we don't have to track whether the
>>> >>> content is downloaded if it is reverse-proxied[1][2].
>>> >>> Moreover, this would work both with and without a proxy between Pulp
>>> >>> and the Origin of the remote unit.
>>> >>> A "remote" content artifact might just need to carry it's URL in a DB
>>> >>> column for this to work; so the async artifact model, instead of the
>>> >>> "policy=on-demand"  would have a mandatory remote "URL" attribute; I
>>> >>> wouldn't say it's more complex than tracking the "policy" attribute.
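>>> >>>
>>> >>> Roughly like this (a hypothetical sketch; the field name is mine,
>>> >>> not a concrete schema proposal):
>>> >>>
>>> >>>     from django.db import models
>>> >>>
>>> >>>     class RemoteArtifact(models.Model):
>>> >>>         # A mandatory upstream URL instead of a per-unit "policy"
>>> >>>         # flag; requests for rows like this always get redirected.
>>> >>>         url = models.TextField()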
>>> >>>
>>> >>> >
>>> >>> >
>>> >>> >>
>>> >>> >> >
>>> >>> >> > As Pulp downloads an Artifact, it calculates all the checksums
>>> >>> >> > and its size. It then performs validation based on information
>>> >>> >> > that was provided by the RemoteArtifact. After validation is
>>> >>> >> > performed, the Artifact is saved to the database and to its
>>> >>> >> > final place in /var/lib/content/artifacts/.
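>>> >>> >> > In sketch form (simplified and hypothetical, not the actual
>>> >>> >> > downloader code), that flow is:
>>> >>> >> >
>>> >>> >> >     import hashlib, os, shutil, tempfile
>>> >>> >> >
>>> >>> >> >     def download_and_validate(chunks, expected_sha256, artifacts_dir):
>>> >>> >> >         # Digest and size are computed while the bits stream
>>> >>> >> >         # through, then checked against what the RemoteArtifact
>>> >>> >> >         # promised before the file is moved into place.
>>> >>> >> >         digest, size = hashlib.sha256(), 0
>>> >>> >> >         with tempfile.NamedTemporaryFile(delete=False) as tmp:
>>> >>> >> >             for chunk in chunks:
>>> >>> >> >                 digest.update(chunk)
>>> >>> >> >                 size += len(chunk)
>>> >>> >> >                 tmp.write(chunk)
>>> >>> >> >         if digest.hexdigest() != expected_sha256:
>>> >>> >> >             os.unlink(tmp.name)
>>> >>> >> >             raise ValueError('checksum mismatch')
>>> >>> >> >         path = os.path.join(artifacts_dir, digest.hexdigest())
>>> >>> >> >         shutil.move(tmp.name, path)
>>> >>> >> >         return path, size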
>>> >>> >>
>>> >>> >> This could still be achieved by storing the content just
>>> >>> >> temporarily in the Squid proxy, i.e. use Squid as the content
>>> >>> >> source, not the disk.
>>> >>> >>
>>> >>> >> > Once this information is in the database, Pulp's web server
>>> >>> >> > can serve the content without having to involve the Streamer
>>> >>> >> > or Squid.
>>> >>> >>
>>> >>> >> Pulp might serve just the API and the metadata; the content
>>> >>> >> might be redirected to the Proxy all the time, correct?
>>> >>> >> Doesn't Crane do that btw?
>>> >>> >
>>> >>> >
>>> >>> > Theoretically we could do this, but in practice we would run into
>>> >>> > problems when we needed to scale out the Content app. Right now,
>>> >>> > when the Content app needs to be scaled, a user can launch another
>>> >>> > machine that will run the Content app. Squid does not support that
>>> >>> > kind of scaling. Squid can only take advantage of additional cores
>>> >>> > in a single machine.
>>> >>>
>>> >>> I don't think I understand; proxies are actually designed to scale[1]
>>> >>> and are used as tools to scale the web too.
>>> >>>
>>> >>> This is all about the How question, but when it comes to my original
>>> >>> Why, please correct me if I'm wrong, the answer so far has been:
>>> >>> Pulp always downloads the content because that's what it is supposed
>>> >>> to do.
>>> >>>
>>> >>> Cheers,
>>> >>> milan
>>> >>>
>>> >>> [1] https://en.wikipedia.org/wiki/Reverse_proxy
>>> >>> [2] https://paste.fedoraproject.org/paste/zkBTyxZjm330FsqvPP0lIA
>>> >>> [3]
>>> >>> https://wiki.squid-cache.org/Features/CacheHierarchy?highlight=%28faqlisted.yes%29
>>> >>>
>>> >>> >
>>> >>> >>
>>> >>> >>
>>> >>> >> Cheers,
>>> >>> >> milan
>>> >>> >>
>>> >>> >> >
>>> >>> >> > -dennis
>>> >>> >> >
>>> >>> >> >
>>> >>> >> >
>>> >>> >> >
>>> >>> >> >
>>> >>> >> >>
>>> >>> >> >>
>>> >>> >> >> --
>>> >>> >> >> cheers
>>> >>> >> >> milan
>>> >>> >> >>
>>> >>> >> >> On Tue, May 29, 2018 at 4:25 PM, Brian Bouterse
>>> >>> >> >> <bbouters at redhat.com>
>>> >>> >> >> wrote:
>>> >>> >> >> >
>>> >>> >> >> > On Mon, May 28, 2018 at 9:57 AM, Milan Kovacik
>>> >>> >> >> > <mkovacik at redhat.com>
>>> >>> >> >> > wrote:
>>> >>> >> >> >>
>>> >>> >> >> >> Hi,
>>> >>> >> >> >>
>>> >>> >> >> >> Looking at the diagram[1], I'm wondering what's the
>>> >>> >> >> >> reasoning behind Pulp having to actually fetch the content
>>> >>> >> >> >> locally?
>>> >>> >> >> >
>>> >>> >> >> >
>>> >>> >> >> > Is the question "why is Pulp doing the fetching and not
>>> >>> >> >> > Squid?" or "why is Pulp storing the content after fetching
>>> >>> >> >> > it?" or both?
>>> >>> >> >> >
>>> >>> >> >> >> Couldn't Pulp just rely on the proxy with regard to the
>>> >>> >> >> >> content streaming?
>>> >>> >> >> >>
>>> >>> >> >> >> Thanks,
>>> >>> >> >> >> milan
>>> >>> >> >> >>
>>> >>> >> >> >>
>>> >>> >> >> >> [1] https://pulp.plan.io/attachments/130957
>>> >>> >> >> >>
>>> >>> >> >> >> On Fri, May 25, 2018 at 9:11 PM, Brian Bouterse
>>> >>> >> >> >> <bbouters at redhat.com>
>>> >>> >> >> >> wrote:
>>> >>> >> >> >> > A mini-team of core devs** met to talk through lazy use
>>> >>> >> >> >> > cases for Pulp3. It's effectively the same lazy from Pulp2,
>>> >>> >> >> >> > except:
>>> >>> >> >> >> >
>>> >>> >> >> >> > * it's now built into core (not just RPM)
>>> >>> >> >> >> > * It excludes repo protection use cases because we haven't
>>> >>> >> >> >> > added repo protection to Pulp3 yet
>>> >>> >> >> >> > * It excludes the "background" policy which, based on
>>> >>> >> >> >> > feedback from stakeholders, provided very little value
>>> >>> >> >> >> > * it will no longer depend on Twisted; it will use
>>> >>> >> >> >> > asyncio instead.
>>> >>> >> >> >> >
>>> >>> >> >> >> > While it is being built into core, it will require minimal
>>> >>> >> >> >> > work by a plugin writer to add support for it. Details in
>>> >>> >> >> >> > the epic below.
>>> >>> >> >> >> >
>>> >>> >> >> >> > The current use cases, along with a technical plan, are
>>> >>> >> >> >> > written up in this epic:
>>> >>> >> >> >> > https://pulp.plan.io/issues/3693
>>> >>> >> >> >> >
>>> >>> >> >> >> > We're putting it out for comments, questions, and feedback
>>> >>> >> >> >> > before we start on the code. I hope we are able to add this
>>> >>> >> >> >> > to our next sprint.
>>> >>> >> >> >> >
>>> >>> >> >> >> > ** ipanova, jortel, ttereshc, dkliban, bmbouter
>>> >>> >> >> >> >
>>> >>> >> >> >> > Thanks!
>>> >>> >> >> >> > Brian
>>> >>> >> >> >> >
>>> >>> >> >> >> >
>>> >>> >> >> >> >
>>> >>> >> >> >
>>> >>> >> >> >
>>> >>> >> >>
>>> >>> >> >
>>> >>> >> >
>>> >>> >
>>> >>> >
>>> >>>
>>> >>
>>> >>
>>> >>
>>> >>
>>> >
>>>
>>
>>
>>
>>
>>
>>
>>
>>
>
> _______________________________________________
> Pulp-dev mailing list
> Pulp-dev at redhat.com
> https://www.redhat.com/mailman/listinfo/pulp-dev
>
>

