[Pulp-dev] Lazy for Pulp3

Milan Kovacik mkovacik at redhat.com
Mon Jun 4 16:43:36 UTC 2018


On Thu, May 31, 2018 at 11:39 PM, Brian Bouterse <bbouters at redhat.com> wrote:
> I updated the epic (https://pulp.plan.io/issues/3693) to use this new
> language.
>
> policy=immediate  -> downloads now, while the task runs (no lazy). Also
> the default if unspecified.
> policy=cache-and-save   -> All the steps in the diagram. Content that is
> downloaded is saved so that it's only ever downloaded once.
> policy=cache     -> All the steps in the diagram except step 14. If squid
> pushes the bits out of the cache, they will be re-downloaded to serve
> other clients requesting the same bits.
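(For anyone skimming: a minimal sketch of how these three policies might
hang off a Remote model; every name below is an assumption of mine, not a
settled Pulp3 API.)

    from django.db import models

    class Remote(models.Model):
        IMMEDIATE = 'immediate'            # download while the sync task runs
        CACHE_AND_SAVE = 'cache-and-save'  # stream via squid, then persist
        CACHE = 'cache'                    # stream via squid, never persist

        POLICY_CHOICES = (
            (IMMEDIATE, IMMEDIATE),
            (CACHE_AND_SAVE, CACHE_AND_SAVE),
            (CACHE, CACHE),
        )
        policy = models.TextField(choices=POLICY_CHOICES, default=IMMEDIATE)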
>
> Also @milan, see inline for answers to your question.
>
> On Wed, May 30, 2018 at 3:48 PM, Milan Kovacik <mkovacik at redhat.com> wrote:
>>
>> On Wed, May 30, 2018 at 4:50 PM, Brian Bouterse <bbouters at redhat.com>
>> wrote:
>> >
>> >
>> > On Wed, May 30, 2018 at 8:57 AM, Tom McKay <thomasmckay at redhat.com>
>> > wrote:
>> >>
>> >> I think there is a use case for "proxy only" like is being described
>> >> here. Several years ago there was a project called thumbslug[1] that
>> >> was used in a version of katello instead of pulp. Its job was to
>> >> check entitlements and then proxy content from a CDN. The same
>> >> functionality could be implemented in pulp. (Perhaps it's even as
>> >> simple as telling squid not to cache anything, so the content would
>> >> never make it from cache to pulp in current pulp-2.)
>> >
>> >
>> > What would you call this policy?
>> > policy=proxy?
>> > policy=stream-dont-save?
>> > policy=stream-no-save?
>> >
>> > Are the names 'on-demand' and 'immediate' clear enough? Are there better
>> > names?
>> >>
>> >>
>> >> Overall I'm +1 to the idea of an only-squid version, if others think it
>> >> would be useful.
>> >
>> >
>> > I understand describing this as an "only-squid" version, but for
>> > clarity, the streamer would still be required because it is what
>> > requests the bits with the correctly configured downloader (certs,
>> > proxy, etc). The streamer streams the bits into squid, which provides
>> > caching and client multiplexing.
>>
>> I have to admit I'm only now re-reading
>>
>> https://docs.pulpproject.org/dev-guide/design/deferred-download.html#apache-reverse-proxy
>> because of the SSL termination question. So the new plan is to use the
>> streamer to terminate the SSL instead of the Apache reverse proxy?
>
>
> The plan for right now is to not use a reverse proxy and have the client's
> connection terminate at squid directly, via either HTTP or HTTPS depending
> on how squid is configured. The reverse proxy in Pulp2's design served to
> validate the signed URLs and rewrite them for squid. This first
> implementation won't use signed URLs. I believe that means we don't need a
> reverse proxy here yet.

I don't think I understand; so Squid will be used to terminate TLS but
it won't be used as a reverse proxy?




>
>>
>> W/r/t the construction of the URL of an artifact, I thought it would be
>> stored in the DB, so the Remote would create it during the sync.
>
>
> This is correct. The inbound URL from the client after the redirect will
> still be a reference that the "Pulp content app" will resolve to a
> RemoteArtifact. Then the streamer will use that RemoteArtifact data to
> correctly build the downloader. That's the gist of it at least.
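
For my own understanding, here's a rough sketch of that flow with
asyncio/aiohttp (the aiohttp calls are real; find_remote_artifact() and
the Remote fields are hypothetical placeholders, not the actual Pulp3
models):

    import aiohttp
    from aiohttp import web

    async def stream_artifact(request):
        # Resolve the redirected path to a RemoteArtifact (hypothetical helper).
        remote_artifact = await find_remote_artifact(request.match_info['path'])
        remote = remote_artifact.remote  # assumed to carry certs/proxy config

        # Build a downloader configured from the RemoteArtifact's Remote.
        connector = aiohttp.TCPConnector(ssl=remote.ssl_context())
        async with aiohttp.ClientSession(connector=connector) as session:
            async with session.get(remote_artifact.url,
                                   proxy=remote.proxy_url) as upstream:
                # Relay the bits to the client as they arrive.
                response = web.StreamResponse()
                await response.prepare(request)
                async for chunk in upstream.content.iter_chunked(8192):
                    await response.write(chunk)
                await response.write_eof()
                return response

    app = web.Application()
    app.add_routes([web.get('/{path:.*}', stream_artifact)])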



>
>>
>> >
>> > To confirm my understanding: this "squid-only" policy would be the
>> > same as on-demand except that it would *not* perform step 14 from the
>> > diagram here (https://pulp.plan.io/issues/3693). Is that right?
>> yup
>> >
>> >>
>> >>
>> >> [1] https://github.com/candlepin/thumbslug
>> >>
>> >> On Wed, May 30, 2018 at 8:34 AM, Milan Kovacik <mkovacik at redhat.com>
>> >> wrote:
>> >>>
>> >>> On Tue, May 29, 2018 at 9:31 PM, Dennis Kliban <dkliban at redhat.com>
>> >>> wrote:
>> >>> > On Tue, May 29, 2018 at 11:42 AM, Milan Kovacik
>> >>> > <mkovacik at redhat.com>
>> >>> > wrote:
>> >>> >>
>> >>> >> On Tue, May 29, 2018 at 5:13 PM, Dennis Kliban <dkliban at redhat.com>
>> >>> >> wrote:
>> >>> >> > On Tue, May 29, 2018 at 10:41 AM, Milan Kovacik
>> >>> >> > <mkovacik at redhat.com>
>> >>> >> > wrote:
>> >>> >> >>
>> >>> >> >> Good point!
>> >>> >> More the second; it might be a bit crazy to utilize Squid for
>> >>> >> that, but first, let's answer the why ;)
>> >>> >> So why does Pulp need to store the content here?
>> >>> >> Why don't we point the users at Squid all the time (for the lazy
>> >>> >> repos)?
>> >>> >> >
>> >>> >> >
>> >>> >> > Pulp's Streamer needs to fetch and store the content because
>> >>> >> > that's Pulp's primary responsibility.
>> >>> >>
>> >>> >> Maybe not so much the storing as the content view management?
>> >>> >> I mean the partitioning into repositories, the promoting.
>> >>> >>
>> >>> >
>> >>> > Exactly this. We want Pulp users to be able to reuse content that
>> >>> > was
>> >>> > brought in using the 'on_demand' download policy in other
>> >>> > repositories.
>> >>> I see.
>> >>>
>> >>> >
>> >>> >>
>> >>> >> > If some of the content lived in Squid and some lived in Pulp,
>> >>> >> > it would be difficult for the user to know what content is
>> >>> >> > actually available in Pulp and what content needs to be fetched
>> >>> >> > from a remote repository.
>> >>> >>
>> >>> >> I'd say the rule of thumb would be: lazy -> squid, regular ->
>> >>> >> pulp, so not that difficult.
>> >>> >> Maybe Pulp could have a concept of Origin, where folks upload
>> >>> >> stuff to a Pulp repo, vs. Proxy, for its repo storage policy?
>> >>> >>
>> >>> >
>> >>> > Squid removes things from the cache at some point. You can
>> >>> > probably configure it to never remove anything from the cache, but
>> >>> > then we would need to implement orphan cleanup that would work
>> >>> > across two systems: pulp and squid.
>> >>>
>> >>> Actually, "remote" units wouldn't need orphan cleaning from the
>> >>> disk; just dropping them from the DB would suffice.
>> >>>
>> >>> >
>> >>> > Answering that question would still be difficult. Not all content
>> >>> > that is in the repository that was synced using the on_demand
>> >>> > download policy will be in Squid - only the content that has been
>> >>> > requested by clients. So it's still hard to know which of the
>> >>> > content units have been downloaded and which have not.
>> >>>
>> >>> But the beauty is exactly in that: we don't have to track whether the
>> >>> content is downloaded if it is reverse-proxied[1][2].
>> >>> Moreover, this would work both with and without a proxy between Pulp
>> >>> and the Origin of the remote unit.
>> >>> A "remote" content artifact might just need to carry its URL in a DB
>> >>> column for this to work; so the async artifact model, instead of the
>> >>> "policy=on-demand", would have a mandatory remote "URL" attribute; I
>> >>> wouldn't say it's more complex than tracking the "policy" attribute.
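
In code terms, what I had in mind is as simple as this (the field names
are illustrative assumptions only):

    from django.db import models

    class RemoteArtifact(models.Model):
        # Where the bits live upstream; mandatory for "remote" artifacts.
        url = models.TextField()
        # Expected digest and size, so streamed bits can still be validated.
        sha256 = models.CharField(max_length=64)
        size = models.BigIntegerField(null=True)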
>> >>>
>> >>> >
>> >>> >
>> >>> >>
>> >>> >> >
>> >>> >> > As Pulp downloads an Artifact, it calculates all the checksums
>> >>> >> > and its size. It then performs validation based on information
>> >>> >> > that was provided from the RemoteArtifact. After validation is
>> >>> >> > performed, the Artifact is saved to the database and moved to
>> >>> >> > its final place in /var/lib/content/artifacts/.
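
For concreteness, here's how I read that validate-then-save step (the
helper name, field names, and final-path layout below are my guesses):

    import hashlib
    import shutil

    def validate_and_save(tmp_path, remote_artifact):
        digest = hashlib.sha256()
        size = 0
        with open(tmp_path, 'rb') as f:
            # Hash and measure the downloaded file in 1 MiB chunks.
            for chunk in iter(lambda: f.read(1024 * 1024), b''):
                digest.update(chunk)
                size += len(chunk)
        # Validate against what the RemoteArtifact promised.
        if digest.hexdigest() != remote_artifact.sha256:
            raise ValueError('checksum mismatch')
        if remote_artifact.size is not None and size != remote_artifact.size:
            raise ValueError('size mismatch')
        # Move into the artifact store (this layout is a guess).
        final_path = '/var/lib/content/artifacts/' + digest.hexdigest()
        shutil.move(tmp_path, final_path)
        return final_path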
>> >>> >>
>> >>> >> This could still be achieved by storing the content just
>> >>> >> temporarily in the Squid proxy, i.e. use Squid as the content
>> >>> >> source, not the disk.
>> >>> >>
>> >>> >> > Once this information is in the database, Pulp's web server can
>> >>> >> > serve
>> >>> >> > the
>> >>> >> > content without having to involve the Streamer or Squid.
>> >>> >>
>> >>> >> Pulp might serve just the API and the metadata, while the content
>> >>> >> might be redirected to the Proxy all the time, correct?
>> >>> >> Doesn't Crane do that, btw?
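
(To illustrate what I mean by redirecting the content all the time; the
proxy URL and handler below are hypothetical:)

    from aiohttp import web

    SQUID_BASE = 'https://squid.example.com'  # assumed caching-proxy front end

    async def content_redirect(request):
        # Always send clients to the caching proxy for the actual bits;
        # Pulp itself would only serve the API and the metadata.
        raise web.HTTPFound(SQUID_BASE + request.rel_url.path)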
>> >>> >
>> >>> >
>> >>> > Theoretically we could do this, but in practice we would run into
>> >>> > problems when we needed to scale out the Content app. Right now,
>> >>> > when the Content app needs to be scaled, a user can launch another
>> >>> > machine that will run the Content app. Squid does not support that
>> >>> > kind of scaling. Squid can only take advantage of additional cores
>> >>> > in a single machine.
>> >>>
>> >>> I don't think I understand; proxies are actually designed to
>> >>> scale[1][3] and are used as tools to scale the web, too.
>> >>>
>> >>> This is all about the How question, but when it comes to my original
>> >>> Why, the answer so far has been (please correct me if I'm wrong):
>> >>> Pulp always downloads the content because that's what it is supposed
>> >>> to do.
>> >>>
>> >>> Cheers,
>> >>> milan
>> >>>
>> >>> [1] https://en.wikipedia.org/wiki/Reverse_proxy
>> >>> [2] https://paste.fedoraproject.org/paste/zkBTyxZjm330FsqvPP0lIA
>> >>> [3] https://wiki.squid-cache.org/Features/CacheHierarchy?highlight=%28faqlisted.yes%29
>> >>>
>> >>> >
>> >>> >>
>> >>> >>
>> >>> >> Cheers,
>> >>> >> milan
>> >>> >>
>> >>> >> >
>> >>> >> > -dennis
>> >>> >> >
>> >>> >> >
>> >>> >> >
>> >>> >> >
>> >>> >> >
>> >>> >> >>
>> >>> >> >>
>> >>> >> >> --
>> >>> >> >> cheers
>> >>> >> >> milan
>> >>> >> >>
>> >>> >> >> On Tue, May 29, 2018 at 4:25 PM, Brian Bouterse
>> >>> >> >> <bbouters at redhat.com>
>> >>> >> >> wrote:
>> >>> >> >> >
>> >>> >> >> > On Mon, May 28, 2018 at 9:57 AM, Milan Kovacik
>> >>> >> >> > <mkovacik at redhat.com>
>> >>> >> >> > wrote:
>> >>> >> >> >>
>> >>> >> >> >> Hi,
>> >>> >> >> >>
>> >>> >> >> >> Looking at the diagram[1], I'm wondering: what's the
>> >>> >> >> >> reasoning behind Pulp having to actually fetch the content
>> >>> >> >> >> locally?
>> >>> >> >> >
>> >>> >> >> >
>> >>> >> >> > Is the question "why is Pulp doing the fetching and not
>> >>> >> >> > Squid?" or "why is Pulp storing the content after fetching
>> >>> >> >> > it?" or both?
>> >>> >> >> >
>> >>> >> >> >> Couldn't Pulp just rely on the proxy with regard to the
>> >>> >> >> >> content streaming?
>> >>> >> >> >>
>> >>> >> >> >> Thanks,
>> >>> >> >> >> milan
>> >>> >> >> >>
>> >>> >> >> >>
>> >>> >> >> >> [1] https://pulp.plan.io/attachments/130957
>> >>> >> >> >>
>> >>> >> >> >> On Fri, May 25, 2018 at 9:11 PM, Brian Bouterse
>> >>> >> >> >> <bbouters at redhat.com>
>> >>> >> >> >> wrote:
>> >>> >> >> >> > A mini-team of core devs** met to talk through lazy use
>> >>> >> >> >> > cases for Pulp3. It's effectively the same lazy from
>> >>> >> >> >> > Pulp2, except:
>> >>> >> >> >> >
>> >>> >> >> >> > * it's now built into core (not just RPM)
>> >>> >> >> >> > * it excludes repo protection use cases, because we
>> >>> >> >> >> > haven't added repo protection to Pulp3 yet
>> >>> >> >> >> > * it excludes the "background" policy, which, based on
>> >>> >> >> >> > feedback from stakeholders, provided very little value
>> >>> >> >> >> > * it will no longer depend on Twisted as a dependency; it
>> >>> >> >> >> > will use asyncio instead
>> >>> >> >> >> >
>> >>> >> >> >> > While it is being built into core, it will require only
>> >>> >> >> >> > minimal work by a plugin writer to add support for it.
>> >>> >> >> >> > Details in the epic below.
>> >>> >> >> >> >
>> >>> >> >> >> > The current use cases, along with a technical plan, are
>> >>> >> >> >> > written on this epic:
>> >>> >> >> >> > https://pulp.plan.io/issues/3693
>> >>> >> >> >> >
>> >>> >> >> >> > We're putting it out for comments, questions, and
>> >>> >> >> >> > feedback before we start on the code. I hope we are able
>> >>> >> >> >> > to add this to our next sprint.
>> >>> >> >> >> >
>> >>> >> >> >> > ** ipanova, jortel, ttereshc, dkliban, bmbouter
>> >>> >> >> >> >
>> >>> >> >> >> > Thanks!
>> >>> >> >> >> > Brian
>> >>> >> >> >> >
>> >>> >> >> >> >
>> >>> >> >> >> >
>> >>> >> >> >
>> >>> >> >> >
>> >>> >> >>
>> >>> >> >
>> >>> >> >
>> >>> >
>> >>> >
>> >>>
>> >>
>> >>
>> >>
>> >>
>> >
>
>



