[Pulp-dev] Lazy for Pulp3

Wed Jun 6 17:09:23 UTC 2018

The config we used in pulp2 can be seen here:
https://docs.pulpproject.org/user-guide/deferred-download.html#squid

In that scenario we used a reverse proxy to do TLS termination, but I think
squid will do the TLS termination in this case. We haven't configured squid
like that before, so we'll have to verify that (a) it can terminate TLS and
(b) it will still cache data that flows over that TLS link.

If we can't have squid do the TLS termination, we'll need a reverse proxy
to do it like we did in pulp2.
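
One rough way to check point (b) is to request the same URL twice through
squid and look at the cache status on each response. This is only a sketch:
it assumes squid adds its usual X-Cache response header ("HIT from <host>" or
"MISS from <host>"), and the function names and any URL you pass in are
placeholders of mine, not part of the plan in the epic.

```python
import urllib.request


def cache_status(x_cache_header):
    """Return 'HIT' or 'MISS' from an X-Cache style header value, else None."""
    if not x_cache_header:
        return None
    tokens = x_cache_header.split()
    if not tokens:
        return None
    status = tokens[0].upper()
    return status if status in ("HIT", "MISS") else None


def fetch_twice(url):
    """Request url twice and return the cache status seen on each response."""
    statuses = []
    for _ in range(2):
        with urllib.request.urlopen(url) as response:
            response.read()  # drain the body so squid sees a complete transfer
            statuses.append(cache_status(response.headers.get("X-Cache")))
    return statuses
```

If caching works over the TLS link, calling fetch_twice() on an https URL
served by squid should report a MISS followed by a HIT.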

On Mon, Jun 4, 2018 at 12:43 PM, Milan Kovacik <mkovacik at redhat.com> wrote:

> On Thu, May 31, 2018 at 11:39 PM, Brian Bouterse <bbouters at redhat.com>
> wrote:
> > I updated the epic (https://pulp.plan.io/issues/3693) to use this new
> > language.
> >
> > policy=immediate  -> downloads now while the task runs (no lazy). Also the
> > default if unspecified.
> > policy=cache-and-save   -> All the steps in the diagram. Content that is
> > downloaded is saved so that it's only ever downloaded once.
> > policy=cache     -> All the steps in the diagram except step 14. If squid
> > pushes the bits out of the cache, it will be re-downloaded to serve to
> > other clients requesting the same bits.
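
To make the three policy values above concrete, here is a minimal sketch of
how they could be modelled. The class and function names are hypothetical
(nothing here is taken from the epic or from Pulp's code); only the policy
strings, the default, and the "step 14" distinction come from the text above.

```python
from enum import Enum


class DownloadPolicy(Enum):
    IMMEDIATE = "immediate"            # download while the sync task runs
    CACHE_AND_SAVE = "cache-and-save"  # all diagram steps, including step 14
    CACHE = "cache"                    # all diagram steps except step 14


def parse_policy(value=None):
    """Return the policy for a user-supplied string; immediate if unspecified."""
    if value is None:
        return DownloadPolicy.IMMEDIATE
    return DownloadPolicy(value)  # raises ValueError for unknown names


def downloads_during_sync(policy):
    """Only the immediate policy fetches content while the task runs."""
    return policy is DownloadPolicy.IMMEDIATE


def saves_after_streaming(policy):
    """Step 14 (saving streamed content) applies only to cache-and-save."""
    return policy is DownloadPolicy.CACHE_AND_SAVE
```
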
> >
> > Also @milan, see inline for answers to your question.
> >
> > On Wed, May 30, 2018 at 3:48 PM, Milan Kovacik <mkovacik at redhat.com>
> wrote:
> >>
> >> On Wed, May 30, 2018 at 4:50 PM, Brian Bouterse <bbouters at redhat.com>
> >> wrote:
> >> >
> >> >
> >> > On Wed, May 30, 2018 at 8:57 AM, Tom McKay <thomasmckay at redhat.com>
> >> > wrote:
> >> >>
> >> >> I think there is a use case for "proxy only" like is being described here.
> >> >> Several years ago there was a project called thumbslug[1] that was used in a
> >> >> version of katello instead of pulp. Its job was to check entitlements and
> >> >> then proxy content from a cdn. The same functionality could be implemented
> >> >> in pulp. (Perhaps it's even as simple as telling squid not to cache anything
> >> >> so the content would never make it from cache to pulp in current pulp-2.)
> >> >
> >> >
> >> > What would you call this policy?
> >> > policy=proxy?
> >> > policy=stream-dont-save?
> >> > policy=stream-no-save?
> >> >
> >> > Are the names 'on-demand' and 'immediate' clear enough? Are there better
> >> > names?
> >> >>
> >> >>
> >> >> Overall I'm +1 to the idea of an only-squid version, if others think it
> >> >> would be useful.
> >> >
> >> >
> >> > I understand describing this as an "only-squid" version, but for clarity, the
> >> > streamer would still be required because it is what requests the bits with
> >> > the correctly configured downloader (certs, proxy, etc). The streamer
> >> > streams the bits into squid which provides caching and client
> >> > multiplexing.
> >>
> >> I have to admit it's just now I'm reading
> >>
> >> https://docs.pulpproject.org/dev-guide/design/deferred-download.html#apache-reverse-proxy
> >> again because of the SSL termination. So the new plan is to use the
> >> streamer to terminate the SSL instead of the Apache reverse proxy?
> >
> >
> > The plan for right now is to not use a reverse proxy and have the client's
> > connection terminate at squid directly either via http or https depending on
> > how squid is configured. The reverse proxy in pulp2's design served to
> > validate the signed urls and rewrite them for squid. This first
> > implementation won't use signed urls. I believe that means we don't need a
> > reverse proxy here yet.
>
> I don't think I understand; so Squid will be used to terminate TLS but
> it won't be used as a reverse proxy?
>
>
>
>
> >
> >>
> >> W/r the construction of the URL of an artifact, I thought it would be
> >> stored in the DB, so the Remote would create it during the sync.
> >
> >
> > This is correct. The inbound URL from the client after the redirect will
> > still be a reference that the "Pulp content app" will resolve to a
> > RemoteArtifact. Then the streamer will use that RemoteArtifact data to
> > correctly build the downloader. That's the gist of it at least.
>
>
>
> >
> >>
> >> >
> >> > To confirm my understanding this "squid-only" policy would be the same
> >> > as
> >> > on-demand except that it would *not* perform step 14 from the diagram
> >> > here
> >> > (https://pulp.plan.io/issues/3693). Is that right?
> >> yup
> >> >
> >> >>
> >> >>
> >> >> [1] https://github.com/candlepin/thumbslug
> >> >>
> >> >> On Wed, May 30, 2018 at 8:34 AM, Milan Kovacik <mkovacik at redhat.com>
> >> >> wrote:
> >> >>>
> >> >>> On Tue, May 29, 2018 at 9:31 PM, Dennis Kliban <dkliban at redhat.com>
> >> >>> wrote:
> >> >>> > On Tue, May 29, 2018 at 11:42 AM, Milan Kovacik
> >> >>> > <mkovacik at redhat.com>
> >> >>> > wrote:
> >> >>> >>
> >> >>> >> On Tue, May 29, 2018 at 5:13 PM, Dennis Kliban <
> dkliban at redhat.com>
> >> >>> >> wrote:
> >> >>> >> > On Tue, May 29, 2018 at 10:41 AM, Milan Kovacik
> >> >>> >> > <mkovacik at redhat.com>
> >> >>> >> > wrote:
> >> >>> >> >>
> >> >>> >> >> Good point!
> >> >>> >> >> More the second; it might be a bit crazy to utilize Squid for that but
> >> >>> >> >> first, let's answer the why ;)
> >> >>> >> >> So why does Pulp need to store the content here?
> >> >>> >> >> Why don't we point the users to the Squid all the time (for the lazy
> >> >>> >> >> repos)?
> >> >>> >> >
> >> >>> >> >
> >> >>> >> > Pulp's Streamer needs to fetch and store the content because
> >> >>> >> > that's
> >> >>> >> > Pulp's
> >> >>> >> > primary responsibility.
> >> >>> >>
> >> >>> >> Maybe not that much the storing but rather the content views
> >> >>> >> management?
> >> >>> >> I mean the partitioning into repositories, promoting.
> >> >>> >>
> >> >>> >
> >> >>> > Exactly this. We want Pulp users to be able to reuse content that
> >> >>> > was
> >> >>> > brought in using the 'on_demand' download policy in other
> >> >>> > repositories.
> >> >>> I see.
> >> >>>
> >> >>> >
> >> >>> >>
> >> >>> >> > If some of the content lived in Squid and some lived
> >> >>> >> > in Pulp, it would be difficult for the user to know what content is actually
> >> >>> >> > available in Pulp and what content needs to be fetched from a remote
> >> >>> >> > repository.
> >> >>> >>
> >> >>> >> I'd say the rule of thumb would be: lazy -> squid, regular -> pulp
> >> >>> >> so not that difficult.
> >> >>> >> Maybe Pulp could have a concept of Origin, where folks upload stuff to
> >> >>> >> a Pulp repo, vs. Proxy for its repo storage policy?
> >> >>> >>
> >> >>> >
> >> >>> > Squid removes things from the cache at some point. You can probably
> >> >>> > configure it to never remove anything from the cache, but then we would need
> >> >>> > to implement orphan cleanup that would work across two systems: pulp and
> >> >>> > squid.
> >> >>>
> >> >>> Actually "remote" units wouldn't need orphan cleaning from the disk,
> >> >>> just dropping them from the DB would suffice.
> >> >>>
> >> >>> >
> >> >>> > Answering that question would still be difficult. Not all content that is in
> >> >>> > the repository that was synced using on_demand download policy will be in
> >> >>> > Squid - only the content that has been requested by clients. So it's still
> >> >>> > hard to know which of the content units have been downloaded and which have
> >> >>> > not been.
> >> >>>
> >> >>> But the beauty is exactly in that: we don't have to track whether the
> >> >>> content is downloaded if it is reverse-proxied[1][2].
> >> >>> Moreover, this would work both with and without a proxy between Pulp
> >> >>> and the Origin of the remote unit.
> >> >>> A "remote" content artifact might just need to carry its URL in a DB
> >> >>> column for this to work; so the async artifact model, instead of the
> >> >>> "policy=on-demand"  would have a mandatory remote "URL" attribute; I
> >> >>> wouldn't say it's more complex than tracking the "policy" attribute.
> >> >>>
> >> >>> >
> >> >>> >
> >> >>> >>
> >> >>> >> >
> >> >>> >> > As Pulp downloads an Artifact, it calculates all the checksums and its
> >> >>> >> > size. It then performs validation based on information that was provided
> >> >>> >> > from the RemoteArtifact. After validation is performed, the Artifact is
> >> >>> >> > saved to the database and to its final place in
> >> >>> >> > /var/lib/content/artifacts/.
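
As an illustration of the "calculate checksums and size while downloading,
then validate" step described above, here is a minimal sketch. It is not
Pulp's downloader code: the function names, the choice of sha256/sha512, and
the shape of the expected values are assumptions made for the example.

```python
import hashlib


def digest_stream(chunks, algorithms=("sha256", "sha512")):
    """Consume byte chunks once, returning (size, {algorithm: hexdigest})."""
    hashers = {name: hashlib.new(name) for name in algorithms}
    size = 0
    for chunk in chunks:
        size += len(chunk)
        for hasher in hashers.values():
            hasher.update(chunk)
    return size, {name: hasher.hexdigest() for name, hasher in hashers.items()}


def validate(size, digests, expected_size=None, expected_digests=None):
    """Raise ValueError if the download disagrees with the expected values."""
    if expected_size is not None and size != expected_size:
        raise ValueError("size mismatch: got %d, expected %d" % (size, expected_size))
    for name, expected in (expected_digests or {}).items():
        if digests.get(name) != expected:
            raise ValueError("%s digest mismatch" % name)
```

Only after validate() returns without raising would the artifact be recorded
and moved to its final location.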
> >> >>> >>
> >> >>> >> This could be still achieved by storing the content just
> >> >>> >> temporarily
> >> >>> >> in the Squid proxy i.e use Squid as the content source, not the
> >> >>> >> disk.
> >> >>> >>
> >> >>> >> > Once this information is in the database, Pulp's web server can
> >> >>> >> > serve
> >> >>> >> > the
> >> >>> >> > content without having to involve the Streamer or Squid.
> >> >>> >>
> >> >>> >> Pulp might serve just the API and the metadata, the content might
> >> >>> >> be
> >> >>> >> redirected to the Proxy all the time, correct?
> >> >>> >> Doesn't Crane do that btw?
> >> >>> >
> >> >>> >
> >> >>> > Theoretically we could do this, but in practice we would run into problems
> >> >>> > when we needed to scale out the Content app. Right now when the Content app
> >> >>> > needs to be scaled, a user can launch another machine that will run the
> >> >>> > Content app. Squid does not support that kind of scaling. Squid can only
> >> >>> > take advantage of additional cores in a single machine.
> >> >>>
> >> >>> I don't think I understand; proxies are actually designed to
> scale[1]
> >> >>> and are used as tools to scale the web too.
> >> >>>
> >> >>> This is all about the How question but when it comes to my original
> >> >>> Why, please correct me if I'm wrong, the answer so far has been:
> >> >>>  Pulp always downloads the content because that's what it is supposed to
> >> >>> do.
> >> >>>
> >> >>> Cheers,
> >> >>> milan
> >> >>>
> >> >>> [1] https://en.wikipedia.org/wiki/Reverse_proxy
> >> >>> [2] https://paste.fedoraproject.org/paste/zkBTyxZjm330FsqvPP0lIA
> >> >>> [3] https://wiki.squid-cache.org/Features/CacheHierarchy?highlight=%28faqlisted.yes%29
> >> >>>
> >> >>> >
> >> >>> >>
> >> >>> >>
> >> >>> >> Cheers,
> >> >>> >> milan
> >> >>> >>
> >> >>> >> >
> >> >>> >> > -dennis
> >> >>> >> >
> >> >>> >> >
> >> >>> >> >
> >> >>> >> >
> >> >>> >> >
> >> >>> >> >>
> >> >>> >> >>
> >> >>> >> >> --
> >> >>> >> >> cheers
> >> >>> >> >> milan
> >> >>> >> >>
> >> >>> >> >> On Tue, May 29, 2018 at 4:25 PM, Brian Bouterse
> >> >>> >> >> <bbouters at redhat.com>
> >> >>> >> >> wrote:
> >> >>> >> >> >
> >> >>> >> >> > On Mon, May 28, 2018 at 9:57 AM, Milan Kovacik
> >> >>> >> >> > <mkovacik at redhat.com>
> >> >>> >> >> > wrote:
> >> >>> >> >> >>
> >> >>> >> >> >> Hi,
> >> >>> >> >> >>
> >> >>> >> >> >> Looking at the diagram[1] I'm wondering what's the reasoning behind
> >> >>> >> >> >> Pulp having to actually fetch the content locally?
> >> >>> >> >> >
> >> >>> >> >> >
> >> >>> >> >> > Is the question "why is Pulp doing the fetching and not Squid?" or "why is
> >> >>> >> >> > Pulp storing the content after fetching it?" or both?
> >> >>> >> >> >
> >> >>> >> >> >> Couldn't Pulp just rely on the proxy with regards to the
> >> >>> >> >> >> content
> >> >>> >> >> >> streaming?
> >> >>> >> >> >>
> >> >>> >> >> >> Thanks,
> >> >>> >> >> >> milan
> >> >>> >> >> >>
> >> >>> >> >> >>
> >> >>> >> >> >> [1] https://pulp.plan.io/attachments/130957
> >> >>> >> >> >>
> >> >>> >> >> >> On Fri, May 25, 2018 at 9:11 PM, Brian Bouterse
> >> >>> >> >> >> <bbouters at redhat.com>
> >> >>> >> >> >> wrote:
> >> >>> >> >> >> > A mini-team of core devs** met to talk through lazy use
> >> >>> >> >> >> > cases
> >> >>> >> >> >> > for
> >> >>> >> >> >> > Pulp3.
> >> >>> >> >> >> > It's effectively the same lazy from Pulp2 except:
> >> >>> >> >> >> >
> >> >>> >> >> >> > * it's now built into core (not just RPM)
> >> >>> >> >> >> > * it excludes repo protection use cases because we haven't added
> >> >>> >> >> >> > repo protection to Pulp3 yet
> >> >>> >> >> >> > * it excludes the "background" policy, which, based on feedback from
> >> >>> >> >> >> > stakeholders, provided very little value
> >> >>> >> >> >> > * it will no longer depend on Twisted. It will use asyncio instead.
> >> >>> >> >> >> >
> >> >>> >> >> >> > While it is being built into core, it will require minimal work from a
> >> >>> >> >> >> > plugin writer to add support for it. Details in the epic below.
> >> >>> >> >> >> >
> >> >>> >> >> >> > The current use cases along with a technical plan are
> >> >>> >> >> >> > written
> >> >>> >> >> >> > on
> >> >>> >> >> >> > this
> >> >>> >> >> >> > epic:
> >> >>> >> >> >> > https://pulp.plan.io/issues/3693
> >> >>> >> >> >> >
> >> >>> >> >> >> > We're putting it out for comment, questions, and feedback before we start
> >> >>> >> >> >> > into the code. I hope we are able to add this into our next sprint.
> >> >>> >> >> >> >
> >> >>> >> >> >> > ** ipanova, jortel, ttereshc, dkliban, bmbouter
> >> >>> >> >> >> >
> >> >>> >> >> >> > Thanks!
> >> >>> >> >> >> > Brian
> >> >>> >> >> >> >
> >> >>> >> >> >> >
> >> >>> >> >> >> > _______________________________________________
> >> >>> >> >> >> > Pulp-dev mailing list
> >> >>> >> >> >> > Pulp-dev at redhat.com
> >> >>> >> >> >> > https://www.redhat.com/mailman/listinfo/pulp-dev
> >> >>> >> >> >> >
> >> >>> >> >> >
> >> >>> >> >> >
> >> >>> >> >>
> >> >>> >> >
> >> >>> >> >
> >> >>> >
> >> >>> >
> >> >>>
> >> >>
> >> >>
> >> >>
> >> >>
> >> >
> >
> >
>