[Pulp-dev] Lazy for Pulp3

Tom McKay thomasmckay at redhat.com
Wed May 30 16:43:25 UTC 2018


No opinion on the name; Foreman will call it whatever it wants in the
front-end user experience. Devs working on the pulp-2 to pulp-3 Foreman
transition may want to keep the existing names.

Yes, I'd say everything but step 14 in that diagram. In addition, I would
ensure that the squid cache size is configurable down to zero so that it is
effectively a straight pull-through.

I assume that all pulp-3 content types will have this as an option as well,
if the type supports it? I want a straight proxy of container images, for
example, a straight proxy of files, etc.

On Wed, May 30, 2018 at 11:34 AM, Brian Bouterse <bbouters at redhat.com>
wrote:

> Actually, what about these as names?
>
> policy=immediate  -> downloads now while the task runs (no lazy). Also the
> default if unspecified.
> policy=cache-and-save   -> All the steps in the diagram. Content that is
> downloaded is saved so that it's only ever downloaded once.
> policy=cache     -> All the steps in the diagram except step 14. If squid
> pushes the bits out of the cache, it will be re-downloaded again to serve
> to other clients requesting the same bits.
>
> If ^ is better I can update the stories. Other naming ideas and use cases
> are welcome.
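>
> To make the mapping concrete, here is a sketch of how the proposed values
> could surface as model choices (hypothetical constant and label names; only
> the three policy values come from this thread):
>
>     # Hypothetical Django-style choices tuple; only the value strings
>     # ('immediate', 'cache-and-save', 'cache') come from this proposal.
>     DOWNLOAD_POLICIES = (
>         ('immediate', 'download now, while the task runs'),
>         ('cache-and-save', 'download on first request, then keep'),
>         ('cache', 'stream through squid; re-download if evicted'),
>     )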
>
> Thanks,
> Brian
>
> On Wed, May 30, 2018 at 10:50 AM, Brian Bouterse <bbouters at redhat.com>
> wrote:
>
>>
>>
>> On Wed, May 30, 2018 at 8:57 AM, Tom McKay <thomasmckay at redhat.com>
>> wrote:
>>
>>> I think there is a use case for "proxy only" like the one being described
>>> here. Several years ago there was a project called thumbslug[1] that was
>>> used in a version of katello instead of pulp. Its job was to check
>>> entitlements and then proxy content from a CDN. The same functionality
>>> could be implemented in pulp. (Perhaps it's even as simple as telling squid
>>> not to cache anything, so the content would never make it from the cache
>>> into pulp as it does in current pulp-2.)
>>>
>>
>> What would you call this policy?
>> policy=proxy?
>> policy=stream-dont-save?
>> policy=stream-no-save?
>>
>> Are the names 'on-demand' and 'immediate' clear enough? Are there better
>> names?
>>
>>>
>>> Overall I'm +1 to the idea of an only-squid version, if others think it
>>> would be useful.
>>>
>>
>> I understand describing this as an "only-squid" version, but for clarity,
>> the streamer would still be required because it is what requests the bits
>> with the correctly configured downloader (certs, proxy, etc.). The streamer
>> streams the bits into squid, which provides caching and client multiplexing.
>>
>> To confirm my understanding, this "squid-only" policy would be the same as
>> on-demand except that it would *not* perform step 14 from the diagram here
>> (https://pulp.plan.io/issues/3693). Is that right?
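>>
>> As a sketch of that flow (assuming aiohttp on top of the asyncio plan
>> mentioned later in this thread; all names here are illustrative, not the
>> actual streamer code):
>>
>>     import aiohttp
>>     from aiohttp import web
>>
>>     async def stream_through(request, upstream_url, save=False):
>>         # The real streamer would configure this session from the
>>         # remote's settings (certs, proxy, etc.), omitted here.
>>         async with aiohttp.ClientSession() as session:
>>             async with session.get(upstream_url) as upstream:
>>                 response = web.StreamResponse(status=upstream.status)
>>                 await response.prepare(request)
>>                 async for chunk in upstream.content.iter_chunked(8192):
>>                     if save:
>>                         pass  # step 14 would persist the chunk here
>>                     await response.write(chunk)
>>                 await response.write_eof()
>>                 return response
>>
>> With save=False this is the "stream but never persist" behavior; squid in
>> front of it provides the caching and client multiplexing.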
>>
>>
>>>
>>> [1] https://github.com/candlepin/thumbslug
>>>
>>> On Wed, May 30, 2018 at 8:34 AM, Milan Kovacik <mkovacik at redhat.com>
>>> wrote:
>>>
>>>> On Tue, May 29, 2018 at 9:31 PM, Dennis Kliban <dkliban at redhat.com>
>>>> wrote:
>>>> > On Tue, May 29, 2018 at 11:42 AM, Milan Kovacik <mkovacik at redhat.com> wrote:
>>>> >>
>>>> >> On Tue, May 29, 2018 at 5:13 PM, Dennis Kliban <dkliban at redhat.com> wrote:
>>>> >> > On Tue, May 29, 2018 at 10:41 AM, Milan Kovacik <mkovacik at redhat.com> wrote:
>>>> >> >>
>>>> >> >> Good point!
>>>> >> >> More the second; it might be a bit crazy to utilize Squid for
>>>> >> >> that, but first, let's answer the why ;)
>>>> >> >> So why does Pulp need to store the content here?
>>>> >> >> Why don't we point the users to the Squid all the time (for the
>>>> >> >> lazy repos)?
>>>> >> >
>>>> >> >
>>>> >> > Pulp's Streamer needs to fetch and store the content because
>>>> >> > that's Pulp's primary responsibility.
>>>> >>
>>>> >> Maybe not so much the storing, but rather the content-views
>>>> >> management? I mean the partitioning into repositories, promoting.
>>>> >>
>>>> >
>>>> > Exactly this. We want Pulp users to be able to reuse content that was
>>>> > brought in using the 'on_demand' download policy in other
>>>> > repositories.
>>>> I see.
>>>>
>>>> >
>>>> >>
>>>> >> > If some of the content lived in Squid and some lived in Pulp, it
>>>> >> > would be difficult for the user to know what content is actually
>>>> >> > available in Pulp and what content needs to be fetched from a
>>>> >> > remote repository.
>>>> >>
>>>> >> I'd say the rule of thumb would be: lazy -> squid, regular -> pulp,
>>>> >> so not that difficult.
>>>> >> Maybe Pulp could have a concept of Origin, where folks upload stuff
>>>> >> to a Pulp repo, vs. Proxy for its repo storage policy?
>>>> >>
>>>> >
>>>> > Squid removes things from the cache at some point. You can probably
>>>> > configure it to never remove anything from the cache, but then we
>>>> > would need to implement orphan cleanup that would work across two
>>>> > systems: pulp and squid.
>>>>
>>>> Actually "remote" units wouldn't need orphan cleaning from the disk,
>>>> just dropping them from the DB would suffice.
>>>>
>>>> >
>>>> > Answering that question would still be difficult. Not all content
>>>> > that is in the repository that was synced using the on_demand download
>>>> > policy will be in Squid - only the content that has been requested by
>>>> > clients. So it's still hard to know which of the content units have
>>>> > been downloaded and which have not.
>>>>
>>>> But the beauty is exactly in that: we don't have to track whether the
>>>> content is downloaded if it is reverse-proxied[1][2].
>>>> Moreover, this would work both with and without a proxy between Pulp
>>>> and the Origin of the remote unit.
>>>> A "remote" content artifact might just need to carry its URL in a DB
>>>> column for this to work; so the async artifact model, instead of the
>>>> "policy=on-demand" attribute, would have a mandatory remote "URL"
>>>> attribute. I wouldn't say that's more complex than tracking the
>>>> "policy" attribute.
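>>>>
>>>> A minimal sketch of that model idea (hypothetical class and field
>>>> names, assuming Django, which Pulp3 is built on):
>>>>
>>>>     from django.db import models
>>>>
>>>>     class RemoteArtifact(models.Model):
>>>>         # Mandatory origin URL; serving always reverse-proxies from
>>>>         # here, so no "policy" or downloaded-state tracking is needed.
>>>>         url = models.TextField()
>>>>         # Expected digest so proxied bits can still be validated.
>>>>         sha256 = models.CharField(max_length=64, null=True)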
>>>>
>>>> >
>>>> >
>>>> >>
>>>> >> >
>>>> >> > As Pulp downloads an Artifact, it calculates all the checksums and
>>>> >> > its size. It then performs validation based on information that was
>>>> >> > provided by the RemoteArtifact. After validation is performed, the
>>>> >> > Artifact is saved to the database and to its final place in
>>>> >> > /var/lib/content/artifacts/.
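>>>> >> >
>>>> >> > For illustration, a minimal sketch of that validate-while-downloading
>>>> >> > step (a hypothetical helper, not Pulp's actual downloader code):
>>>> >> >
>>>> >> >     import hashlib
>>>> >> >
>>>> >> >     def validate_stream(chunks, expected_sha256, expected_size):
>>>> >> >         # Hash and count bytes while streaming, then compare with
>>>> >> >         # the expectations carried by the RemoteArtifact.
>>>> >> >         digest, size = hashlib.sha256(), 0
>>>> >> >         for chunk in chunks:
>>>> >> >             digest.update(chunk)
>>>> >> >             size += len(chunk)
>>>> >> >             yield chunk
>>>> >> >         if digest.hexdigest() != expected_sha256 or size != expected_size:
>>>> >> >             raise ValueError('artifact failed validation')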
>>>> >>
>>>> >> This could still be achieved by storing the content just temporarily
>>>> >> in the Squid proxy, i.e. using Squid as the content source, not the disk.
>>>> >>
>>>> >> > Once this information is in the database, Pulp's web server can
>>>> >> > serve the content without having to involve the Streamer or Squid.
>>>> >>
>>>> >> Pulp might serve just the API and the metadata; the content might be
>>>> >> redirected to the Proxy all the time, correct?
>>>> >> Doesn't Crane do that, btw?
>>>> >
>>>> >
>>>> > Theoretically we could do this, but in practice we would run into
>>>> > problems when we needed to scale out the Content app. Right now, when
>>>> > the Content app needs to be scaled, a user can launch another machine
>>>> > that will run the Content app. Squid does not support that kind of
>>>> > scaling; it can only take advantage of additional cores in a single
>>>> > machine.
>>>>
>>>> I don't think I understand; proxies are actually designed to scale[1]
>>>> and are used as tools to scale the web too.
>>>>
>>>> This is all about the How question, but when it comes to my original
>>>> Why (please correct me if I'm wrong), the answer so far has been:
>>>> Pulp always downloads the content because that's what it is supposed
>>>> to do.
>>>>
>>>> Cheers,
>>>> milan
>>>>
>>>> [1] https://en.wikipedia.org/wiki/Reverse_proxy
>>>> [2] https://paste.fedoraproject.org/paste/zkBTyxZjm330FsqvPP0lIA
>>>> [3] https://wiki.squid-cache.org/Features/CacheHierarchy?highlight=%28faqlisted.yes%29
>>>>
>>>> >
>>>> >>
>>>> >>
>>>> >> Cheers,
>>>> >> milan
>>>> >>
>>>> >> >
>>>> >> > -dennis
>>>> >> >
>>>> >> >
>>>> >> >
>>>> >> >
>>>> >> >
>>>> >> >>
>>>> >> >>
>>>> >> >> --
>>>> >> >> cheers
>>>> >> >> milan
>>>> >> >>
>>>> >> >> On Tue, May 29, 2018 at 4:25 PM, Brian Bouterse <bbouters at redhat.com> wrote:
>>>> >> >> >
>>>> >> >> > On Mon, May 28, 2018 at 9:57 AM, Milan Kovacik <mkovacik at redhat.com> wrote:
>>>> >> >> >>
>>>> >> >> >> Hi,
>>>> >> >> >>
>>>> >> >> >> Looking at the diagram[1] I'm wondering what's the reasoning
>>>> >> >> >> behind Pulp having to actually fetch the content locally?
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > Is the question "why is Pulp doing the fetching and not Squid?" or
>>>> >> >> > "why is Pulp storing the content after fetching it?" or both?
>>>> >> >> >
>>>> >> >> >> Couldn't Pulp just rely on the proxy with regard to the content
>>>> >> >> >> streaming?
>>>> >> >> >>
>>>> >> >> >> Thanks,
>>>> >> >> >> milan
>>>> >> >> >>
>>>> >> >> >>
>>>> >> >> >> [1] https://pulp.plan.io/attachments/130957
>>>> >> >> >>
>>>> >> >> >> On Fri, May 25, 2018 at 9:11 PM, Brian Bouterse <bbouters at redhat.com> wrote:
>>>> >> >> >> > A mini-team of core devs** met to talk through lazy use cases
>>>> >> >> >> > for Pulp3. It's effectively the same lazy from Pulp2 except:
>>>> >> >> >> >
>>>> >> >> >> > * it's now built into core (not just RPM)
>>>> >> >> >> > * it excludes repo protection use cases, because we haven't
>>>> >> >> >> > added repo protection to Pulp3 yet
>>>> >> >> >> > * it excludes the "background" policy, which, based on
>>>> >> >> >> > feedback from stakeholders, provided very little value
>>>> >> >> >> > * it will no longer depend on Twisted; it will use asyncio
>>>> >> >> >> > instead.
>>>> >> >> >> >
>>>> >> >> >> > While it is being built into core, it will require minimal
>>>> >> >> >> > work by a plugin writer to add support for it. Details are in
>>>> >> >> >> > the epic below.
>>>> >> >> >> >
>>>> >> >> >> > The current use cases, along with a technical plan, are
>>>> >> >> >> > written up in this epic:
>>>> >> >> >> > https://pulp.plan.io/issues/3693
>>>> >> >> >> >
>>>> >> >> >> > We're putting it out for comments, questions, and feedback
>>>> >> >> >> > before we start into the code. I hope we are able to add this
>>>> >> >> >> > to our next sprint.
>>>> >> >> >> >
>>>> >> >> >> > ** ipanova, jortel, ttereshc, dkliban, bmbouter
>>>> >> >> >> >
>>>> >> >> >> > Thanks!
>>>> >> >> >> > Brian
>>>> >> >> >> >
>>>> >> >> >> >
>>>> >> >> >> >
>>>> >> >> >
>>>> >> >> >
>>>> >> >>
>>>> >> >
>>>> >> >
>>>> >
>>>> >
>>>>
>>>>
>>>
>>>
>>>
>>>
>>
>