[Pulp-dev] pulp3: Publishing Proposal

Wed Jun 28 19:53:02 UTC 2017

I'm generally a big believer in this direction, as many of you know. :) I
think it is achievable, and from a plugin writer's perspective it would be
very similar to what plugins do today. Whereas in Pulp 2 a plugin creates a
symlink on disk, in Pulp 3 it would add an entry to a database table with
nearly the same information.

More thoughts in-line.

On Wed, Jun 28, 2017 at 2:27 PM, Jeff Ortel <jortel at redhat.com> wrote:

> I have been doing some thinking about pulp3 publishing with the following
> goals in mind:
>
> - Eliminate symlinks.
> - Eliminate need for each plugin to have its own Apache conf.
> - Prevent orphaned content that is still published from being deleted.
>
> The main concept is to store the relationship between an artifact and a
> URL in the DB instead of using the filesystem.  A `Publication` is created
> (and owned) by a publisher.  Each `Publication` is composed of (linked to)
> many `artifacts`.  The linkage contains the path component of the URL
> which is used to locate the artifact referenced by a URL.
>

A Publication should also be associated with a repo version. This is how
we'll be able to know at publish time:

- did any content change? If not, skip the publish unless publisher config
changed...
- if content did change, what changes happened? When possible, do an
incremental publish based on this info.

I think this is also the most natural way for a user to reason about
whether any given publication is current, and if not, what differences it
has from the repo contents. It gives the best visibility into what content
they have, and what content is available to clients.
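
As a rough illustration of that publish-time decision, here is a minimal
sketch in Python. The `Publication` fields, the `config_digest` idea, and
the `plan_publish` helper are all hypothetical names made up for this
example, not an existing Pulp API. It also assumes that a changed publisher
config forces a full publish, which is only one possible policy.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Publication:
    # Hypothetical record of what a publication was built from.
    repo_version: int   # repo version that was published
    config_digest: str  # fingerprint of the publisher config used


def plan_publish(latest: Optional[Publication],
                 repo_version: int,
                 config_digest: str) -> str:
    """Return 'full', 'incremental', or 'skip' for the next publish."""
    if latest is None:
        return "full"            # nothing published yet
    if latest.config_digest != config_digest:
        return "full"            # assumption: config change => full publish
    if latest.repo_version == repo_version:
        return "skip"            # same content, same config
    return "incremental"         # content changed; diff the two versions
```

The same comparison of `latest.repo_version` against the current repo
version is what would tell a user whether a publication is current.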

>
> This covers artifacts as we know them today.  But what about files
> generated during publishing, A.K.A. metadata?  I propose that these files
> be stored as artifacts as well.  This requires an `Artifact` to be
> redefined slightly.  The definition would read more like:
>
>   "A file associated with either stored or published content".
>
> Or, it would be even more generic, like:
>
>   "A file contained within the pulp inventory that may be associated with
> a content (unit) or publication."
>

There is enough difference between a file that's part of a unit vs. a file
that Pulp created during a publish that I think they should be stored
separately. I recognize that the tables would be very similar, if not
identical, but I don't think we gain much from combining them.

In practice I don't think we expect that a file would ever appear both as
part of a Content unit and as something created by a publish task. They
come from two very different places, which gives them different properties.
Content likely has catalog entries, so those artifacts can be re-retrieved
at any point, even transparently from the client perspective. Publication
artifacts must be created by a publish task; if one is deleted, the whole
publication should be re-created. These differences impact how users may
back up their data, how replication may occur from one Pulp to another,
caching behavior, etc.

Signing is interesting to consider. We don't have a good plan yet for
supporting that, but we'll need it sooner rather than later. A user will
want to sign a specific publication, usually by signing the primary
metadata file.
PULP_MANIFEST is a good example where the same one could easily be produced
by multiple different publishers and repos that happen to contain the same
files. Think about katello's multi-org use cases for example. If that
manifest gets signed, we want the signature associated with this repo and
this publication only, and never to appear with a different repo that
happens to have the same content. So this signature needs an association
with the publication itself in addition to an association with the file
being signed. Maybe the signature itself is just another file associated
with the publication.

Here is another small but important detail. If we decide that an artifact
can be shared by multiple content units, then deleting it already requires
care: it may be removed only when no other content still references it.
There's a race here between checking for references and deleting. Maybe we
can overcome it, but it is very important to stay on top of. If we must
also check a second association type, so that an artifact may be associated
with content OR a publication, that makes the race more complex.
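
To make the race concrete, here is a minimal in-memory sketch in Python.
The `ArtifactRegistry` class and its method names are hypothetical, and a
`threading.Lock` stands in for whatever a real database would use (for
example a transaction with row locking). The point is only that the
reference check across both association types and the delete have to
happen as one atomic step.

```python
import threading


class ArtifactRegistry:
    """In-memory stand-in for the association tables (hypothetical)."""

    def __init__(self, artifacts):
        self._lock = threading.Lock()
        self.artifacts = set(artifacts)
        self.content_links = set()      # (content_id, artifact_id)
        self.publication_links = set()  # (publication_id, artifact_id)

    def link_to_publication(self, publication_id, artifact_id):
        with self._lock:
            # Linking fails if the artifact was already deleted.
            if artifact_id not in self.artifacts:
                raise KeyError(artifact_id)
            self.publication_links.add((publication_id, artifact_id))

    def delete_if_orphan(self, artifact_id):
        # Check both association types and delete under one lock, so no
        # new link can appear between the check and the delete.
        with self._lock:
            links = self.content_links | self.publication_links
            if any(a == artifact_id for _, a in links):
                return False
            self.artifacts.discard(artifact_id)
            return True
```

Without the shared lock, a publish could link an artifact just after the
orphan check passed and just before the file was removed.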

>
> In any case, the relationship to a content (unit) becomes optional.
>
> Publications are not user facing.  I think we can keep this as an internal
> core concept.  At least for the MVP.
>
> The /var/lib/pulp/published directory goes away.
>
> General Flows:
>
> Publishing: "The publisher will compose a publication"
>
> 1. Publisher creates a publication using the plugin API.
>

Does the publication have information about its base path, authorization,
etc? We've relied on the publisher for that sort of thing previously, but
maybe the publisher should use those settings as the defaults to impose
onto a publication. Wouldn't it be slick to promote a publication just by
changing the path it's made available at? Or by adding a second path it's
made available at?

Speaking of which, at some point I really want to disconnect the production
of a publication from the serving of it. A publication could be made
available several different ways via http (maybe several at the same time),
written to an ISO, rsync'd somewhere, torrented, actively pushed to some
other service, etc. There's already a huge demand for the ability to
publish once, and promote or otherwise interact with that published thing.
See for example the clone distributor that katello made for yum repos.
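
A minimal sketch of what "promote by changing the path" could look like,
in Python. The `Distributions` class, its methods, and the `zoo/dev` and
`zoo/prod` paths are hypothetical and exist only for illustration. The
assumption is that serving keeps its own small mapping from base path to
publication, separate from the publication itself.

```python
class Distributions:
    """Hypothetical mapping from a base path to a publication id."""

    def __init__(self):
        self._by_path = {}

    def serve(self, base_path, publication_id):
        self._by_path[base_path.strip("/")] = publication_id

    def promote(self, from_path, to_path):
        """Also serve at to_path whatever is served at from_path."""
        source = self._by_path[from_path.strip("/")]
        self._by_path[to_path.strip("/")] = source

    def resolve(self, url_path):
        """Return (publication_id, relative_path), or None if unserved.

        The longest matching base path wins.
        """
        parts = url_path.strip("/").split("/")
        for i in range(len(parts), 0, -1):
            base = "/".join(parts[:i])
            if base in self._by_path:
                return self._by_path[base], "/".join(parts[i:])
        return None
```

Promotion then touches only the mapping; nothing is re-published and no
files move.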

I'm worried about biting all of this off now. As you said, if it's possible
to just not expose this during the MVP, that might be best, and we can add
the additional concepts later. We should think them through up front,
though, to make sure we don't paint ourselves into a corner.

> 2. Publisher adds content artifacts to the publication.
> 3. Publisher generates some metadata files in the working dir.
> 4. Publisher adds the metadata files to the publication using the plugin
> API.  The artifacts can likely be created behind the scenes by the plugin
> API.
> 5. Publisher commits (publishes) the publication.  The plugin API ensures
> this is atomic.
>
> Client makes a GET request for content (or metadata):

> 1. Request is routed to the content (WSGI) application (just like in pulp2
> for RPM).
> 2. Query the `LinkedArtifact` table by URL path component to get the
> artifact.
> 3. Forward the artifact storage path to:
>    <not stored locally>
>        streamer
>    <stored locally>
>        x-send
>

We may want different cache behavior. Files associated with units should
not change, so they can be cached for a long time. Files produced by a
publish (PULP_MANIFEST, repomd.xml, etc.) can change at any time and should
perhaps not be cached at all. It'll be important to differentiate what type
of file is being returned.
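
For example, the content app could pick response headers by file type. A
minimal sketch in Python follows. The `cache_headers` function and the two
kind constants are hypothetical, and the specific `Cache-Control` values
are just one reasonable choice, not a decided policy.

```python
CONTENT = "content"    # file that belongs to a content unit
METADATA = "metadata"  # file generated by a publish


def cache_headers(kind):
    """Return response headers for the given kind of file."""
    if kind == CONTENT:
        # Unit files should not change, so let caches keep them a long time.
        return {"Cache-Control": "public, max-age=31536000, immutable"}
    if kind == METADATA:
        # Caches must revalidate before reuse; "no-store" would be the
        # stricter option if these should never be cached at all.
        return {"Cache-Control": "no-cache"}
    raise ValueError(f"unknown file kind: {kind!r}")
```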

> 4. Done.
>
>
> Tables:
> =============================
>
> Publication
>   id [PK]
>   publisher_id [FK]
>   created
>   schemes
>
> LinkedArtifact
>   id [PK]
>   publication_id [FK]
>   artifact_id [FK]
>   URL
>

I'd call this relative_path instead of URL
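
Here is a minimal sketch of the tables with that rename, using Python's
`sqlite3` module and the example rows quoted further down. The column
names other than those in the proposal, the `UNIQUE` constraint, and the
`storage_path_for` helper are assumptions made for this example. It also
assumes `relative_path` is relative to the `/pulp/published/(http|https)/`
prefix.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE artifact (
        id TEXT PRIMARY KEY,
        storage_path TEXT NOT NULL
    );
    CREATE TABLE publication (
        id TEXT PRIMARY KEY,
        publisher_id TEXT NOT NULL
    );
    CREATE TABLE linked_artifact (
        id INTEGER PRIMARY KEY,
        publication_id TEXT NOT NULL REFERENCES publication(id),
        artifact_id TEXT NOT NULL REFERENCES artifact(id),
        relative_path TEXT NOT NULL,
        UNIQUE (publication_id, relative_path)  -- assumption
    );
    """
)
conn.executemany("INSERT INTO artifact VALUES (?, ?)", [
    ("artifact-1", "/var/lib/pulp/artifact/ff/9f373839d0/manifest"),
    ("artifact-2", "/var/lib/pulp/artifact/b1/37b64a8c83/tiger.img"),
])
conn.execute("INSERT INTO publication VALUES (?, ?)",
             ("publication-1", "publisher-1"))
conn.executemany(
    "INSERT INTO linked_artifact (publication_id, artifact_id, relative_path)"
    " VALUES (?, ?, ?)",
    [
        ("publication-1", "artifact-1", "zoo/md/manifest"),
        ("publication-1", "artifact-2", "zoo/images/tiger.img"),
    ],
)


def storage_path_for(conn, publication_id, relative_path):
    """Look up where the artifact behind a published path is stored."""
    row = conn.execute(
        "SELECT a.storage_path FROM linked_artifact la"
        " JOIN artifact a ON a.id = la.artifact_id"
        " WHERE la.publication_id = ? AND la.relative_path = ?",
        (publication_id, relative_path),
    ).fetchone()
    return row[0] if row else None
```

The lookup is the database equivalent of following a symlink from the
published tree to the stored file.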

>
>
> Example Data:
> ==============================
>
> Publisher:
> ----------------
> publisher-1, ...
>
>
> Artifact:
> ----------------
> artifact-1, /var/lib/pulp/artifact/ff/9f373839d0/manifest
> artifact-2, /var/lib/pulp/artifact/b1/37b64a8c83/tiger.img
>
>
> Publication:
> ----------------
> publication-1, publisher-1, 6-1-2017,..
>
>
> LinkedArtifact:
> ----------------
> <id>, publication-1, artifact-1, /pulp/published/http/zoo/md/manifest
> <id>, publication-1, artifact-2, /pulp/published/http/zoo/images/tiger.img
>
>
> URLs would be: /pulp/published/(http|https)/<path>
>
> I think the core can have a single Apache configuration that defines 2
> directories.  One is HTTPS protected by SSL/entitlement and the other is
> plain HTTP.
>

We should also have the ability to serve a publication over HTTPS without
entitlement enforcement. Auth is a separate layer on top of SSL, and we
should also prepare ourselves to think about protecting published data
with other kinds of auth besides just client SSL certs.
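
One way to picture the two layers is the minimal Python sketch below. The
`Request` and `ServingPolicy` classes and the `entitlement_cert` check are
hypothetical names for illustration only. The assumption is that the
scheme and the authorizer are configured independently, so HTTPS with no
authorizer is a valid combination and other auth checks can be plugged in
later.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional, Set


@dataclass
class Request:
    scheme: str                         # "http" or "https"
    has_valid_client_cert: bool = False


@dataclass
class ServingPolicy:
    # Transport layer: which schemes the publication is served over.
    schemes: Set[str] = field(default_factory=lambda: {"https"})
    # Auth layer: optional check, independent of the scheme.
    authorizer: Optional[Callable[[Request], bool]] = None

    def allows(self, request):
        if request.scheme not in self.schemes:
            return False
        return self.authorizer is None or self.authorizer(request)


def entitlement_cert(request):
    """Example authorizer: require a valid client SSL certificate."""
    return request.has_valid_client_cert
```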

>
>
> Thoughts/Comments?

Thanks for starting this conversation!

-- 

Michael Hrivnak

Principal Software Engineer, RHCE

Red Hat