[Pulp-dev] pulp3: Publishing Proposal

Wed Jun 28 20:52:25 UTC 2017

On 06/28/2017 02:53 PM, Michael Hrivnak wrote:
> I'm generally a big believer in this direction, as many of you know. :) I think it is achievable, and from a
> plugin writer perspective, would be very similar to what they do today. Whereas in Pulp 2 a plugin creates a
> symlink on disk, in Pulp 3 it would add an entry to a database table with nearly the same information.
> 
> More thoughts in-line.
> 
> On Wed, Jun 28, 2017 at 2:27 PM, Jeff Ortel <jortel at redhat.com> wrote:
> 
>     I have been doing some thinking about pulp3 publishing with the following goals in mind:
> 
>     - Eliminate symlinks.
>     - Eliminate need for each plugin to have its own Apache conf.
>     - Prevent orphaned content that is still published from being deleted.
> 
>     The main concept is to store the relationship between an artifact and a URL in the DB instead of using the
>     filesystem.  A `Publication` is created (and owned) by a publisher.  Each `Publication` is composed of (linked
>     to) many `artifacts`.  The linkage contains the path component of the URL which is used to locate the artifact
>     referenced by a URL.
> 
> 
> A Publication should also be associated with a repo version. This is how we'll be able to know at publish time:
> 
> - did any content change? If not, skip the publish unless publisher config changed...
> - if content did change, what changes happened? When possible, do an incremental publish based on this info.
> 
> I think this is also the most natural way for a user to reason about whether any given publication is current,
> and if not, what differences it has from the repo contents. It gives the best visibility into what content
> they have, and what content is available to clients.

Makes sense.
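
A rough sketch of how that publish-time decision could look, assuming the last publication remembers the repo version it was built from plus some digest of the publisher config. `PublicationRecord`, `plan_publish` and `config_digest` are made-up names for illustration, not an existing plugin API:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PublicationRecord:
    """Hypothetical summary of the most recent publication."""
    repo_version: int    # repo version the publication was built from
    config_digest: str   # digest of the publisher config used at that time


def plan_publish(last: Optional[PublicationRecord],
                 current_version: int,
                 config_digest: str) -> str:
    """Return 'full', 'incremental' or 'skip'."""
    if last is None or last.config_digest != config_digest:
        # Never published, or the publisher config changed: rebuild everything.
        return "full"
    if last.repo_version == current_version:
        # Same content and same config: nothing to do.
        return "skip"
    # Content changed; the difference between versions says what to add or remove.
    return "incremental"
```

Comparing `last.repo_version` with the current repo version would also answer the "is this publication current?" question for the user.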

>  
> 
> 
>     This covers artifacts as we know them today.  But what about files generated during publishing, a.k.a.
>     metadata?  I propose that these files be stored as artifacts as well.  This requires an `Artifact` to be
>     redefined slightly.  The definition would read more like:
> 
>       "A file associated with either stored or published content".
> 
>     Or, it would be even more generic, like:
> 
>       "A file contained within the pulp inventory that may be associated with a content (unit) or publication."
> 
> 
> There is enough difference between a file that's part of a unit vs. a file that Pulp created during a publish
> that I think they should be stored separately. I recognize that the tables would be very similar, if not
> identical, but I don't think we gain much from combining them.
> 
> In practice I don't think we expect that a file would ever appear both as part of a Content unit and as
> something created by a publish task. They come from two very different places, which gives them different
> properties. Content likely has catalog entries, so those artifacts can be re-retrieved at any point, even
> transparently from the client perspective. Publication artifacts must be created by a publish task; if one is
> deleted, the whole publication should be re-created. These differences impact how users may backup their data,
> how replication may occur from one Pulp to another, caching behavior, etc.

Good points.

I'd considered separate tables but wasn't convinced until now.

> 
> Signing is interesting to consider. We don't have a good plan yet for supporting that, but we'll need it
> sooner rather than later. A user will want to sign a specific publication, usually by signing the primary metadata
> file. PULP_MANIFEST is a good example where the same one could easily be produced by multiple different
> publishers and repos that happen to contain the same files. Think about katello's multi-org use cases for
> example. If that manifest gets signed, we want the signature associated with this repo and this publication
> only, and never to appear with a different repo that happens to have the same content. So this signature needs
> an association with the publication itself in addition to an association with the file being signed. Maybe the
> signature itself is just another file associated with the publication.
> 
> Here is another small detail, but an important one. If we decide that an artifact can be shared by multiple
> content units, we're already getting into territory where deleting that artifact must be done with care, and
> only if it is not associated with any other content. There's a race here that maybe we can overcome, but it
> is very important to stay on top of. If we must also check a second association type to see if an artifact is
> associated with content OR a publication, that makes the race more complex.
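
To make the two-association check concrete, here is a minimal sketch using sqlite3 from the standard library. The table and column names (`artifact`, `content_artifact`, `linked_artifact`) are assumptions made for the example, not the pulp3 schema. The existence checks and the delete are folded into one statement instead of a read followed by a separate delete; whether that is enough to close the race depends on the database and its isolation level.

```python
import sqlite3


def make_db():
    """Create an in-memory database with illustrative tables."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE artifact (id INTEGER PRIMARY KEY, path TEXT);
        CREATE TABLE content_artifact (content_id INTEGER, artifact_id INTEGER);
        CREATE TABLE linked_artifact (publication_id INTEGER, artifact_id INTEGER);
    """)
    return db


def delete_if_unreferenced(db, artifact_id):
    """Delete the artifact row only if no content unit and no publication
    references it. Returns True if a row was deleted."""
    with db:  # commits on success, rolls back on exception
        cur = db.execute("""
            DELETE FROM artifact
             WHERE id = :id
               AND NOT EXISTS (SELECT 1 FROM content_artifact WHERE artifact_id = :id)
               AND NOT EXISTS (SELECT 1 FROM linked_artifact WHERE artifact_id = :id)
        """, {"id": artifact_id})
    return cur.rowcount == 1
```

With separate tables for content files and publish-generated files, each delete would only need one of the two `NOT EXISTS` clauses.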
> 
>  
> 
> 
>     In any case, the relationship to a content (unit) becomes optional.
> 
>     Publications are not user facing.  I think we can keep this as an internal core concept.  At least for the
>     MVP.
> 
>     The /var/lib/pulp/published directory goes away.
> 
>     General Flows:
> 
>     Publishing: "The publisher will compose a publication"
> 
>     1. Publisher creates a publication using the plugin API.
> 
> 
> Does the publication have information about its base path, authorization, etc.? We've relied on the publisher
> for that sort of thing previously, but maybe the publisher should use those settings as the defaults to impose
> onto a publication. Wouldn't it be slick to promote a publication just by changing the path it's made
> available at? Or by adding a second path it's made available at...

I considered storing the base path in the Publication. But I don't see how the query using the /path/
component of the URL could be indexed if the path is split between the Publication and the LinkedArtifact.

Adding authorization information to the Publication sounds like a good idea.
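
To illustrate the indexing concern: if the base path lived on the Publication and only the remainder on the LinkedArtifact, the content app would not know where to cut an incoming path, so it would have to consider every possible split. A small sketch (`candidate_splits` is a made-up helper, purely illustrative):

```python
def candidate_splits(url_path):
    """Return every (base_path, relative_path) pair a split key could match."""
    parts = url_path.strip("/").split("/")
    return [("/" + "/".join(parts[:i]), "/".join(parts[i:]))
            for i in range(1, len(parts))]


splits = candidate_splits("/pulp/published/http/zoo/md/manifest")
# Five candidate pairs for a six-segment path, versus a single equality
# match when the full path is stored in one column.
```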

> 
> Speaking of which, at some point I really want to disconnect the production of a publication from the serving
> of it. A publication could be made available several different ways via http (maybe several at the same time),
> written to an ISO, rsync'd somewhere, torrented, actively pushed to some other service, etc. There's already a
> huge demand for the ability to publish once, and promote or otherwise interact with that published thing. See
> for example the clone distributor that katello made for yum repos.
> 
> I'm worried about biting all of this off now. As you said, if it's possible to just not expose this during the
> MVP, it might be best for us to add all the additional concepts later. We should think through them
> up-front though to make sure we don't paint ourselves into a corner.
>  
> 
>     2. Publisher adds content artifacts to the publication.
>     3. Publisher generates some metadata files in the working dir.
>     4. Publisher adds the metadata files to the publication using the plugin API.  The artifacts can likely be
>     created behind the scenes by the plugin API.
>     5. Publisher commits (publishes) the publication.  The plugin API ensures this is atomic.
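
The publishing steps above could be sketched roughly like this, using sqlite3 as a stand-in for the real database. `PublicationBuilder`, its methods, and the `url_path` column name are assumptions for illustration; the actual plugin API has not been designed. Metadata files are assumed to have been turned into artifacts already, so both kinds of file are added the same way. The point is step 5: the publication row and all of its links are written in one transaction, so a failure leaves nothing behind.

```python
import sqlite3

SCHEMA = """
CREATE TABLE publication (
    id INTEGER PRIMARY KEY,
    publisher_id TEXT,
    created TEXT
);
CREATE TABLE linked_artifact (
    id INTEGER PRIMARY KEY,
    publication_id INTEGER REFERENCES publication(id),
    artifact_id TEXT,
    url_path TEXT UNIQUE  -- full path component of the URL
);
"""


class PublicationBuilder:
    """Collects (artifact, path) links in memory and writes them atomically."""

    def __init__(self, db, publisher_id):
        self.db = db
        self.publisher_id = publisher_id
        self.links = []

    def add(self, artifact_id, url_path):
        # Steps 2 and 4: nothing touches the database yet.
        self.links.append((artifact_id, url_path))

    def commit(self):
        # Step 5: one transaction; rolled back as a whole on any error.
        with self.db:
            cur = self.db.execute(
                "INSERT INTO publication (publisher_id, created) "
                "VALUES (?, datetime('now'))",
                (self.publisher_id,))
            pub_id = cur.lastrowid
            self.db.executemany(
                "INSERT INTO linked_artifact (publication_id, artifact_id, url_path) "
                "VALUES (?, ?, ?)",
                [(pub_id, a, p) for a, p in self.links])
        return pub_id
```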
> 
>     Client makes a GET request for content (or metadata):
> 
> 
>     1. Request is routed to the content (WSGI) application (just like in pulp2 for RPM).
>     2. Query the `LinkedArtifact` table by URL path component to get the artifact.
>     3. Forward the artifact storage path to:
>        <not stored locally>
>            streamer
>        <stored locally>
>            x-send
> 
> 
> We may want different cache behavior. Files associated with units should not change, so they can be cached for
> a long time. Files produced by a publish (PULP_MANIFEST, repomd.xml, etc.) can change at any time and should
> perhaps not be cached at all. It'll be important to differentiate what type of file is being returned.
>  
> 
>     4. Done.
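
A minimal sketch of the GET flow, including the cache distinction raised above. The `Artifact` fields (`downloaded`, `generated`) and the Cache-Control values are assumptions chosen for the example; the lookup table stands in for the `LinkedArtifact` query.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Artifact:
    storage_path: str
    downloaded: bool   # False means the file is not stored locally
    generated: bool    # True for files produced by a publish (metadata)


def route(links: Dict[str, Artifact],
          url_path: str) -> Optional[Tuple[str, str, str]]:
    """Return (handler, storage_path, cache_control), or None for a 404."""
    artifact = links.get(url_path)  # step 2: lookup by URL path component
    if artifact is None:
        return None
    # Step 3: local files go to x-send, everything else to the streamer.
    handler = "x-send" if artifact.downloaded else "streamer"
    # Unit files should not change; publish-generated files can change anytime.
    cache = "no-cache" if artifact.generated else "max-age=86400"
    return handler, artifact.storage_path, cache


LINKS = {
    "/pulp/published/http/zoo/md/manifest": Artifact(
        "/var/lib/pulp/artifact/ff/9f373839d0/manifest", True, True),
    "/pulp/published/http/zoo/images/tiger.img": Artifact(
        "/var/lib/pulp/artifact/b1/37b64a8c83/tiger.img", False, False),
}
```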
> 
> 
>     Tables:
>     =============================
> 
>     Publication
>       id [PK]
>       publisher_id [FK]
>       created
>       schemes
> 
>     LinkedArtifact
>       id [PK]
>       publication_id [FK]
>       artifact_id [FK]
>       URL
> 
> 
> I'd call this relative_path instead of URL

This needs to be the full path component of the URL.  Agreed, URL isn't the most accurate name, but for the
purposes of conveying the idea I wanted to be sure it was clear that it supports URL matching.

>  
> 
> 
> 
>     Example Data:
>     ==============================
> 
>     Publisher:
>     ----------------
>     publisher-1, ...
> 
> 
>     Artifact:
>     ----------------
>     artifact-1, /var/lib/pulp/artifact/ff/9f373839d0/manifest
>     artifact-2, /var/lib/pulp/artifact/b1/37b64a8c83/tiger.img
> 
> 
>     Publication:
>     ----------------
>     publication-1, publisher-1, 6-1-2017,..
> 
> 
>     LinkedArtifact:
>     ----------------
>     <id>, publication-1, artifact-1, /pulp/published/http/zoo/md/manifest
>     <id>, publication-1, artifact-2, /pulp/published/http/zoo/images/tiger.img
> 
> 
>     URLs would be: /pulp/published/(http|https)/<path>
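
For illustration, that URL layout can be expressed as a single pattern. `PUBLISHED` and `split_published_url` are made-up names; this only shows how the scheme and the remaining path would be separated.

```python
import re

# /pulp/published/(http|https)/<path>
PUBLISHED = re.compile(r"^/pulp/published/(?P<scheme>https?)/(?P<path>.+)$")


def split_published_url(url_path):
    """Return (scheme, path) for a published URL path, or None if it doesn't match."""
    m = PUBLISHED.match(url_path)
    return (m.group("scheme"), m.group("path")) if m else None
```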
> 
>     I think the core can have a single Apache configuration that defines 2 directories.  One is HTTPS protected
>     by SSL/entitlement and the other is plain HTTP.
> 
> 
> We should also have the ability to serve a publication with https but not entitlement enforcement. Auth is a
> separate layer in addition to SSL, and we should also prepare ourselves to think about protecting published
> data with other kinds of auth besides just client SSL certs.
>  
> 
> 
> 
>     Thoughts/Comments?
> 
> 
> Thanks for starting this conversation! 
> 
> -- 
> 
> Michael Hrivnak
> 
> Principal Software Engineer, RHCE 
> 
> Red Hat
> 
