[Pulp-dev] pulp3: Publishing Proposal
Jeff Ortel
jortel at redhat.com
Wed Jun 28 20:52:25 UTC 2017
On 06/28/2017 02:53 PM, Michael Hrivnak wrote:
> I'm generally a big believer in this direction, as many of you know. :) I think it is achievable, and from a
> plugin writer perspective, would be very similar to what they do today. Whereas in Pulp 2 a plugin creates a
> symlink on disk, in Pulp 3 it would add an entry to a database table with nearly the same information.
>
> More thoughts in-line.
>
> On Wed, Jun 28, 2017 at 2:27 PM, Jeff Ortel <jortel at redhat.com> wrote:
>
> I have been doing some thinking about pulp3 publishing with the following goals in mind:
>
> - Eliminate symlinks.
> - Eliminate need for each plugin to have its own Apache conf.
> - Prevent orphaned content that is still published from being deleted.
>
> The main concept is to store the relationship between an artifact and a URL in the DB instead of using the
> filesystem. A `Publication` is created (and owned) by a publisher. Each `Publication` is composed of (linked
> to) many `artifacts`. The linkage contains the path component of the URL which is used to locate the artifact
> referenced by a URL.
>
>
> A Publication should also be associated with a repo version. This is how we'll be able to know at publish time:
>
> - did any content change? If not, skip the publish unless publisher config changed...
> - if content did change, what changes happened? When possible, do an incremental publish based on this info.
>
> I think this is also the most natural way for a user to reason about whether any given publication is current,
> and if not, what differences it has from the repo contents. It gives the best visibility into what content
> they have, and what content is available to clients.
Makes sense.
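A minimal sketch of that publish-time decision, assuming a publication records the repo version it was built from plus a digest of the publisher config. The names `Publication`, `repo_version`, `config_digest`, and `plan_publish` are hypothetical, not an existing Pulp API, and treating a config change as a full publish is my assumption:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Publication:
    # Hypothetical fields: the repo version this publication was built from
    # and a digest of the publisher config in effect at that time.
    repo_version: int
    config_digest: str


def plan_publish(last: Optional[Publication], current_version: int,
                 config_digest: str) -> str:
    """Return 'full', 'incremental', or 'skip' for the next publish."""
    if last is None:
        return "full"  # nothing has been published yet
    if last.config_digest != config_digest:
        return "full"  # assumption: a config change forces a full publish
    if last.repo_version == current_version:
        return "skip"  # same content, same config: nothing to do
    return "incremental"  # content changed: publish only the difference
```

For example, `plan_publish(Publication(3, "abc"), 4, "abc")` would choose an incremental publish.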
>
>
>
> This covers artifacts as we know them today. But what about files generated during publishing, a.k.a.
> metadata? I propose that these files be stored as artifacts as well. This requires an `Artifact` to be
> redefined slightly. The definition would read more like:
>
> "A file associated with either stored or published content".
>
> Or, it would be even more generic, like:
>
> "A file contained within the pulp inventory that may be associated with a content (unit) or publication."
>
>
> There is enough difference between a file that's part of a unit vs. a file that Pulp created during a publish
> that I think they should be stored separately. I recognize that the tables would be very similar, if not
> identical, but I don't think we gain much from combining them.
>
> In practice I don't think we expect that a file would ever appear both as part of a Content unit and as
> something created by a publish task. They come from two very different places, which gives them different
> properties. Content likely has catalog entries, so those artifacts can be re-retrieved at any point, even
> transparently from the client perspective. Publication artifacts must be created by a publish task; if one is
> deleted, the whole publication should be re-created. These differences impact how users may backup their data,
> how replication may occur from one Pulp to another, caching behavior, etc.
Good points.
I'd considered separate tables but wasn't convinced until now.
>
> Signing is interesting to consider. We don't have a good plan yet for supporting that, but we'll need it
> sooner rather than later. A user will want to sign a specific publication, usually by signing the primary metadata
> file. PULP_MANIFEST is a good example where the same one could easily be produced by multiple different
> publishers and repos that happen to contain the same files. Think about katello's multi-org use cases for
> example. If that manifest gets signed, we want the signature associated with this repo and this publication
> only, and never to appear with a different repo that happens to have the same content. So this signature needs
> an association with the publication itself in addition to an association with the file being signed. Maybe the
> signature itself is just another file associated with the publication.
>
> Here is another small detail, but an important one. If we decide that an artifact can be shared by multiple
> content units, we're already getting into territory where deleting that artifact must be done with care, and
> only if it is not associated with any other content. There's a race here that maybe we can overcome, but it is very
> important to stay on top of. If we also must check a second association type to see if an artifact is
> associated with content OR a publication, that makes the race more complex.
>
>
>
>
> In any case, the relationship to a content (unit) becomes optional.
>
> Publications are not user facing. I think we can keep this as an internal core concept. At least for the
> MVP.
>
> The /var/lib/pulp/published directory goes away.
>
> General Flows:
>
> Publishing: "The publisher will compose a publication"
>
> 1. Publisher creates a publication using the plugin API.
>
>
> Does the publication have information about its base path, authorization, etc? We've relied on the publisher
> for that sort of thing previously, but maybe the publisher should use those settings as the defaults to impose
> onto a publication. Wouldn't it be slick to promote a publication just by changing the path it's made
> available at? Or to add a second path it's made available at?
I considered storing the base path in the Publication. But I don't see how the query using the /path/
component of the URL could be indexed if the path is split between the Publication and the LinkedArtifact.
Adding authorization information to the Publication sounds like a good idea.
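To illustrate the indexing point, here is a rough sketch using SQLite and the example rows from the proposal. The table and column names (`linked_artifact`, `path`, `storage_path`) and the helper `storage_path_for` are hypothetical. With the full path component in one column, the lookup is a single equality match on an indexed column:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE artifact (id TEXT PRIMARY KEY, storage_path TEXT NOT NULL);
CREATE TABLE linked_artifact (
    id INTEGER PRIMARY KEY,
    publication_id TEXT NOT NULL,
    artifact_id TEXT NOT NULL REFERENCES artifact(id),
    path TEXT NOT NULL  -- full path component of the URL
);
-- One index covers the whole lookup because the path is not split.
CREATE INDEX linked_artifact_path ON linked_artifact (path);
""")
conn.executemany("INSERT INTO artifact VALUES (?, ?)", [
    ("artifact-1", "/var/lib/pulp/artifact/ff/9f373839d0/manifest"),
    ("artifact-2", "/var/lib/pulp/artifact/b1/37b64a8c83/tiger.img"),
])
conn.executemany(
    "INSERT INTO linked_artifact (publication_id, artifact_id, path) "
    "VALUES (?, ?, ?)",
    [
        ("publication-1", "artifact-1", "/pulp/published/http/zoo/md/manifest"),
        ("publication-1", "artifact-2",
         "/pulp/published/http/zoo/images/tiger.img"),
    ],
)


def storage_path_for(url_path):
    """Map the path component of a request URL to an artifact storage path."""
    row = conn.execute(
        "SELECT a.storage_path FROM linked_artifact AS la "
        "JOIN artifact AS a ON a.id = la.artifact_id "
        "WHERE la.path = ?",
        (url_path,),
    ).fetchone()
    return row[0] if row else None
```

If the base path lived on the Publication instead, the same lookup would have to match a concatenation of two columns from two tables, which a plain index like this one cannot serve.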
>
> Speaking of which, at some point I really want to disconnect the production of a publication from the serving
> of it. A publication could be made available several different ways via http (maybe several at the same time),
> written to an ISO, rsync'd somewhere, torrented, actively pushed to some other service, etc. There's already a
> huge demand for the ability to publish once, and promote or otherwise interact with that published thing. See
> for example the clone distributor that katello made for yum repos.
>
> I'm worried about biting all of this off now. As you said, if it's possible to just not expose this during the
> MVP, that might be best, and we can add all the additional concepts later. We should think through them
> up-front though to make sure we don't paint ourselves into a corner.
>
>
> 2. Publisher adds content artifacts to the publication.
> 3. Publisher generates some metadata files in the working dir.
> 4. Publisher adds the metadata files to the publication using the plugin API. The artifacts can likely be
> created behind the scenes by the plugin API.
> 5. Publisher commits (publishes) the publication. The plugin API ensures this is atomic.
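One way the atomic commit in step 5 could work is to write the publication and all of its links inside a single database transaction. This is only a sketch using SQLite; the schema and the `publish` helper are hypothetical, not the plugin API:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE publication (
    id INTEGER PRIMARY KEY,
    publisher_id TEXT NOT NULL
);
CREATE TABLE linked_artifact (
    id INTEGER PRIMARY KEY,
    publication_id INTEGER NOT NULL REFERENCES publication(id),
    artifact_id TEXT NOT NULL,
    path TEXT NOT NULL
);
""")


def publish(publisher_id, links):
    """Create a publication and all of its links in one transaction.

    `links` is a list of (artifact_id, path) pairs.
    """
    with conn:  # commits on success, rolls back if anything raises
        cur = conn.execute(
            "INSERT INTO publication (publisher_id) VALUES (?)",
            (publisher_id,),
        )
        publication_id = cur.lastrowid
        conn.executemany(
            "INSERT INTO linked_artifact (publication_id, artifact_id, path) "
            "VALUES (?, ?, ?)",
            [(publication_id, artifact_id, path)
             for artifact_id, path in links],
        )
    return publication_id
```

If any link fails to insert, the publication row is rolled back with it, so clients never see a partial publication.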
>
> Client makes a GET request for content (or metadata):
>
>
> 1. Request is routed to the content (WSGI) application (just like in pulp2 for RPM).
> 2. Query the `LinkedArtifact` table by URL path component to get the artifact.
> 3. Forward the artifact storage path to:
>      <not stored locally>  streamer
>      <stored locally>      x-send
>
>
> We may want different cache behavior. Files associated with units should not change, so they can be cached for
> a long time. Files produced by a publish (PULP_MANIFEST, repomd.xml, etc.) can change at any time and should
> perhaps not be cached at all. It'll be important to differentiate what type of file is being returned.
>
>
> 4. Done.
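Step 3 and the caching concern could be combined roughly as below. This is a sketch under assumptions: `Resolved`, `from_publish`, and `plan_response` are hypothetical names, checking the filesystem is only one possible way to decide "stored locally", and the Cache-Control values are illustrative choices:

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Resolved:
    storage_path: str
    from_publish: bool  # True for files generated by a publish (metadata)


def plan_response(resolved, exists=os.path.exists):
    """Pick a handler and a cache policy for a resolved artifact."""
    # Assumption: "stored locally" means the file exists at its storage path.
    handler = "x-send" if exists(resolved.storage_path) else "streamer"
    # Unit files should not change, so they can be cached for a long time;
    # files produced by a publish can change at any time.
    if resolved.from_publish:
        cache = "no-cache"
    else:
        cache = "public, max-age=31536000"
    return {"handler": handler, "Cache-Control": cache}
```

The `exists` parameter is injectable so the decision can be exercised without touching the real filesystem.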
>
>
> Tables:
> =============================
>
> Publication
>     id [PK]
>     publisher_id [FK]
>     created
>     schemes
>
> LinkedArtifact
>     id [PK]
>     publication_id [FK]
>     artifact_id [FK]
>     URL
>
>
> I'd call this relative_path instead of URL
This needs to be the full path component of the URL. Agreed, URL isn't the most accurate name, but for the
purposes of conveying the idea I wanted to be sure it was clear that it supports URL matching.
>
>
>
>
> Example Data:
> ==============================
>
> Publisher:
> ----------------
> publisher-1, ...
>
>
> Artifact:
> ----------------
> artifact-1, /var/lib/pulp/artifact/ff/9f373839d0/manifest
> artifact-2, /var/lib/pulp/artifact/b1/37b64a8c83/tiger.img
>
>
> Publication:
> ----------------
> publication-1, publisher-1, 6-1-2017,..
>
>
> LinkedArtifact:
> ----------------
> <id>, publication-1, artifact-1, /pulp/published/http/zoo/md/manifest
> <id>, publication-1, artifact-2, /pulp/published/http/zoo/images/tiger.img
>
>
> URLs would be: /pulp/published/(http|https)/<path>
>
> I think the core can have a single Apache configuration that defines 2 directories. One is HTTPS, protected by
> SSL/entitlement, and the other is plain HTTP.
>
>
> We should also have the ability to serve a publication with https but without entitlement enforcement. Auth is a
> separate layer in addition to SSL, and we should also prepare ourselves to think about protecting published
> data with other kinds of auth besides just client SSL certs.
>
>
>
>
> Thoughts/Comments?
>
>
> Thanks for starting this conversation!
>
> --
>
> Michael Hrivnak
>
> Principal Software Engineer, RHCE
>
> Red Hat
>