<div dir="ltr">One additional thought: we also need to think about users who today depend on the ability to serve static files sitting on a web server, be that httpd or some third-party CDN service. How would we enable them to serve this published content?<div><br></div><div>I'm sure we can come up with something, but it's an important use case we must address somehow.</div></div><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Jun 28, 2017 at 3:53 PM, Michael Hrivnak <span dir="ltr">&lt;<a href="mailto:mhrivnak@redhat.com" target="_blank">mhrivnak@redhat.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">I'm generally a big believer in this direction, as many of you know. :) I think it is achievable, and from a plugin writer's perspective, it would be very similar to what they do today. Whereas in Pulp 2 a plugin creates a symlink on disk, in Pulp 3 it would add an entry to a database table with nearly the same information.<div><br></div><div>More thoughts in-line.</div><div class="gmail_extra"><br><div class="gmail_quote"><span class="">On Wed, Jun 28, 2017 at 2:27 PM, Jeff Ortel <span dir="ltr">&lt;<a href="mailto:jortel@redhat.com" target="_blank">jortel@redhat.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">I have been doing some thinking about pulp3 publishing with the following goals in mind:<br> <br> - Eliminate symlinks.<br> - Eliminate the need for each plugin to have its own Apache conf.<br> - Prevent orphaned content that is still published from being deleted.<br> <br> The main concept is to store the relationship between an artifact and a URL in the DB instead of using the<br> filesystem. A `Publication` is created (and owned) by a publisher. Each `Publication` is composed of (linked<br> to) many `artifacts`. 
The linkage contains the path component of the URL which is used to locate the artifact<br> referenced by a URL.<br></blockquote><div><br></div></span><div>A Publication should also be associated with a repo version. This is how we'll be able to know at publish time:</div><div><br></div><div>- Did any content change? If not, skip the publish unless the publisher config changed...</div><div>- If content did change, what changes happened? When possible, do an incremental publish based on this info.</div><div><br></div><div>I think this is also the most natural way for a user to reason about whether any given publication is current, and if not, what differences it has from the repo contents. It gives the best visibility into what content they have, and what content is available to clients.</div><span class=""><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> <br> This covers artifacts as we know them today. But what about files generated during publishing, a.k.a.<br> metadata? I propose that these files be stored as artifacts as well. This requires an `Artifact` to be<br> redefined slightly. The definition would read more like:<br> <br> "A file associated with either stored or published content".<br> <br> Or, it would be even more generic, like:<br> <br> "A file contained within the pulp inventory that may be associated with a content (unit) or publication."<br></blockquote><div><br></div></span><div>There is enough difference between a file that's part of a unit vs. a file that Pulp created during a publish that I think they should be stored separately. I recognize that the tables would be very similar, if not identical, but I don't think we gain much from combining them.</div><div><br></div><div>In practice I don't think we expect that a file would ever appear both as part of a Content unit and as something created by a publish task. 
They come from two very different places, which gives them different properties. Content likely has catalog entries, so those artifacts can be re-retrieved at any point, even transparently from the client perspective. Publication artifacts must be created by a publish task; if one is deleted, the whole publication should be re-created. These differences impact how users may back up their data, how replication may occur from one Pulp to another, caching behavior, etc.</div><div><br></div><div>Signing is interesting to consider. We don't have a good plan yet for supporting that, but we'll need it sooner rather than later. A user will want to sign a specific publication, usually by signing the primary metadata file. PULP_MANIFEST is a good example where the same one could easily be produced by multiple different publishers and repos that happen to contain the same files. Think about katello's multi-org use cases, for example. If that manifest gets signed, we want the signature associated with this repo and this publication only, and never to appear with a different repo that happens to have the same content. So this signature needs an association with the publication itself in addition to an association with the file being signed. Maybe the signature itself is just another file associated with the publication.</div><div><br></div><div>Here is another small detail, but an important one. If we decide that an artifact can be shared by multiple content units, we're already getting into territory where deleting that artifact must be done with care, and only if it is not associated with any other content. There's a race here that maybe we can overcome, but it is very important to stay on top of. 
If we also must check a second association type to see if an artifact is associated with content OR a publication, that makes the race more complex.</div><span class=""><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> <br> In any case, the relationship to a content (unit) becomes optional.<br> <br> Publications are not user-facing. I think we can keep this as an internal core concept, at least for the MVP.<br> <br> The /var/lib/pulp/published directory goes away.<br> <br> General Flows:<br> <br> Publishing: "The publisher will compose a publication"<br> <br> 1. Publisher creates a publication using the plugin API.<br></blockquote><div><br></div></span><div>Does the publication have information about its base path, authorization, etc.? We've relied on the publisher for that sort of thing previously, but maybe the publisher should use those settings as the defaults to impose onto a publication. Wouldn't it be slick to promote a publication just by changing the path it's made available at? Or add a second path it's made available at...</div><div><br></div><div>Speaking of which, at some point I really want to disconnect the production of a publication from the serving of it. A publication could be made available several different ways via http (maybe several at the same time), written to an ISO, rsync'd somewhere, torrented, actively pushed to some other service, etc. There's already a huge demand for the ability to publish once, and promote or otherwise interact with that published thing. See for example the clone distributor that katello made for yum repos.</div><div><br></div><div>I'm worried about biting all of this off now. As you said, if it's possible to just not expose this during the MVP, that might be best, so that we can add on all the additional concepts later. 
We should think through them up front, though, to make sure we don't paint ourselves into a corner.</div><span class=""><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> 2. Publisher adds content artifacts to the publication.<br> 3. Publisher generates some metadata files in the working dir.<br> 4. Publisher adds the metadata files to the publication using the plugin API. The artifacts can likely be<br> created behind the scenes by the plugin API.<br> 5. Publisher commits (publishes) the publication. The plugin API ensures this is atomic.<br> <br> Client makes a GET request for content (or metadata):</blockquote><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> <br> 1. Request is routed to the content (WSGI) application (just like in pulp2 for RPM).<br> 2. Query the `LinkedArtifact` table by URL path component to get the artifact.<br> 3. Forward the artifact storage path to:<br> &lt;not stored locally&gt;<br> streamer<br> &lt;stored locally&gt;<br> x-send<br></blockquote><div><br></div></span><div>We may want different cache behavior. Files associated with units should not change, so they can be cached for a long time. Files produced by a publish (PULP_MANIFEST, repomd.xml, etc.) can change at any time and should perhaps not be cached at all. It'll be important to differentiate what type of file is being returned.</div><span class=""><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> 4. 
Done.<br> <br> <br> Tables:<br> =============================<br> <br> Publication<br> id [PK]<br> publisher_id [FK]<br> created<br> schemes<br> <br> LinkedArtifact<br> id [PK]<br> publication_id [FK]<br> artifact_id [FK]<br> URL<br></blockquote><div><br></div></span><div>I'd call this relative_path instead of URL.</div><span class=""><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> <br> <br> Example Data:<br> ==============================<br> <br> Publisher:<br> ----------------<br> publisher-1, ...<br> <br> <br> Artifact:<br> ----------------<br> artifact-1, /var/lib/pulp/artifact/ff/9f37<wbr>3839d0/manifest<br> artifact-2, /var/lib/pulp/artifact/b1/37b6<wbr>4a8c83/tiger.img<br> <br> <br> Publication:<br> ----------------<br> publication-1, publisher-1, 6-1-2017,..<br> <br> <br> LinkedArtifact:<br> ----------------<br> &lt;id&gt;, publication-1, artifact-1, /pulp/published/http/zoo/md/ma<wbr>nifest<br> &lt;id&gt;, publication-1, artifact-2, /pulp/published/http/zoo/image<wbr>s/tiger.img<br> <br> <br> URLs would be: /pulp/published/(http|https)/&lt;<wbr>path&gt;<br> <br> I think the core can have a single Apache configuration that defines 2 directories. One is HTTPS, protected by<br> SSL/entitlement, and the other is plain HTTP.<br></blockquote><div><br></div></span><div>We should also have the ability to serve a publication over HTTPS but without entitlement enforcement. Auth is a separate layer in addition to SSL, and we should also prepare ourselves to think about protecting published data with other kinds of auth besides just client SSL certs.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> <br> <br> Thoughts/Comments?</blockquote><div><br></div><div>Thanks for starting this conversation! 
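<div><br></div><div>To make the table discussion above a bit more concrete, here is a minimal sketch, under stated assumptions, of how the content app lookup and the "is it safe to delete this artifact" check might look. It is not a proposed implementation. It uses SQLite purely for illustration and relative_path in place of URL; the storage_path column, the relative paths, the third (unlinked) artifact, and the helper names build_example_db, resolve, and can_delete_artifact are all hypothetical.</div>
<pre>
```python
import sqlite3

# Hypothetical schema loosely based on the tables proposed in this thread.
# storage_path and relative_path are illustrative names, not a decided design.
SCHEMA = """
CREATE TABLE artifact (id INTEGER PRIMARY KEY, storage_path TEXT NOT NULL);
CREATE TABLE publication (id INTEGER PRIMARY KEY, publisher_id INTEGER, created TEXT);
CREATE TABLE linked_artifact (
    id INTEGER PRIMARY KEY,
    publication_id INTEGER NOT NULL REFERENCES publication(id),
    artifact_id INTEGER NOT NULL REFERENCES artifact(id),
    relative_path TEXT NOT NULL,
    UNIQUE (publication_id, relative_path)
);
"""


def build_example_db():
    """Create an in-memory database holding the example data from the thread."""
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    db.executemany(
        "INSERT INTO artifact (id, storage_path) VALUES (?, ?)",
        [
            (1, "/var/lib/pulp/artifact/ff/9f373839d0/manifest"),
            (2, "/var/lib/pulp/artifact/b1/37b64a8c83/tiger.img"),
            # Hypothetical artifact that no publication links to.
            (3, "/var/lib/pulp/artifact/00/example/unlinked.img"),
        ],
    )
    db.execute(
        "INSERT INTO publication (id, publisher_id, created) VALUES (1, 1, '6-1-2017')"
    )
    db.executemany(
        "INSERT INTO linked_artifact (publication_id, artifact_id, relative_path) "
        "VALUES (1, ?, ?)",
        [(1, "zoo/md/manifest"), (2, "zoo/images/tiger.img")],
    )
    return db


def resolve(db, relative_path):
    """Return the storage path published at relative_path, or None if unknown."""
    row = db.execute(
        "SELECT a.storage_path FROM linked_artifact la "
        "JOIN artifact a ON a.id = la.artifact_id "
        "WHERE la.relative_path = ?",
        (relative_path,),
    ).fetchone()
    return row[0] if row else None


def can_delete_artifact(db, artifact_id):
    """Treat an artifact as deletable only when no publication links to it."""
    (count,) = db.execute(
        "SELECT COUNT(*) FROM linked_artifact WHERE artifact_id = ?",
        (artifact_id,),
    ).fetchone()
    return count == 0
```
</pre>
<div>In this sketch, resolve(db, "zoo/md/manifest") hands back the storage path that the content app would forward to the streamer or x-send, and can_delete_artifact only covers the publication side of the check. It does not address the race discussed earlier, which would need a transaction or lock around the check and the delete.</div>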
</div></div><span class="HOEnZb"><font color="#888888"><div><br></div>-- <br><div class="m_-2543836360092887816gmail_signature" data-smartmail="gmail_signature"><div dir="ltr"><p style="color:rgb(0,0,0);font-family:overpass-mono,monospace;font-size:10px;margin:0px!important;padding:0px!important"><span style="margin:0px!important;padding:0px!important">Michael</span> <span style="margin:0px!important;padding:0px!important">Hrivnak</span></p><p style="color:rgb(0,0,0);font-family:overpass-mono,monospace;font-size:10px;margin:0px!important;padding:0px!important"></p><span style="color:rgb(0,0,0);font-family:overpass-mono,monospace;font-size:10px;margin:0px!important;padding:0px!important"><span style="margin:0px!important;padding:0px!important">Principal Software Engineer</span><span style="margin:0px!important;padding:0px!important">, <span style="margin:0px!important;padding:0px!important">RHCE</span></span> </span><span style="color:rgb(0,0,0);font-family:overpass-mono,monospace;font-size:10px"></span><br style="color:rgb(0,0,0);font-family:overpass-mono,monospace;font-size:10px;margin:0px!important;padding:0px!important"><p style="color:rgb(0,0,0);font-family:overpass-mono,monospace;font-size:10px;margin:0px!important;padding:0px!important">Red Hat</p></div></div> </font></span></div></div> </blockquote></div> </div>