[Pulp-dev] Publish API for Plugin Writers (Pulp3)

Michael Hrivnak mhrivnak at redhat.com
Mon Apr 24 13:30:11 UTC 2017


For publish, a plugin writer needs the ability to:

- iterate through the units being published
- create new artifacts based on that iteration, or by any other method it
sees fit
- make each unit's files available at a specific path, either via HTTP or on
a file store (for example, Docker manifest files need to be served directly
by Crane)
- make each newly-created artifact available at a specific path, either via
HTTP or on a file store (for example, metadata files for Crane don't get
served via HTTP)
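
To make that list concrete, here is a rough sketch of what a plugin's
publish() could look like against that kind of API. Everything here is
hypothetical: the Publication recorder, FakeUnit, and the 'packages/' layout
are made-up stand-ins, loosely modeled on Jeff's strawman further down in
this thread, not a real Pulp interface.

```python
# Hypothetical sketch only: Publication, FakeUnit, and the 'packages/'
# layout are invented names, not part of any real Pulp plugin API.

class FakeUnit:
    """Stand-in for a content unit that has one file on disk."""
    def __init__(self, name, storage_path):
        self.name = name
        self.storage_path = storage_path


class Publication:
    """Toy recorder for what a plugin asks the platform to expose."""
    def __init__(self):
        self.associated = {}   # unit storage path -> relative publish path
        self.added = []        # newly created artifacts, by relative path

    def associate(self, unit, path):
        self.associated[unit.storage_path] = path

    def add(self, path):
        self.added.append(path)


def publish(units, publication):
    """Iterate the units, expose each file, then add generated metadata."""
    for unit in units:
        publication.associate(unit, 'packages/' + unit.name)
    # a new artifact created from the iteration (e.g. repo metadata)
    publication.add('repodata/primary.xml')
    return publication
```

The point of the shape is that the plugin only declares relative paths; how
the platform realizes them (symlinks, copies, database records) stays hidden.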

Optimizations in Pulp 2 further allow a plugin writer to read artifacts
created by a previous publication. For example, the rpm plugin uses this to
quickly add a few entries to an XML file instead of completely re-creating
it. This may not strictly be required for the MVP, but its absence would
likely create a substantial performance regression. Similarly, this
requires the ability to determine which units have been added and removed
since the last publish. See versioned repos below...
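
If each publication records the set of unit identifiers it contained, the
delta an incremental publish needs falls out of two set differences. A
minimal sketch; the recorded unit-id sets themselves are an assumption, not
something Pulp 2 or the MVP defines this way:

```python
def publish_delta(previous_units, current_units):
    """Units added and removed since the last publish, via set differences."""
    prev = set(previous_units)
    curr = set(current_units)
    return {
        'added': curr - prev,    # entries to append to existing metadata
        'removed': prev - curr,  # entries to drop from existing metadata
    }
```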

As for making copies of unit files, I think if Pulp did that for each
publish, it would become effectively unusable for a lot of users. At best,
it would double the required storage, but for many users it would be much
worse. It would also greatly increase the required time to perform a
publish. As such, I think the MVP should continue to store just one copy of
each unit, including its files, similar to Pulp 2. How those files are
referenced is an area we could definitely improve, though. From a plugin
writer's perspective, it should be enough to tell the platform "make file X
available at location Y", and not worry about whether copies, symlinks, or
any other referencing method is being employed.
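
One way the platform could back "make file X available at location Y" is a
symlink tree, which keeps a single stored copy per unit. This is only an
illustration of the idea; nothing in this thread commits to symlinks, and
make_available() and its arguments are invented names:

```python
# Illustrative only: one possible symlink-based realization of
# "make file X available at location Y". Not a committed design.
import os


def make_available(source, publish_root, rel_path):
    """Expose *source* at *rel_path* under *publish_root* without copying."""
    target = os.path.join(publish_root, rel_path)
    os.makedirs(os.path.dirname(target), exist_ok=True)
    if os.path.lexists(target):
        os.unlink(target)  # replace a stale link from a prior publish
    os.symlink(source, target)
    return target
```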

As for recording which units are available with a publication... If we
implement versioned repositories, then each repo version would be an
addressable and immutable object with references to units. A publication
would naturally then reference a repo version. How exactly we model the
repo versions could go several ways, but they all include a single
addressable object as far as I envision it. I promise I'll cook up a
specific proposal in the near future. ;)
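
For the sake of discussion, the shape described above might look like this:
a repo version as a small immutable, addressable object holding the full
unit set, with new versions derived rather than mutated, and a publication
simply pointing at one version. All names and fields here are speculative,
ahead of the actual proposal:

```python
# Speculative model sketch; field names are made up for illustration.
from dataclasses import dataclass


@dataclass(frozen=True)
class RepoVersion:
    """An addressable, immutable snapshot of a repository's content."""
    repo: str
    number: int
    units: frozenset  # unit ids in this version; a publication would
                      # reference exactly one RepoVersion


def next_version(current, add=(), remove=()):
    """Derive a new version; the old one is never modified."""
    units = (current.units | frozenset(add)) - frozenset(remove)
    return RepoVersion(current.repo, current.number + 1, units)
```

Because versions are immutable, the added/removed delta between any two of
them is always computable, which is what incremental publish needs.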



On Mon, Apr 24, 2017 at 7:31 AM, Mihai Ibanescu <mihai.ibanescu at gmail.com>
wrote:

> Jeff,
>
> A few comments to your strawman:
>
> * What is an artifact? If it is a database model, then why not call it a
> unit if that's what it's called everywhere else in the code?
> * How would you deal with metadata-only units that don't have a file
> representation, but show up in some kind of metadata (e.g. package groups /
> errata). associate() doesn't seem to give me that.
> * For that matter, how would you deal with files that are not
> representations of units, but new artifacts? (e.g. repomd.xml and the
> like). It feels like it could be possible by extending my commit() with
> writing the metadata and then calling the parent class' commit() (which
> does the atomic publish), but I think that's not pretty.
>
>
> On Fri, Apr 21, 2017 at 5:09 PM, Jeff Ortel <jortel at redhat.com> wrote:
>
>> I like this Brian and want to take it one step further.  I think there is
>> value in abstracting how a
>> publication is composed.  Files like metadata need to be composed by the
>> publisher (as needed) in the
>> working_dir then "added" to the publication.  Artifacts could be
>> "associated" to the publication and the
>> platform determines how this happens (symlinks/in the DB).
>>
>> Assuming the Publisher is instantiated with a 'working_dir' attribute.
>>
>> ---------------------------------------
>>
>> Something like this to kick around:
>>
>>
>> class Publication:
>>     """
>>     The Publication provided by the plugin API.
>>
>>     Examples:
>>
>>     A crude example with lots of hand waving.
>>
>>     In Publisher.publish()
>>
>>     >>>
>>     >>> publication = Publication(self.working_dir)
>>     >>>
>>     >>> # Artifacts
>>     >>> for artifact in []: # artifacts
>>     >>>     path = '<determine relative path>'
>>     >>>     publication.associate(artifact, path)
>>     >>>
>>     >>> # Metadata created in self.staging_dir <here>.
>>     >>>
>>     >>> publication.add('repodata/primary.xml')
>>     >>> publication.add('repodata/others.xml')
>>     >>> publication.add('repodata/repomd.xml')
>>     >>>
>>     >>> # - OR -
>>     >>>
>>     >>> publication.add('repodata/')
>>     >>>
>>     >>> publication.commit()
>>     """
>>
>>     def __init__(self, staging_dir):
>>         """
>>         Args:
>>             staging_dir: Absolute path to where publication is staged.
>>         """
>>         self.staging_dir = staging_dir
>>
>>     def associate(self, artifact, path):
>>         """
>>         Associate an artifact to the publication.
>>         This could result in creating a symlink in the staging directory
>>         or (later) creating a record in the db.
>>
>>         Args:
>>             artifact: A content artifact
>>             path: Relative path within the staging directory AND
>> eventually
>>                   within the published URL.
>>         """
>>
>>     def add(self, path):
>>         """
>>         Add a file within the staging directory to the publication by
>> relative path.
>>
>>         Args:
>>             path: Relative path within the staging directory AND
>> eventually within
>>                   the published URL.  When *path* is a directory, all
>> files
>>                   within the directory are added.
>>         """
>>
>>     def commit(self):
>>         """
>>         When committed, the publication is atomically published.
>>         """
>>         # atomic magic
>>
>>
>>
>>
>>
>> On 04/19/2017 10:16 AM, Brian Bouterse wrote:
>> > I was thinking about the design here and I wanted to share some
>> thoughts.
>> >
>> > For the MVP, I think a publisher implemented by a plugin developer
>> would write all files into the working
>> > directory and the platform would "atomically publish" that data into the
>> location configured by the repository.
>> > The "atomic publish" aspect would copy/stage the files in a permanent
>> location but would use a single symlink
>> > to the top level folder to go live with the data. This would make
>> atomic publication the default behavior.
>> > This runs after the publish() implemented by the plugin developer
>> returns, after it has written all of its
>> > data to the working dir.
>> >
>> > Note that ^ allows for the plugin writer to write the actual contents
>> of files in the working directory
>> > instead of symlinks, causing Pulp to duplicate all content on disk with
>> every publish. That would be an
>> > incredibly inefficient way to write a plugin but it's something the
>> platform would not prevent in any explicit
>> > way. I'm not sure if this is something we should improve on or not.
>> >
>> > At a later point, we could add in the incremental publish maybe as a
>> method on a Publisher called
>> > incremental_publish() which would only be called if the previous
>> publish only had units added.
>> >
>> >
>> >
>> > On Mon, Apr 17, 2017 at 4:22 PM, Brian Bouterse <bbouters at redhat.com
>> <mailto:bbouters at redhat.com>> wrote:
>> >
>> >     For plugin writers who are writing a publisher for Pulp3, what do
>> they need to handle during publishing
>> >     versus platform? To make a comparison against sync, the "Download
>> API" and "Changesets" [0] allows the
>> >     plugin writer to tell platform about a remote piece of content.
>> Then platform handles creating the unit,
>> >     fetching it, and saving it. Will there be a similar API to support
>> publishing to ease the burden of a
>> >     plugin writer? Also will this allow platform to have a structured
>> knowledge of a publication with Pulp3?
>> >
>> >     I wanted to try to characterize the problem statement as two
>> separate questions:
>> >
>> >     1) How will units be recorded to allow platform to know which units
>> comprise a specific publish?
>> >     2) What are plugin writer's needs at publish time, and what
>> repetitive tasks could be moved to platform?
>> >
>> >     As a quick recap of how Pulp 2 works: each publisher would write
>> files into the working directory and
>> >     then they would get moved into their permanent home. Also there was
>> the incrementalPublisher base machinery
>> >     which allowed for an additive publication which would use the
>> previous publish and was "faster". Finally
>> >     in Pulp 2, the only record of a publication is the symlinks on the
>> filesystem.
>> >
>> >     I have some of my own ideas on these things, but I'll start the
>> conversation.
>> >
>> >     [0]: https://github.com/pulp/pulp/pull/2876 <
>> https://github.com/pulp/pulp/pull/2876>
>> >
>> >     -Brian
>> >
>> >
>> >
>> >
>> > _______________________________________________
>> > Pulp-dev mailing list
>> > Pulp-dev at redhat.com
>> > https://www.redhat.com/mailman/listinfo/pulp-dev
>> >
>>
>>
>>
>>
>
>
>


-- 

Michael Hrivnak

Principal Software Engineer, RHCE

Red Hat

