[Pulp-dev] Publish API for Plugin Writers (Pulp3)

Mihai Ibanescu mihai.ibanescu at gmail.com
Mon Apr 24 14:06:39 UTC 2017


Irrespective of the MVP proposal, but to address one of Michael's comments:
incremental repo creation is not simply for performance reasons. Because of
how yum clients work (and I can only attest to this empirically since I
have not read the code), yum repositories need to preserve a few
generations of the (potentially compressed) xml files referenced in
previous repomd.xml files. Otherwise, a yum client with a cached copy of
repomd.xml may ask for a primary.xml.gz that got removed by a new publish,
and things don't look pretty after a 404 on that. I think not including
this possibility in the MVP will result in a *functional* regression.
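
To make the "few generations" idea concrete, here is a minimal sketch (the
function names and the regex-based scan are mine, purely illustrative; real
tooling parses repomd.xml as XML): keeping old generations alive amounts to
taking the union of files referenced by the last N repomd.xml snapshots and
pruning only what falls outside that union.

```python
import re

def referenced_files(repomd_xml):
    # Collect the repodata hrefs a repomd.xml points at.
    # (Crude regex scan for illustration; real code would use an XML parser.)
    return set(re.findall(r'href="(repodata/[^"]+)"', repomd_xml))

def files_to_keep(repomd_generations, keep=3):
    # repomd_generations: repomd.xml contents, ordered oldest -> newest.
    # A client caching any of the last `keep` generations must still be able
    # to fetch every file that generation references, so keep the union;
    # anything else under repodata/ is safe to prune.
    keep_set = set()
    for repomd in repomd_generations[-keep:]:
        keep_set |= referenced_files(repomd)
    return keep_set
```

With two generations whose primary hrefs differ, files_to_keep(..., keep=2)
retains both files, so a client holding the older cached repomd.xml never
gets the 404 described above.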

But I absolutely love the idea of versioned repositories - see my attempt
to address that with the https://github.com/sassoftware/pulp-snapshot
distributor.

Michael, on your point number 4 - in pulp 2 I was under the impression that
the publisher is only responsible for creating a directory representation
of a pulp repository (in the case of the yum distributor, it's a directory
of a yum repository). Apache is responsible for serving that further, with
or without additional authentication. Are you suggesting more than this
behavior for pulp 3?
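
(For concreteness, the behavior I mean is roughly the sketch below; the
function and argument names are illustrative, not the actual yum distributor
code.)

```python
import os

def publish_symlink_tree(units, publish_root):
    # units: hypothetical mapping of relative publish path ->
    # absolute path of the unit's file in content storage.
    # Lay out one symlink per unit under publish_root; Apache then serves
    # publish_root directly, with or without auth layered in front.
    for rel_path, stored_path in units.items():
        dest = os.path.join(publish_root, rel_path)
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        if os.path.lexists(dest):
            os.remove(dest)  # replace a stale link from a prior publish
        os.symlink(stored_path, dest)
```

The point being: the publisher only builds this directory tree; everything
after that (serving, auth) is the web server's job.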

On Mon, Apr 24, 2017 at 9:30 AM, Michael Hrivnak <mhrivnak at redhat.com>
wrote:

> For publish, a plugin writer needs the ability to:
>
> - iterate through the units being published
> - create new artifacts based on that iteration, or any other method it
> sees fit
> - make each unit's files available at a specific path either via http or
> on a file store (for example, docker manifest files need to be served
> directly by crane)
> - make each newly-created artifact available at a specific path either via
> http or on a file store (for example, metadata files for crane don't get
> served via http)
>
> Optimizations in Pulp 2 further allow a plugin writer to read artifacts
> created by a previous publication. For example, the rpm plugin uses this to
> quickly add a few entries to an XML file instead of completely re-creating
> it. This may not strictly be required for the MVP, but its absence would
> likely create a substantial performance regression. Similarly, this
> requires the ability to determine which units have been added and removed
> since the last publish. See versioned repos below...
>
> As for making copies of unit files, I think if Pulp did that for each
> publish, it would become effectively unusable for a lot of users. At best,
> it would double the required storage, but for many users it would be much
> worse. It would also greatly increase the required time to perform a
> publish. As such, I think the MVP should continue to store just one copy of
> each unit, including its files, similar to Pulp 2. How those files are
> referenced is an area we could definitely improve though. From a plugin
> writer's perspective, it should be enough to tell the platform "make file X
> available at location Y", and not worry about whether copies, symlinks, or
> any other referencing method is being employed.
>
> As for recording which units are available with a publication... If we
> implement versioned repositories, then each repo version would be an
> addressable and immutable object with references to units. A publication
> would naturally then reference a repo version. How exactly we model the
> repo versions could go several ways, but they all include a single
> addressable object as far as I envision it. I promise I'll cook up a
> specific proposal in the near future. ;)
>
>
>
> On Mon, Apr 24, 2017 at 7:31 AM, Mihai Ibanescu <mihai.ibanescu at gmail.com>
> wrote:
>
>> Jeff,
>>
>> A few comments on your strawman:
>>
>> * What is an artifact? If it is a database model, then why not call it a
>> unit if that's what it's called everywhere else in the code?
>> * How would you deal with metadata-only units that don't have a file
>> representation, but show up in some kind of metadata (e.g. package groups /
>> errata). associate() doesn't seem to give me that.
>> * For that matter, how would you deal with files that are not
>> representations of units, but new artifacts? (e.g. repomd.xml and the
>> like). It feels like it could be done by extending my commit() with
>> writing the metadata and then calling the parent class' commit() (which
>> does the atomic publish), but I think that's not pretty.
>>
>>
>> On Fri, Apr 21, 2017 at 5:09 PM, Jeff Ortel <jortel at redhat.com> wrote:
>>
>>> I like this, Brian, and want to take it one step further.  I think there
>>> is value in abstracting how a
>>> publication is composed.  Files like metadata need to be composed by the
>>> publisher (as needed) in the
>>> working_dir then "added" to the publication.  Artifacts could be
>>> "associated" to the publication and the
>>> platform determines how this happens (symlinks/in the DB).
>>>
>>> Assuming the Publisher is instantiated with a 'working_dir' attribute.
>>>
>>> ---------------------------------------
>>>
>>> Something like this to kick around:
>>>
>>>
>>> class Publication:
>>>     """
>>>     The Publication provided by the plugin API.
>>>
>>>     Examples:
>>>
>>>     A crude example with lots of hand waving.
>>>
>>>     In Publisher.publish()
>>>
>>>     >>>
>>>     >>> publication = Publication(self.working_dir)
>>>     >>>
>>>     >>> # Artifacts
>>>     >>> for artifact in []: # artifacts
>>>     >>>     path = '<determine relative path>'
>>>     >>>     publication.associate(artifact, path)
>>>     >>>
>>>     >>> # Metadata created in self.staging_dir <here>.
>>>     >>>
>>>     >>> publication.add('repodata/primary.xml')
>>>     >>> publication.add('repodata/other.xml')
>>>     >>> publication.add('repodata/repomd.xml')
>>>     >>>
>>>     >>> # - OR -
>>>     >>>
>>>     >>> publication.add('repodata/')
>>>     >>>
>>>     >>> publication.commit()
>>>     """
>>>
>>>     def __init__(self, staging_dir):
>>>         """
>>>         Args:
>>>             staging_dir: Absolute path to where publication is staged.
>>>         """
>>>         self.staging_dir = staging_dir
>>>
>>>     def associate(self, artifact, path):
>>>         """
>>>         Associate an artifact to the publication.
>>>         This could result in creating a symlink in the staging directory
>>>         or (later) creating a record in the db.
>>>
>>>         Args:
>>>             artifact: A content artifact
>>>             path: Relative path within the staging directory AND
>>> eventually
>>>                   within the published URL.
>>>         """
>>>
>>>     def add(self, path):
>>>         """
>>>         Add a file within the staging directory to the publication by
>>> relative path.
>>>
>>>         Args:
>>>             path: Relative path within the staging directory AND
>>> eventually within
>>>                   the published URL.  When *path* is a directory, all
>>> files
>>>                   within the directory are added.
>>>         """
>>>
>>>     def commit(self):
>>>         """
>>>         When committed, the publication is atomically published.
>>>         """
>>>         # atomic magic
>>>
>>>
>>>
>>>
>>>
>>> On 04/19/2017 10:16 AM, Brian Bouterse wrote:
>>> > I was thinking about the design here and I wanted to share some
>>> thoughts.
>>> >
>>> > For the MVP, I think a publisher implemented by a plugin developer
>>> would write all files into the working
>>> > directory and the platform will "atomically publish" that data into
>>> the location configured by the repository.
>>> > The "atomic publish" aspect would copy/stage the files in a permanent
>>> location but would use a single symlink
>>> > to the top level folder to go live with the data. This would make
>>> atomic publication the default behavior.
>>> > This runs after the publish() implemented by the plugin developer
>>> returns, after it has written all of its
>>> > data to the working dir.
>>> >
>>> > Note that ^ allows for the plugin writer to write the actual contents
>>> of files in the working directory
>>> > instead of symlinks, causing Pulp to duplicate all content on disk
>>> with every publish. That would be an
>>> > incredibly inefficient way to write a plugin but it's something the
>>> platform would not prevent in any explicit
>>> > way. I'm not sure if this is something we should improve on or not.
>>> >
>>> > At a later point, we could add in the incremental publish maybe as a
>>> method on a Publisher called
>>> > incremental_publish() which would only be called if the previous
>>> publish only had units added.
>>> >
>>> >
>>> >
>>> > On Mon, Apr 17, 2017 at 4:22 PM, Brian Bouterse <bbouters at redhat.com> wrote:
>>> >
>>> >     For plugin writers who are writing a publisher for Pulp3, what do
>>> they need to handle during publishing
>>> >     versus platform? To make a comparison against sync, the "Download
>>> API" and "Changesets" [0] allows the
>>> >     plugin writer to tell platform about a remote piece of content.
>>> Then platform handles creating the unit,
>>> >     fetching it, and saving it. Will there be a similar API to support
>>> publishing to ease the burden of a
>>> >     plugin writer? Also will this allow platform to have a structured
>>> knowledge of a publication with Pulp3?
>>> >
>>> >     I wanted to try to characterize the problem statement as two
>>> separate questions:
>>> >
>>> >     1) How will units be recorded to allow platform to know which
>>> units comprise a specific publish?
>>> >     2) What are plugin writers' needs at publish time, and what
>>> repetitive tasks could be moved to platform?
>>> >
>>> >     As a quick recap of how Pulp2 works: each publisher would
>>> write files into the working directory and
>>> >     then they would get moved into their permanent home. Also there is
>>> the incrementalPublisher base machinery
>>> >     which allowed for an additive publication which would use the
>>> previous publish and was "faster". Finally
>>> >     in Pulp2, the only record of a publication are the symlinks on the
>>> filesystem.
>>> >
>>> >     I have some of my own ideas on these things, but I'll start the
>>> conversation.
>>> >
>>> >     [0]: https://github.com/pulp/pulp/pull/2876
>>> >
>>> >     -Brian
>>> >
>>> >
>>> >
>>> >
>>> > _______________________________________________
>>> > Pulp-dev mailing list
>>> > Pulp-dev at redhat.com
>>> > https://www.redhat.com/mailman/listinfo/pulp-dev
>>> >
>>>
>>>
>>>
>>>
>>
>>
>>
>
>
> --
>
> Michael Hrivnak
>
> Principal Software Engineer, RHCE
>
> Red Hat
>
>
>