[Pulp-dev] proposing changes to pulp 3 upload API

Wed Jun 28 22:38:35 UTC 2017

On Wed, Jun 28, 2017 at 1:10 PM, Jeff Ortel <jortel at redhat.com> wrote:

>
>
> On 06/28/2017 11:44 AM, Brian Bouterse wrote:
> > For a file to be received and saved in the right place once, we need the
> view saving the file to have all the
> > info to form the complete path. After talking w/ @jortel, I think we
> should store Artifacts at the following path:
> >
> > MEDIA_ROOT/content/units/digest[0:2]/digest[2:]/<rel_path>
>
> Consider:
> MEDIA_ROOT/artifact/digest[0:2]/digest[2:]/<rel_path>
>
> Since artifact would have an optional association with content.  And,
> given the many-to-many relationship, the
> content_id FK would not longer exist in the Artifact table.  Also, I have
> more plans for Artifacts in a
> "Publishing" proposal I'm writing to pulp-dev (spoiler alert).
>
> We would also want to enforce the same CAS (content addressed storage)
> uniqueness in the DB using a unique
> constraint on the Artifact.  Eg: unique (sha256, rel_path).  This ensure
> that each unique artifact (file) has
> exactly 1 DB record.
>
>
I don't think it makes sense for an Artifact to have a rel_path. It is just
a file. A ContentUnit should have the rel_path that will be used at publish
time to make the file backing the Artifact available at that rel_path. Is
my understanding of the rel_path correct? In that case the only thing that
should be unique is the sha256 digest. I've written a story that outlines
this use case: https://pulp.plan.io/issues/2843

> >
> > Note that digest is the Artifact's sha256 digest. This is different from
> pulp2 which used the digest of the
> > content unit. Note that <rel_path> would be provided by the user along
> with <size> and/or <checksum_digest>.
> >
> > Note that this will cause an Artifact to live in exactly one place which
> means Artifacts are now unique by
> > digest and would need to be able to be associated with multiple content
> units. I'm not sure why we didn't do
> > this before, so I'm interested in exploring issues associated with this.
> >
> > It would be a good workflow. For a single file content unit (e.g.) rpm
> upload would be a two step process.
> >
> > 1. POST/PUT the file's binary data and the <relative_path> and <size>
> and/or <checksum_digest> as GET parameters
> > 2. Create a content unit with the unit metadata, and 0 .. n Artifacts
> referred to by ID. This could optionally
> > associate the new unit with one repository as part of the atomic unit
> creation.
> >
> > Thoughts/Ideas?
> >
> > -Brian
> >
> >
> > On Tue, Jun 27, 2017 at 4:16 PM, Dennis Kliban <dkliban at redhat.com
> <mailto:dkliban at redhat.com>> wrote:
> >
> >     On Tue, Jun 27, 2017 at 3:31 PM, Michael Hrivnak <
> mhrivnak at redhat.com <mailto:mhrivnak at redhat.com>> wrote:
> >
> >         Could you re-summarize what problem would be solved by not
> having a FileUpload model, and giving the
> >         Artifact model the ability to have partial data and no Content
> foreign key?
> >
> >         I understand the concern about where on the filesystem the data
> gets written and how many times, but
> >         I'm not seeing how that's related to whether we have a
> FileUpload model or not. Are we discussing two
> >         separate issues? 1) filesystem locations and copy efficiency,
> and 2) API design? Or is this discussion
> >         trying to connect them in a way I'm not seeing?
> >
> >
> >     There were two concerns: 1) Filesystem  location and copy efficiency
> 2) API design
> >
> >     The first one has been addressed. Thank you for pointing out that a
> second write will be a move operation.
> >
> >     However, I am still concerned about the complexity of the API. A
> relatively small file should not require
> >     an upload session to be uploaded. A single API call to the Artifacts
> API should be enough to upload a file
> >     and create an Artifact from it. In Pulp 3.1+ we can introduce the
> FileUpload model to support chunked
> >     uploads. At the same time we would extend the Artifact API to accept
> a FileUpload id for creating an
> >     Artifact.
> >
> >
> >         On Tue, Jun 27, 2017 at 3:20 PM, Dennis Kliban <
> dkliban at redhat.com <mailto:dkliban at redhat.com>> wrote:
> >
> >             On Tue, Jun 27, 2017 at 2:56 PM, Brian Bouterse <
> bbouters at redhat.com <mailto:bbouters at redhat.com>>
> >             wrote:
> >
> >                 Picking up from @jortel's observations...
> >
> >                 +1 to allowing Artifacts to have an optional FK.
> >
> >                 If we have an Artifacts endpoint then we can allow for
> the deleting of a single artifact if it
> >                 has no FK. I think we want to disallow the removal of an
> Artifact that has a foreign key. Also
> >                 filtering should allow a single operation to clean up
> all unassociated artifacts by searching
> >                 for FK=None or similar.
> >
> >                 Yes, we will need to allow the single call delivering a
> file to also specify the relative
> >                 path, size, checksums etc. Since the POST body contains
> binary data we either need to accept
> >                 this data as GET style params or use a multi-part MIME
> upload [0]. Note that this creation of
> >                 an Artifact does not change the repository contents and
> therefore can be handled synchronously
> >                 outside of the tasking system.
> >
> >                 +1 to the saving of an Artifact to perform validation
> >
> >                 [0]: https://www.w3.org/Protocols/
> rfc1341/7_2_Multipart.html
> >                 <https://www.w3.org/Protocols/rfc1341/7_2_Multipart.html
> >
> >
> >
> >
> >                 -Brian
> >
> >
> >             I also support this optional FK for Artifacts and validation
> on save.  We should probably stick
> >             with accepting GET parameters for the MVP. Though multi-part
> MIME support would be good to
> >             consider for 3.1+.
> >
> >
> >
> >                 On Tue, Jun 27, 2017 at 2:44 PM, Dennis Kliban <
> dkliban at redhat.com
> >                 <mailto:dkliban at redhat.com>> wrote:
> >
> >                     On Tue, Jun 27, 2017 at 1:24 PM, Michael Hrivnak <
> mhrivnak at redhat.com
> >                     <mailto:mhrivnak at redhat.com>> wrote:
> >
> >
> >                         On Tue, Jun 27, 2017 at 11:27 AM, Jeff Ortel <
> jortel at redhat.com
> >                         <mailto:jortel at redhat.com>> wrote:
> >
> >
> >                             - The artifact FK to a content unit would
> need to become optional.
> >
> >                             - Need to add use cases for cleaning up
> artifacts not associated with a content unit.
> >
> >                             - The upload API would need additional
> information needed to create an artifact.
> >                             Like relative path, size,
> >                             checksums etc.
> >
> >                             - Since (I assume) you are proposing
> uploading/writing directly to artifact
> >                             storage (not staging in a working
> >                             dir), the flow would need to involve
> (optional) validation.  If validation fails,
> >                             the artifact must not be
> >                             inserted into the DB.
> >
> >
> >                         Perhaps a decent middle ground would be to stick
> with the plan of keeping uploaded (or
> >                         partially uploaded) files as a separate model
> until they are ready to be turned into a
> >                         Content instance plus artifacts, and save their
> file data directly to somewhere within
> >                         /var/lib/pulp/. It would be some path distinct
> from where Artifacts are stored. That's
> >                         what I had imagined we would do anyway. Then as
> Dennis pointed out, turning that into
> >                         an Artifact would only require a move operation
> on the same filesystem, which is
> >                         super-cheap.
> >
> >
> >                         Would that address all the concerns? We'd write
> the data just once, and then move it
> >                         once on the same filesystem. I haven't looked at
> django's support for this recently,
> >                         but it seems like it should be doable.
> >
> >                     I was just looking at the dropbox API and noticed
> that they provide two separate API
> >                     endpoints for regular file uploads[0] (< 150mb) and
> large file uploads[1]. It is the
> >                     latter that supports chunking and requires using an
> upload id. For the most common case
> >                     they support uploading a file with one API call. Our
> original proposal requires 2 for the
> >                     same use case. Pulp API users would appreciate
> having to only make one API call to upload
> >                     a file.
> >
> >                     [0] https://www.dropbox.com/
> developers-v1/core/docs#files_put
> >                     <https://www.dropbox.com/
> developers-v1/core/docs#files_put>
> >                     [1] https://www.dropbox.com/developers-v1/core/docs#
> chunked-upload
> >                     <https://www.dropbox.com/developers-v1/core/docs#
> chunked-upload>
> >
> >
> >
> >                         --
> >
> >                         Michael Hrivnak
> >
> >                         Principal Software Engineer, RHCE
> >
> >                         Red Hat
> >
> >
> >                         _______________________________________________
> >                         Pulp-dev mailing list
> >                         Pulp-dev at redhat.com <mailto:Pulp-dev at redhat.com>
> >                         https://www.redhat.com/mailman/listinfo/pulp-dev
> >                         <https://www.redhat.com/
> mailman/listinfo/pulp-dev>
> >
> >
> >
> >                     _______________________________________________
> >                     Pulp-dev mailing list
> >                     Pulp-dev at redhat.com <mailto:Pulp-dev at redhat.com>
> >                     https://www.redhat.com/mailman/listinfo/pulp-dev
> >                     <https://www.redhat.com/mailman/listinfo/pulp-dev>
> >
> >
> >
> >
> >
> >
> >         --
> >
> >         Michael Hrivnak
> >
> >         Principal Software Engineer, RHCE
> >
> >         Red Hat
> >
> >
> >
> >
> >
> > _______________________________________________
> > Pulp-dev mailing list
> > Pulp-dev at redhat.com
> > https://www.redhat.com/mailman/listinfo/pulp-dev
> >
>
>
> _______________________________________________
> Pulp-dev mailing list
> Pulp-dev at redhat.com
> https://www.redhat.com/mailman/listinfo/pulp-dev
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/pulp-dev/attachments/20170628/c0557779/attachment.htm>