[Pulp-dev] proposing changes to pulp 3 upload API

Dennis Kliban dkliban at redhat.com
Thu Jun 29 11:22:41 UTC 2017


On Wed, Jun 28, 2017 at 11:55 PM, Michael Hrivnak <mhrivnak at redhat.com>
wrote:

> For a unit like a Distribution, the relative path of the file does matter
> with respect to other files associated with the same content unit and needs
> to be preserved. That content type consists of a collection of arbitrary
> files in an arbitrary directory structure that we have to just preserve. I
> suppose it could be a field on the many-to-many table between an artifact
> and its content. If it were a field on the artifact, and it were part of
> the uniqueness constraint together with the checksum, that could also work.
> But it would oppose the goal of deduplicating files.
>

I definitely misspoke in my last email about the relative path belonging to
the content unit. As you pointed it out it belongs with the ContentArtifact
which represents the relationship between an artifact and a content unit.


>
> Speaking of, is that the goal of making artifacts and content a
> many-to-many relationship? Otherwise could someone re-summarize why that's
> being proposed?
>


The many to many relationship is between Artifact and ContentArtifact. This
allows a content unit to have multiple Artifacts associated with it.


>
> On Wed, Jun 28, 2017 at 6:38 PM, Dennis Kliban <dkliban at redhat.com> wrote:
>
>> On Wed, Jun 28, 2017 at 1:10 PM, Jeff Ortel <jortel at redhat.com> wrote:
>>
>>>
>>>
>>> On 06/28/2017 11:44 AM, Brian Bouterse wrote:
>>> > For a file to be received and saved in the right place once, we need
>>> the view saving the file to have all the
>>> > info to form the complete path. After talking w/ @jortel, I think we
>>> should store Artifacts at the following path:
>>> >
>>> > MEDIA_ROOT/content/units/digest[0:2]/digest[2:]/<rel_path>
>>>
>>> Consider:
>>> MEDIA_ROOT/artifact/digest[0:2]/digest[2:]/<rel_path>
>>>
>>> Since artifact would have an optional association with content.  And,
>>> given the many-to-many relationship, the
>>> content_id FK would not longer exist in the Artifact table.  Also, I
>>> have more plans for Artifacts in a
>>> "Publishing" proposal I'm writing to pulp-dev (spoiler alert).
>>>
>>> We would also want to enforce the same CAS (content addressed storage)
>>> uniqueness in the DB using a unique
>>> constraint on the Artifact.  Eg: unique (sha256, rel_path).  This ensure
>>> that each unique artifact (file) has
>>> exactly 1 DB record.
>>>
>>>
>> I don't think it makes sense for an Artifact to have a rel_path. It is
>> just a file. A ContentUnit should have the rel_path that will be used at
>> publish time to make the file backing the Artifact available at that
>> rel_path. Is my understanding of the rel_path correct? In that case the
>> only thing that should be unique is the sha256 digest. I've written a story
>> that outlines this use case: https://pulp.plan.io/issues/2843
>>
>>
>>
>>
>>
>>> >
>>> > Note that digest is the Artifact's sha256 digest. This is different
>>> from pulp2 which used the digest of the
>>> > content unit. Note that <rel_path> would be provided by the user along
>>> with <size> and/or <checksum_digest>.
>>> >
>>> > Note that this will cause an Artifact to live in exactly one place
>>> which means Artifacts are now unique by
>>> > digest and would need to be able to be associated with multiple
>>> content units. I'm not sure why we didn't do
>>> > this before, so I'm interested in exploring issues associated with
>>> this.
>>> >
>>> > It would be a good workflow. For a single file content unit (e.g.) rpm
>>> upload would be a two step process.
>>> >
>>> > 1. POST/PUT the file's binary data and the <relative_path> and <size>
>>> and/or <checksum_digest> as GET parameters
>>> > 2. Create a content unit with the unit metadata, and 0 .. n Artifacts
>>> referred to by ID. This could optionally
>>> > associate the new unit with one repository as part of the atomic unit
>>> creation.
>>> >
>>> > Thoughts/Ideas?
>>> >
>>> > -Brian
>>> >
>>> >
>>> > On Tue, Jun 27, 2017 at 4:16 PM, Dennis Kliban <dkliban at redhat.com
>>> <mailto:dkliban at redhat.com>> wrote:
>>> >
>>> >     On Tue, Jun 27, 2017 at 3:31 PM, Michael Hrivnak <
>>> mhrivnak at redhat.com <mailto:mhrivnak at redhat.com>> wrote:
>>> >
>>> >         Could you re-summarize what problem would be solved by not
>>> having a FileUpload model, and giving the
>>> >         Artifact model the ability to have partial data and no Content
>>> foreign key?
>>> >
>>> >         I understand the concern about where on the filesystem the
>>> data gets written and how many times, but
>>> >         I'm not seeing how that's related to whether we have a
>>> FileUpload model or not. Are we discussing two
>>> >         separate issues? 1) filesystem locations and copy efficiency,
>>> and 2) API design? Or is this discussion
>>> >         trying to connect them in a way I'm not seeing?
>>> >
>>> >
>>> >     There were two concerns: 1) Filesystem  location and copy
>>> efficiency 2) API design
>>> >
>>> >     The first one has been addressed. Thank you for pointing out that
>>> a second write will be a move operation.
>>> >
>>> >     However, I am still concerned about the complexity of the API. A
>>> relatively small file should not require
>>> >     an upload session to be uploaded. A single API call to the
>>> Artifacts API should be enough to upload a file
>>> >     and create an Artifact from it. In Pulp 3.1+ we can introduce the
>>> FileUpload model to support chunked
>>> >     uploads. At the same time we would extend the Artifact API to
>>> accept a FileUpload id for creating an
>>> >     Artifact.
>>> >
>>> >
>>> >         On Tue, Jun 27, 2017 at 3:20 PM, Dennis Kliban <
>>> dkliban at redhat.com <mailto:dkliban at redhat.com>> wrote:
>>> >
>>> >             On Tue, Jun 27, 2017 at 2:56 PM, Brian Bouterse <
>>> bbouters at redhat.com <mailto:bbouters at redhat.com>>
>>> >             wrote:
>>> >
>>> >                 Picking up from @jortel's observations...
>>> >
>>> >                 +1 to allowing Artifacts to have an optional FK.
>>> >
>>> >                 If we have an Artifacts endpoint then we can allow for
>>> the deleting of a single artifact if it
>>> >                 has no FK. I think we want to disallow the removal of
>>> an Artifact that has a foreign key. Also
>>> >                 filtering should allow a single operation to clean up
>>> all unassociated artifacts by searching
>>> >                 for FK=None or similar.
>>> >
>>> >                 Yes, we will need to allow the single call delivering
>>> a file to also specify the relative
>>> >                 path, size, checksums etc. Since the POST body
>>> contains binary data we either need to accept
>>> >                 this data as GET style params or use a multi-part MIME
>>> upload [0]. Note that this creation of
>>> >                 an Artifact does not change the repository contents
>>> and therefore can be handled synchronously
>>> >                 outside of the tasking system.
>>> >
>>> >                 +1 to the saving of an Artifact to perform validation
>>> >
>>> >                 [0]: https://www.w3.org/Protocols/r
>>> fc1341/7_2_Multipart.html
>>> >                 <https://www.w3.org/Protocols
>>> /rfc1341/7_2_Multipart.html>
>>> >
>>> >
>>> >
>>> >                 -Brian
>>> >
>>> >
>>> >             I also support this optional FK for Artifacts and
>>> validation on save.  We should probably stick
>>> >             with accepting GET parameters for the MVP. Though
>>> multi-part MIME support would be good to
>>> >             consider for 3.1+.
>>> >
>>> >
>>> >
>>> >                 On Tue, Jun 27, 2017 at 2:44 PM, Dennis Kliban <
>>> dkliban at redhat.com
>>> >                 <mailto:dkliban at redhat.com>> wrote:
>>> >
>>> >                     On Tue, Jun 27, 2017 at 1:24 PM, Michael Hrivnak <
>>> mhrivnak at redhat.com
>>> >                     <mailto:mhrivnak at redhat.com>> wrote:
>>> >
>>> >
>>> >                         On Tue, Jun 27, 2017 at 11:27 AM, Jeff Ortel <
>>> jortel at redhat.com
>>> >                         <mailto:jortel at redhat.com>> wrote:
>>> >
>>> >
>>> >                             - The artifact FK to a content unit would
>>> need to become optional.
>>> >
>>> >                             - Need to add use cases for cleaning up
>>> artifacts not associated with a content unit.
>>> >
>>> >                             - The upload API would need additional
>>> information needed to create an artifact.
>>> >                             Like relative path, size,
>>> >                             checksums etc.
>>> >
>>> >                             - Since (I assume) you are proposing
>>> uploading/writing directly to artifact
>>> >                             storage (not staging in a working
>>> >                             dir), the flow would need to involve
>>> (optional) validation.  If validation fails,
>>> >                             the artifact must not be
>>> >                             inserted into the DB.
>>> >
>>> >
>>> >                         Perhaps a decent middle ground would be to
>>> stick with the plan of keeping uploaded (or
>>> >                         partially uploaded) files as a separate model
>>> until they are ready to be turned into a
>>> >                         Content instance plus artifacts, and save
>>> their file data directly to somewhere within
>>> >                         /var/lib/pulp/. It would be some path distinct
>>> from where Artifacts are stored. That's
>>> >                         what I had imagined we would do anyway. Then
>>> as Dennis pointed out, turning that into
>>> >                         an Artifact would only require a move
>>> operation on the same filesystem, which is
>>> >                         super-cheap.
>>> >
>>> >
>>> >                         Would that address all the concerns? We'd
>>> write the data just once, and then move it
>>> >                         once on the same filesystem. I haven't looked
>>> at django's support for this recently,
>>> >                         but it seems like it should be doable.
>>> >
>>> >                     I was just looking at the dropbox API and noticed
>>> that they provide two separate API
>>> >                     endpoints for regular file uploads[0] (< 150mb)
>>> and large file uploads[1]. It is the
>>> >                     latter that supports chunking and requires using
>>> an upload id. For the most common case
>>> >                     they support uploading a file with one API call.
>>> Our original proposal requires 2 for the
>>> >                     same use case. Pulp API users would appreciate
>>> having to only make one API call to upload
>>> >                     a file.
>>> >
>>> >                     [0] https://www.dropbox.com/develo
>>> pers-v1/core/docs#files_put
>>> >                     <https://www.dropbox.com/deve
>>> lopers-v1/core/docs#files_put>
>>> >                     [1] https://www.dropbox.com/develo
>>> pers-v1/core/docs#chunked-upload
>>> >                     <https://www.dropbox.com/deve
>>> lopers-v1/core/docs#chunked-upload>
>>> >
>>> >
>>> >
>>> >                         --
>>> >
>>> >                         Michael Hrivnak
>>> >
>>> >                         Principal Software Engineer, RHCE
>>> >
>>> >                         Red Hat
>>> >
>>> >
>>> >                         _____________________________
>>> __________________
>>> >                         Pulp-dev mailing list
>>> >                         Pulp-dev at redhat.com <mailto:
>>> Pulp-dev at redhat.com>
>>> >                         https://www.redhat.com/mailma
>>> n/listinfo/pulp-dev
>>> >                         <https://www.redhat.com/mailm
>>> an/listinfo/pulp-dev>
>>> >
>>> >
>>> >
>>> >                     _______________________________________________
>>> >                     Pulp-dev mailing list
>>> >                     Pulp-dev at redhat.com <mailto:Pulp-dev at redhat.com>
>>> >                     https://www.redhat.com/mailman/listinfo/pulp-dev
>>> >                     <https://www.redhat.com/mailman/listinfo/pulp-dev>
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >         --
>>> >
>>> >         Michael Hrivnak
>>> >
>>> >         Principal Software Engineer, RHCE
>>> >
>>> >         Red Hat
>>> >
>>> >
>>> >
>>> >
>>> >
>>> > _______________________________________________
>>> > Pulp-dev mailing list
>>> > Pulp-dev at redhat.com
>>> > https://www.redhat.com/mailman/listinfo/pulp-dev
>>> >
>>>
>>>
>>> _______________________________________________
>>> Pulp-dev mailing list
>>> Pulp-dev at redhat.com
>>> https://www.redhat.com/mailman/listinfo/pulp-dev
>>>
>>>
>>
>> _______________________________________________
>> Pulp-dev mailing list
>> Pulp-dev at redhat.com
>> https://www.redhat.com/mailman/listinfo/pulp-dev
>>
>>
>
>
> --
>
> Michael Hrivnak
>
> Principal Software Engineer, RHCE
>
> Red Hat
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/pulp-dev/attachments/20170629/cf5ca57f/attachment.htm>


More information about the Pulp-dev mailing list