[Pulp-dev] proposing changes to pulp 3 upload API

Wed Jun 28 18:43:42 UTC 2017

On Wed, Jun 28, 2017 at 12:44 PM, Brian Bouterse <bbouters at redhat.com>
wrote:

> For a file to be received and saved in the right place once, we need the
> view saving the file to have all the info to form the complete path. After
> talking w/ @jortel, I think we should store Artifacts at the following path:
>
> MEDIA_ROOT/content/units/digest[0:2]/digest[2:]/<rel_path>
>
> Note that digest is the Artifact's sha256 digest. This is different from
> pulp2 which used the digest of the content unit. Note that <rel_path> would
> be provided by the user along with <size> and/or <checksum_digest>.
>
> Note that this will cause an Artifact to live in exactly one place which
> means Artifacts are now unique by digest and would need to be able to be
> associated with multiple content units. I'm not sure why we didn't do this
> before, so I'm interested in exploring issues associated with this.
>
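
To make the proposed layout concrete, here is a minimal sketch of how a
view could derive the final path before writing anything. The function
name, the MEDIA_ROOT value, and the check on the relative path are my own
assumptions for illustration; only the path pattern comes from the
proposal above.

```python
import hashlib
from pathlib import PurePosixPath

# Assumption: MEDIA_ROOT would come from Django settings; /var/lib/pulp is
# used here only as an illustrative value.
MEDIA_ROOT = PurePosixPath("/var/lib/pulp")


def artifact_storage_path(data: bytes, rel_path: str) -> PurePosixPath:
    """Return MEDIA_ROOT/content/units/digest[0:2]/digest[2:]/<rel_path>."""
    # The digest is the Artifact's own sha256, not the content unit's.
    digest = hashlib.sha256(data).hexdigest()
    rel = PurePosixPath(rel_path)
    # Extra precaution (not part of the proposal): refuse absolute paths and
    # ".." so a user-supplied rel_path cannot escape the digest directory.
    if rel.is_absolute() or ".." in rel.parts:
        raise ValueError(f"unsafe relative path: {rel_path!r}")
    return MEDIA_ROOT / "content" / "units" / digest[0:2] / digest[2:] / rel
```

Because the path depends only on the bytes and the user-supplied relative
path, two uploads of the same file resolve to the same location.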

If my memory serves me correctly, we wanted to be able to have multiple
copies of an Artifact when that Artifact can be a Content Unit by itself
and also be one part of another unit, e.g. an RPM that belongs to a
distribution. I am not sure what benefit we would derive from this, but I
was hoping to jog someone's memory.
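
For comparison, a rough sketch of the "unique by digest" alternative, in
which a standalone RPM unit and a distribution share a single Artifact.
The classes below are hypothetical in-memory stand-ins, not the actual
Pulp models.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Artifact:
    digest: str
    size: int


@dataclass
class ContentUnit:
    name: str
    artifacts: list = field(default_factory=list)


class ArtifactStore:
    """Keeps exactly one Artifact per sha256 digest (illustrative only)."""

    def __init__(self):
        self._by_digest = {}

    def get_or_create(self, data: bytes) -> Artifact:
        digest = hashlib.sha256(data).hexdigest()
        # Reuse the existing Artifact when the same bytes were seen before.
        if digest not in self._by_digest:
            self._by_digest[digest] = Artifact(digest, len(data))
        return self._by_digest[digest]

    def __len__(self):
        return len(self._by_digest)


store = ArtifactStore()
rpm_bytes = b"pretend this is an RPM"
rpm_unit = ContentUnit("standalone rpm", [store.get_or_create(rpm_bytes)])
distribution = ContentUnit("distribution", [store.get_or_create(rpm_bytes)])
```

Both units end up pointing at the same Artifact object, which is the
many-to-many association the proposal would require.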

> It would be a good workflow. For a single-file content unit (e.g. an rpm),
> upload would be a two-step process.
>
> 1. POST/PUT the file's binary data and the <relative_path> and <size>
> and/or <checksum_digest> as GET parameters
> 2. Create a content unit with the unit metadata, and 0 .. n Artifacts
> referred to by ID. This could optionally associate the new unit with one
> repository as part of the atomic unit creation.
>
> Thoughts/Ideas?
>
>
If we provide an option to combine content unit creation with repo
association, this option should allow specifying multiple repositories.
Though for the MVP, I think we should support neither. Uploading a content
unit to a particular repository would involve 3 steps.

1. POST the file's binary data to the Artifact API endpoint, with
<relative_path> and <size> and/or <checksum_digest> as GET parameters.
2. POST to the Content Unit API endpoint with the unit metadata, and 0 .. n
Artifacts referred to by ID.
3. POST to the Repository Content Unit API endpoint to associate the unit
with the repository.

Step 3 would be repeated for each repository the content unit should belong
to.
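
The three steps could be sketched from the client side roughly as follows.
The base URL, endpoint paths, and JSON field names are placeholders I made
up for illustration, since none of them have been decided; the requests
are only built here, not sent.

```python
import hashlib
import json
from urllib.parse import urlencode
from urllib.request import Request

BASE = "https://pulp.example.com/api/v3"  # hypothetical base URL


def artifact_request(data: bytes, relative_path: str) -> Request:
    # Step 1: binary data in the body, metadata as GET-style parameters.
    query = urlencode({
        "relative_path": relative_path,
        "size": len(data),
        "checksum_digest": hashlib.sha256(data).hexdigest(),
    })
    return Request(f"{BASE}/artifacts/?{query}", data=data, method="POST",
                   headers={"Content-Type": "application/octet-stream"})


def content_unit_request(metadata: dict, artifact_ids: list) -> Request:
    # Step 2: unit metadata plus 0..n Artifacts referred to by ID.
    body = json.dumps({**metadata, "artifacts": artifact_ids}).encode()
    return Request(f"{BASE}/content/", data=body, method="POST",
                   headers={"Content-Type": "application/json"})


def association_requests(unit_id: str, repo_ids: list) -> list:
    # Step 3: one POST per repository the unit should belong to.
    body = json.dumps({"content_unit": unit_id}).encode()
    return [Request(f"{BASE}/repositories/{repo_id}/content/", data=body,
                    method="POST",
                    headers={"Content-Type": "application/json"})
            for repo_id in repo_ids]
```

A real client would send each request (for example with
urllib.request.urlopen) and feed the ID returned by one step into the next.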

> -Brian
>
>
> On Tue, Jun 27, 2017 at 4:16 PM, Dennis Kliban <dkliban at redhat.com> wrote:
>
>> On Tue, Jun 27, 2017 at 3:31 PM, Michael Hrivnak <mhrivnak at redhat.com>
>> wrote:
>>
>>> Could you re-summarize what problem would be solved by not having a
>>> FileUpload model, and giving the Artifact model the ability to have partial
>>> data and no Content foreign key?
>>>
>>> I understand the concern about where on the filesystem the data gets
>>> written and how many times, but I'm not seeing how that's related to
>>> whether we have a FileUpload model or not. Are we discussing two separate
>>> issues? 1) filesystem locations and copy efficiency, and 2) API design? Or
>>> is this discussion trying to connect them in a way I'm not seeing?
>>>
>>
>> There were two concerns: 1) filesystem location and copy efficiency,
>> and 2) API design.
>>
>> The first one has been addressed. Thank you for pointing out that a
>> second write will be a move operation.
>>
>> However, I am still concerned about the complexity of the API. A
>> relatively small file should not require an upload session to be uploaded.
>> A single API call to the Artifacts API should be enough to upload a file
>> and create an Artifact from it. In Pulp 3.1+ we can introduce the
>> FileUpload model to support chunked uploads. At the same time we would
>> extend the Artifact API to accept a FileUpload id for creating an Artifact.
>>
>>
>>> On Tue, Jun 27, 2017 at 3:20 PM, Dennis Kliban <dkliban at redhat.com>
>>> wrote:
>>>
>>>> On Tue, Jun 27, 2017 at 2:56 PM, Brian Bouterse <bbouters at redhat.com>
>>>> wrote:
>>>>
>>>>> Picking up from @jortel's observations...
>>>>>
>>>>> +1 to allowing Artifacts to have an optional FK.
>>>>>
>>>>> If we have an Artifacts endpoint then we can allow for the deleting of
>>>>> a single artifact if it has no FK. I think we want to disallow the removal
>>>>> of an Artifact that has a foreign key. Also filtering should allow a single
>>>>> operation to clean up all unassociated artifacts by searching for FK=None
>>>>> or similar.
>>>>>
>>>>> Yes, we will need to allow the single call delivering a file to also
>>>>> specify the relative path, size, checksums etc. Since the POST body
>>>>> contains binary data we either need to accept this data as GET style params
>>>>> or use a multi-part MIME upload [0]. Note that this creation of an Artifact
>>>>> does not change the repository contents and therefore can be handled
>>>>> synchronously outside of the tasking system.
>>>>>
>>>>> +1 to the saving of an Artifact to perform validation
>>>>>
>>>>> [0]: https://www.w3.org/Protocols/rfc1341/7_2_Multipart.html
>>>>>
>>>>>
>>>>>
>>>>> -Brian
>>>>>
>>>>
>>>> I also support this optional FK for Artifacts and validation on save.
>>>> We should probably stick with accepting GET parameters for the MVP. Though
>>>> multi-part MIME support would be good to consider for 3.1+.
>>>>
>>>>
>>>>>
>>>>> On Tue, Jun 27, 2017 at 2:44 PM, Dennis Kliban <dkliban at redhat.com>
>>>>> wrote:
>>>>>
>>>>>> On Tue, Jun 27, 2017 at 1:24 PM, Michael Hrivnak
>>>>>> <mhrivnak at redhat.com> wrote:
>>>>>>
>>>>>>>
>>>>>>> On Tue, Jun 27, 2017 at 11:27 AM, Jeff Ortel <jortel at redhat.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> - The artifact FK to a content unit would need to become optional.
>>>>>>>>
>>>>>>>> - Need to add use cases for cleaning up artifacts not associated
>>>>>>>> with a content unit.
>>>>>>>>
>>>>>>>> - The upload API would need additional information needed to create
>>>>>>>> an artifact.  Like relative path, size,
>>>>>>>> checksums etc.
>>>>>>>>
>>>>>>>> - Since (I assume) you are proposing uploading/writing directly to
>>>>>>>> artifact storage (not staging in a working
>>>>>>>> dir), the flow would need to involve (optional) validation.  If
>>>>>>>> validation fails, the artifact must not be
>>>>>>>> inserted into the DB.
>>>>>>>
>>>>>>>
>>>>>>> Perhaps a decent middle ground would be to stick with the plan of
>>>>>>> keeping uploaded (or partially uploaded) files as a separate model until
>>>>>>> they are ready to be turned into a Content instance plus artifacts, and
>>>>>>> save their file data directly to somewhere within /var/lib/pulp/. It would
>>>>>>> be some path distinct from where Artifacts are stored. That's what I had
>>>>>>> imagined we would do anyway. Then as Dennis pointed out, turning that into
>>>>>>> an Artifact would only require a move operation on the same filesystem,
>>>>>>> which is super-cheap.
>>>>>>>
>>>>>>>
>>>>>>> Would that address all the concerns? We'd write the data just once,
>>>>>>> and then move it once on the same filesystem. I haven't looked at Django's
>>>>>>> support for this recently, but it seems like it should be doable.
>>>>>>>
>>>>>> I was just looking at the Dropbox API and noticed that they provide
>>>>>> two separate API endpoints for regular file uploads[0] (< 150 MB) and large
>>>>>> file uploads[1]. It is the latter that supports chunking and requires using
>>>>>> an upload id. For the most common case they support uploading a file with
>>>>>> one API call. Our original proposal requires 2 for the same use case. Pulp
>>>>>> API users would appreciate having to only make one API call to upload a
>>>>>> file.
>>>>>>
>>>>>> [0] https://www.dropbox.com/developers-v1/core/docs#files_put
>>>>>> [1] https://www.dropbox.com/developers-v1/core/docs#chunked-upload
>>>>>>
>>>>>>
>>>>>>
>>>>>>> --
>>>>>>>
>>>>>>> Michael Hrivnak
>>>>>>>
>>>>>>> Principal Software Engineer, RHCE
>>>>>>>
>>>>>>> Red Hat
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Pulp-dev mailing list
>>>>>>> Pulp-dev at redhat.com
>>>>>>> https://www.redhat.com/mailman/listinfo/pulp-dev
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>>
>>> Michael Hrivnak
>>>
>>> Principal Software Engineer, RHCE
>>>
>>> Red Hat
>>>
>>
>>
>