[Pulp-dev] proposing changes to pulp 3 upload API

Brian Bouterse bbouters at redhat.com
Wed Jun 28 16:44:29 UTC 2017


For a file to be received and saved in the right place once, we need the
view saving the file to have all the info to form the complete path. After
talking w/ @jortel, I think we should store Artifacts at the following path:

MEDIA_ROOT/content/units/digest[0:2]/digest[2:]/<rel_path>

Note that digest is the Artifact's sha256 digest. This is different from
pulp2, which used the digest of the content unit. <rel_path> would be
provided by the user along with <size> and/or <checksum_digest>.
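As a minimal sketch of that layout (the function name and the MEDIA_ROOT-relative return value are illustrative, not part of the proposal):

```python
import hashlib


def artifact_storage_path(data, rel_path):
    """Return the storage path for an Artifact's binary data, relative to
    MEDIA_ROOT: content/units/digest[0:2]/digest[2:]/<rel_path>.

    digest is the sha256 hex digest of the Artifact's own bytes (not the
    content unit's, as in pulp2).
    """
    digest = hashlib.sha256(data).hexdigest()
    return "content/units/{}/{}/{}".format(digest[0:2], digest[2:], rel_path)
```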

Note that this will cause an Artifact to live in exactly one place, which
means Artifacts are now unique by digest and would need to be associable
with multiple content units. I'm not sure why we didn't do this before, so
I'm interested in exploring issues associated with this.

This would make for a good workflow. For a single-file content unit (e.g.
an rpm), upload would be a two-step process:

1. POST/PUT the file's binary data, with the <relative_path> and <size>
and/or <checksum_digest> as GET parameters
2. Create a content unit with the unit metadata and 0..n Artifacts
referred to by ID. This could optionally associate the new unit with one
repository as part of the atomic unit creation.
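The two steps above could be sketched as client-side request construction. All parameter names, field names, and helpers here are assumptions for illustration, not the actual Pulp 3 API:

```python
import hashlib

# Hypothetical helpers showing what each of the two calls would carry.
# Field and parameter names are illustrative assumptions, not the real API.


def step1_params(data, relative_path):
    """GET-style parameters for step 1: the POST/PUT whose body is the raw
    file bytes, so the metadata travels in the query string."""
    return {
        "relative_path": relative_path,
        "size": len(data),
        "checksum_digest": hashlib.sha256(data).hexdigest(),
    }


def step2_body(metadata, artifact_ids, repository=None):
    """JSON body for step 2: create a content unit from its metadata plus
    0..n Artifact IDs, optionally associating one repository in the same
    atomic creation."""
    body = dict(metadata)
    body["artifacts"] = list(artifact_ids)
    if repository is not None:
        body["repository"] = repository
    return body
```

Step 1 would return the new Artifact's ID synchronously; step 2 then references that ID, so no upload session or tasking-system round trip is needed for the common case.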

Thoughts/Ideas?

-Brian


On Tue, Jun 27, 2017 at 4:16 PM, Dennis Kliban <dkliban at redhat.com> wrote:

> On Tue, Jun 27, 2017 at 3:31 PM, Michael Hrivnak <mhrivnak at redhat.com>
> wrote:
>
>> Could you re-summarize what problem would be solved by not having a
>> FileUpload model, and giving the Artifact model the ability to have partial
>> data and no Content foreign key?
>>
>> I understand the concern about where on the filesystem the data gets
>> written and how many times, but I'm not seeing how that's related to
>> whether we have a FileUpload model or not. Are we discussing two separate
>> issues? 1) filesystem locations and copy efficiency, and 2) API design? Or
>> is this discussion trying to connect them in a way I'm not seeing?
>>
>
> There were two concerns: 1) filesystem location and copy efficiency, and
> 2) API design.
>
> The first one has been addressed. Thank you for pointing out that a second
> write will be a move operation.
>
> However, I am still concerned about the complexity of the API. A
> relatively small file should not require an upload session to be uploaded.
> A single API call to the Artifacts API should be enough to upload a file
> and create an Artifact from it. In Pulp 3.1+ we can introduce the
> FileUpload model to support chunked uploads. At the same time we would
> extend the Artifact API to accept a FileUpload id for creating an Artifact.
>
>
>> On Tue, Jun 27, 2017 at 3:20 PM, Dennis Kliban <dkliban at redhat.com>
>> wrote:
>>
>>> On Tue, Jun 27, 2017 at 2:56 PM, Brian Bouterse <bbouters at redhat.com>
>>> wrote:
>>>
>>>> Picking up from @jortel's observations...
>>>>
>>>> +1 to allowing Artifacts to have an optional FK.
>>>>
>>>> If we have an Artifacts endpoint then we can allow for the deleting of
>>>> a single artifact if it has no FK. I think we want to disallow the removal
>>>> of an Artifact that has a foreign key. Also filtering should allow a single
>>>> operation to clean up all unassociated artifacts by searching for FK=None
>>>> or similar.
>>>>
>>>> Yes, we will need to allow the single call delivering a file to also
>>>> specify the relative path, size, checksums, etc. Since the POST body
>>>> contains binary data, we either need to accept this metadata as GET-style
>>>> params or use a multipart MIME upload [0]. Note that this creation of an
>>>> Artifact does not change the repository contents and therefore can be
>>>> handled synchronously, outside of the tasking system.
>>>>
>>>> +1 to the saving of an Artifact to perform validation
>>>>
>>>> [0]: https://www.w3.org/Protocols/rfc1341/7_2_Multipart.html
>>>>
>>>>
>>>> -Brian
>>>>
>>>
>>> I also support this optional FK for Artifacts and validation on save.
>>> We should probably stick with accepting GET parameters for the MVP. Though
>>> multi-part MIME support would be good to consider for 3.1+.
>>>
>>>
>>>>
>>>> On Tue, Jun 27, 2017 at 2:44 PM, Dennis Kliban <dkliban at redhat.com>
>>>> wrote:
>>>>
>>>>> On Tue, Jun 27, 2017 at 1:24 PM, Michael Hrivnak <mhrivnak at redhat.com>
>>>>> wrote:
>>>>>
>>>>>>
>>>>>> On Tue, Jun 27, 2017 at 11:27 AM, Jeff Ortel <jortel at redhat.com>
>>>>>> wrote:
>>>>>>
>>>>>>>
>>>>>>> - The artifact FK to a content unit would need to become optional.
>>>>>>>
>>>>>>> - Need to add use cases for cleaning up artifacts not associated
>>>>>>> with a content unit.
>>>>>>>
>>>>>>> - The upload API would need additional information needed to create
>>>>>>> an artifact.  Like relative path, size,
>>>>>>> checksums etc.
>>>>>>>
>>>>>>> - Since (I assume) you are proposing uploading/writing directly to
>>>>>>> artifact storage (not staging in a working
>>>>>>> dir), the flow would need to involve (optional) validation.  If
>>>>>>> validation fails, the artifact must not be
>>>>>>> inserted into the DB.
>>>>>>
>>>>>>
>>>>>> Perhaps a decent middle ground would be to stick with the plan of
>>>>>> keeping uploaded (or partially uploaded) files as a separate model until
>>>>>> they are ready to be turned into a Content instance plus artifacts, and
>>>>>> save their file data directly to somewhere within /var/lib/pulp/. It would
>>>>>> be some path distinct from where Artifacts are stored. That's what I had
>>>>>> imagined we would do anyway. Then as Dennis pointed out, turning that into
>>>>>> an Artifact would only require a move operation on the same filesystem,
>>>>>> which is super-cheap.
>>>>>>
>>>>>>
>>>>>> Would that address all the concerns? We'd write the data just once,
>>>>>> and then move it once on the same filesystem. I haven't looked at django's
>>>>>> support for this recently, but it seems like it should be doable.
>>>>>>
>>>>>> I was just looking at the dropbox API and noticed that they provide
>>>>> two separate API endpoints for regular file uploads[0] (< 150mb) and large
>>>>> file uploads[1]. It is the latter that supports chunking and requires using
>>>>> an upload id. For the most common case they support uploading a file with
>>>>> one API call. Our original proposal requires two for the same use case.
>>>>> Pulp API users would appreciate only having to make one API call to
>>>>> upload a file.
>>>>>
>>>>> [0] https://www.dropbox.com/developers-v1/core/docs#files_put
>>>>> [1] https://www.dropbox.com/developers-v1/core/docs#chunked-upload
>>>>>
>>>>>
>>>>>
>>>>>> --
>>>>>>
>>>>>> Michael Hrivnak
>>>>>>
>>>>>> Principal Software Engineer, RHCE
>>>>>>
>>>>>> Red Hat
>>>>>>
>>>>>> _______________________________________________
>>>>>> Pulp-dev mailing list
>>>>>> Pulp-dev at redhat.com
>>>>>> https://www.redhat.com/mailman/listinfo/pulp-dev
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>>
>> --
>>
>> Michael Hrivnak
>>
>> Principal Software Engineer, RHCE
>>
>> Red Hat
>>
>
>

