[Pulp-dev] proposing changes to pulp 3 upload API

Thu Jun 29 20:51:02 UTC 2017

There is really one practical issue that is driving this convo (I think):
Django's file upload handling wants to save a file when we receive it. We
also don't want to be moving around files. Therefore we must save the file
in the right place on the first save().

So given ^, the question reduces to: "Where do we want to save a file that
backs an Artifact?" We can do that one of two ways: randomly or orderly.
Randomly would be inventing a uuid for each file and having that make the
path to the file unique. An orderly way of doing it would be to use a
digest instead of a uuid. Here are some path examples:

random_path_example (random uuid):
MEDIA_ROOT/artifact/uuid[0:2]/uuid[2:]
orderly_path_example (sha256 is the binary's digest):
MEDIA_ROOT/artifact/digest[0:2]/digest[2:]
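
To make the two layouts concrete, here is a minimal Python sketch. It is not
Pulp code: the MEDIA_ROOT value and the function names random_path and
orderly_path are made up for illustration.

```python
import hashlib
import uuid
from pathlib import PurePosixPath

# Assumed value, for illustration only; a real deployment sets its own MEDIA_ROOT.
MEDIA_ROOT = PurePosixPath("/var/lib/pulp")


def random_path(media_root=MEDIA_ROOT):
    """Random layout: a fresh uuid per file, so the path says nothing about the bytes."""
    name = uuid.uuid4().hex
    return media_root / "artifact" / name[0:2] / name[2:]


def orderly_path(data, media_root=MEDIA_ROOT):
    """Orderly layout: the sha256 hex digest of the bytes determines the path."""
    digest = hashlib.sha256(data).hexdigest()
    return media_root / "artifact" / digest[0:2] / digest[2:]
```

The same bytes always map to the same orderly path, while every call to
random_path yields a different one.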

Random assignment is straightforward, and it also allows one Artifact to
serve exactly one content unit, allowing CASCADE deletes to handle cleanup
easily. The problem with random assignment is that it prevents an important
down-the-road use case: "as a user who has a file backup but not a
database backup, I can recover my data without having to re-download all of
my content from remotes". Specifically, if Artifacts' paths are randomly
chosen at upload time, then if someone hands you a disk of Artifacts and
asks you to sync EPEL, there is no way Pulp can reasonably recognize that
the content it needs is already on that disk.

This is where content addressable storage comes in. If the remoteArtifact
has the sha256 hash value set from the remote metadata that was fetched,
Pulp's changesets could recognize data on disk as already downloaded. A
random layout can never do that. A tertiary outcome of using Content
Addressable Storage is that now each file backing an Artifact can only be
stored once on the filesystem. I say "tertiary outcome" and not "downside"
because even though it's harder for us to implement, users would definitely
see it as a benefit that Pulp can't duplicate content at an architectural
level.
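
Here is a rough Python sketch of both points: recognizing content already on
disk from a digest in remote metadata, and storing each file only once. The
helper names (artifact_path, already_on_disk, store_artifact) are hypothetical
and this is not how Pulp's changesets are implemented; it only assumes the
digest[0:2]/digest[2:] layout described above.

```python
import hashlib
from pathlib import Path


def artifact_path(media_root, digest):
    """Path a file with this sha256 hex digest would have in the orderly layout."""
    return Path(media_root) / "artifact" / digest[0:2] / digest[2:]


def already_on_disk(media_root, expected_sha256):
    """True if a file exists at the digest's path and its bytes really hash to it."""
    path = artifact_path(media_root, expected_sha256)
    if not path.is_file():
        return False
    hasher = hashlib.sha256()
    with path.open("rb") as f:
        # Read in 1 MiB chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == expected_sha256


def store_artifact(media_root, data):
    """Write the bytes at their digest path; a second upload of the same bytes is a no-op."""
    digest = hashlib.sha256(data).hexdigest()
    path = artifact_path(media_root, digest)
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)
    return path
```

With a random layout there is nothing to pass to already_on_disk, because the
path cannot be derived from anything in the remote metadata.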

Please send thoughts/ideas.

-Brian

On Thu, Jun 29, 2017 at 9:16 AM, Michael Hrivnak <mhrivnak at redhat.com>
wrote:

> Thanks for that explanation. That makes sense. I would describe this as
> saying there is a many-to-many relationship between Content and Artifact,
> and the ContentArtifact is the "glue" or "through" table.
>
> And again to just understand why... are we deliberately trying to
> prioritize a use case where one artifact is shared by multiple Content
> units? Can someone talk about the pros and cons of that within the context
> of this proposal? I expect it could save a small portion of disk space, but
> maybe not very much. Pulp does a pretty good job of de-duplicating at the
> Content level. Changing to a m2m relationship would definitely add more
> complexity though, and that's the aspect I'm interested in comparing to
> what value we are seeking.
>
> Separate from that relationship question, we have a use case that the
> direct-to Artifact workflow does not cover. There are multiple unit types
> where a user wants to upload a single file that represents multiple content
> units, and let pulp create one or more content units based on that file.
> For example, a docker manifest and its blobs all get saved to disk together
> (by a separate tool) and then uploaded as a tarball that Pulp can receive
> and process together. We could ask the upload client to open up the tarball
> and upload files individually I suppose. That puts more burden on the
> client though.
>
> Another example: a user can upload a comps.xml file, and pulp will parse
> it to create as many units as it finds in the XML. Pulp does not keep that
> comps.xml file, so in the proposed workflow, it would need to delete the
> Artifact at the end. It seems unexpected to utilize an Artifact as
> temporary storage in this way.
>
> I suspect we'll find more use cases like this. Thoughts? Is the FileUpload
> really worth eliminating? What I like about the current upload workflow,
> and the FileUpload workflow, is that it allows the plugin to receive any
> file or set of files that make sense within its domain, and then use that
> set of files to create units however it sees fit. It is difficult to get
> more prescriptive than that at the platform/core level.
>
> On Thu, Jun 29, 2017 at 8:47 AM, Dennis Kliban <dkliban at redhat.com> wrote:
>
>> On Thu, Jun 29, 2017 at 7:40 AM, Michael Hrivnak <mhrivnak at redhat.com>
>> wrote:
>>
>>>
>>> On Thu, Jun 29, 2017 at 7:22 AM, Dennis Kliban <dkliban at redhat.com>
>>> wrote:
>>>
>>>>
>>>> The many to many relationship is between Artifact and ContentArtifact.
>>>> This allows a content unit to have multiple Artifacts associated with it.
>>>>
>>>
>>> Could you elaborate on this? A content unit can have multiple artifacts
>>> just by artifact having a foreign key to a content unit. That's the
>>> one-to-many relationship we have on the model now in 3.0-dev.
>>>
>>> Also, what is a ContentArtifact?
>>>
>>>
>> Here are some definitions for the new proposal:
>>
>>    - Artifact - a file stored in pulp
>>    - Content - a named collection of 0 or more Artifacts that can be
>>    associated with a repository as a single unit
>>    - ContentArtifact - a relationship between an Artifact and Content.
>>    There are 0 or more ContentArtifacts for each Content.
>>    - Repository - A named collection of content.
>>    - RepositoryContent - a relationship between Content and Repository.
>>
>>
>> In the proposal we have in the MVP we have the following:
>>
>>    - FileUpload - Uploaded file that is used to create Artifacts and is
>>    then removed (definition for this is not present in the glossary of MVP)
>>    - Artifact - A file associated with one content (unit). Artifacts are
>>    not shared between content (units). Create a content unit using an uploaded
>>    file ID as the source for its metadata. Create Artifacts associated with
>>    the content unit using an uploaded file ID for each; commit as a single
>>    transaction.
>>    - Content (unit) - A single piece of content managed by Pulp. Each
>>    file associated with a content (unit) is called an Artifact. Each content
>>    (unit) may have zero or many Artifacts.
>>    - Repository - A named collection of content.
>>    - RepositoryContent - a relationship between Content and Repository
>>    (also not in the glossary of the MVP)
>>
>> In the MVP in order to add a unit to a repository, a user would:
>>
>>    1. Create a FileUpload by uploading a file
>>    2. Create an Artifact and a Content with one API call
>>    3. Associate a Content with a Repository
>>    4. Delete the FileUpload (or some cleanup job would do that for the
>>    user)
>>
>> The newly proposed workflow:
>>
>>    1. Create an Artifact by uploading a file
>>    2. Create a Content by specifying which Artifact(s) belongs to the
>>    Content and their relative paths inside the unit. This creates
>>    ContentArtifacts for each relationship.
>>    3. Associate a Content with a repository.
>>
>> In the MVP workflow, once a FileUpload is deleted, it's hard to create
>> another Content from that file. I am sure we can come up with a way to do
>> it, but it won't be as straightforward as the above workflow.
>>
>>
>>
>>>
>>> --
>>>
>>> Michael Hrivnak
>>>
>>> Principal Software Engineer, RHCE
>>>
>>> Red Hat
>>>
>>
>>
>
>
> --
>
> Michael Hrivnak
>
> Principal Software Engineer, RHCE
>
> Red Hat
>