[Pulp-dev] uniqueness constraints within a repository version

Tatiana Tereshchenko ttereshc at redhat.com
Tue Jul 30 13:48:17 UTC 2019

After some offline discussion with several Pulp devs, we decided to
dedicate this thread to one problem - duplicates (and move the other
problem - filtering/validation - to a different thread).

The current proposal is to have a repo_key on a content model (thanks,
Simon) and ensure its uniqueness at repository version creation time in
I updated the issue https://pulp.plan.io/issues/5008 accordingly, see
'Solution' section of the description.
Please share your thoughts or concerns, preferably on the issue.

Thank you,

On Wed, Jul 24, 2019 at 10:02 PM Brian Bouterse <bbouters at redhat.com> wrote:

> On Mon, Jul 22, 2019 at 4:47 AM Tatiana Tereshchenko <ttereshc at redhat.com>
> wrote:
>> On Sun, Jul 21, 2019 at 3:00 PM Brian Bouterse <bbouters at redhat.com>
>> wrote:
>>> On Sun, Jul 21, 2019 at 6:23 AM Tatiana Tereshchenko <
>>> ttereshc at redhat.com> wrote:
>>>> +1 to the idea of a repo_key.
>>>> Should we also add the ability to apply custom validation of the
>>>> content being added?
>>>> Similar to a repo_key, Content model can optionally provide an
>>>> additional validator.
>>>> Use cases:
>>>>  - for pulp_file to avoid relative path overlap - e.g. 'a/b' and 'a'
>>> In thinking this over more, I'm unsure that pulp_file has the use case.
>>> Two different Artifacts having relative paths 'a' and 'a/b' in one repo
>>> version doesn't seem problematic. This problem statement is similar to the
>>> Distribution.base_path overlap problem statement where it's unavoidably
>>> ambiguous which Distribution should be matched when base_paths are allowed
>>> to overlap. In this case for pulp_file, it's not ambiguous in the same way,
>>> the relative_path I expect to match to exactly 1 content unit either 'a',
>>> or 'a/b', but not both. What do you think about this?
>> I agree that the problem is similar to the Distribution.base_path
>> overlap. If I understand you correctly, yes, it's not a problem if you
>> query content one by one.
>> What about use cases when we want to have a repo version on a filesystem?
>> E.g. Browsable repositoiries (this feature has already been asked for by
>> our stakeholders), export (e.g. rsync).
> I see now that the exporting to a filesystem use case make any repo
> containing units 'a' and 'a/b' because 'a' can't be both a directory and a
> file on POSIX filesystems. We should identify this requirement in our
> plugin docs somehow.
> I was thinking webservers themselves are in the same situation. They have
> to store content on disk, but they also have to serve 'a' and 'a/b' to the
> user. To make this work they have 'a' actually serve up index.html. In
> practice the binary data served at 'a' isn't stored at 'a' on the
> filesystem but actually 'a/index.html' or 'a/index.txt' or something. The
> data you would serve up as a directory's response would be an index of the
> directory so this makes logical sense to me also. At export time one option
> is to translate 'a' to 'a/index.html' which in terms of having webservers
> serving it back up is kind of necessary. What do you think?
> If ^ is what we do then the overlapping 'a' and 'a/b' I think would be ok
> again. What do you think about this?
>>>  - for pulp_rpm to filter by signature/signing key
>>> Can we expand on this use case a bit? Is it that the repo version should
>>> only contains units signed or unsigned rpms? Or is it that we are ok with a
>>> mixture as long as each NEVRA is unique? I suspect the former, but I want
>>> to be sure.
>> I think it should contain only signed units and optionally signed by
>> specific keys only. See pulp 2 feature description
>> https://docs.pulpproject.org/plugins/pulp_rpm/user-guide/features.html#package-signatures-and-gpg-key-id-filtering
>> Another use case which comes to mind is: keeping the last N versions of a
>> unit within a repo verison.
> I agree w/ this use case. In the context of repo_key, I don't see a
> relationship between the repo_key mechanism for preserving repo-level
> uniqueness and how it would also resolve this. Did you have an idea that
> could handle all of these cases with one mechanism?
>> Tanya
>>>> Plugins can solve it by defining their own stage but it seems like
>>>> almost any plugin needs to ensure absence of collisions specific to it,
>>>> even the simple pulp_file.
>>>> It means that our default pipeline becomes less useful and will be
>>>> hardly ever used by any [currently known] plugins.
>>>> Any thoughts?
>>>> Tanya
>>>> On Mon, Jul 8, 2019 at 9:09 PM Brian Bouterse <bbouters at redhat.com>
>>>> wrote:
>>>>> I want to retell Simon's proposal to have "Content defines a
>>>>> 'repo_key' similar to a unit_key. This key must be unique within a repo
>>>>> version (and not globally like the unit_key."
>>>>> We could adopt his proposal to have the repo_key tuple defined on
>>>>> Content in pulpcore. If we left the add/remove APIs in core and adopt for
>>>>> both sync and add/remove a "keep newest to associate" functionality
>>>>> described earlier in the thread. This "keep newest to associate" code would
>>>>> be used by sync in the form of a core stage that is a generalized version
>>>>> of the RemoveDuplicates stage. This would become part of the default
>>>>> pipeline for all users of Stages API. I think this would be better than
>>>>> plugin writers implementing it over and over and also less effort for
>>>>> plugin wrtiers. This design would meet the current needs of pulp_cookbook,
>>>>> pulp_file, and pulp_rpm which are the only 3 places I know we have this
>>>>> problem so far, but I believe more content types are susceptible to this.
>>>>> What do you think we should do?
>>>>> Thanks!
>>>>> Brian
>>>>> On Thu, Jun 27, 2019 at 4:03 AM Tatiana Tereshchenko <
>>>>> ttereshc at redhat.com> wrote:
>>>>>> Sure, the code can be de-duplicated.
>>>>>> My main worry is that it's a responsibility of a plugin writer not to
>>>>>> forget to ensure uniqueness constraints within a repo version for every
>>>>>> workflow (sync, copy, anything else) where a repo version is created.
>>>>>> Every time before RepositoryVersion.create() is called, there should
>>>>>> be a check that there is no colliding content in a repo version.
>>>>>> It would be much more reliable and friendly, in my opinion, if plugin
>>>>>> writer could define rules/callbacks/whatever for each content and it would
>>>>>> be applied to any repository creation.
>>>>>> At the same time this eliminates the flexibility to define different
>>>>>> logic for content collision for different workflows, however I'm not sure
>>>>>> if such a use case exists or is desired.
>>>>>> Tanya
>>>>>> On Wed, Jun 26, 2019 at 6:49 PM Austin Macdonald <amacdona at redhat.com>
>>>>>> wrote:
>>>>>>> @Tanya Tereshchenko <ttereshc at redhat.com>
>>>>>>>> Do I understand correctly that it doesn't cover the sync case and
>>>>>>>> it's only about explicit repo version creation?
>>>>>>> I don't mean that add/remove could not share code with remove
>>>>>>> duplicate stage. I wanted to point out that we have a problem here (how to
>>>>>>> remove duplicates) that has similar patterns to other problems with add
>>>>>>> remove (recursive, copy, deciding which content to keep with a collision,
>>>>>>> etc.) I don't doubt that pulpcore could help solve these problems, but I
>>>>>>> think that as we approach our GA, we should consider solving this problem
>>>>>>> (for now) by getting out of the way of plugin writers rather than by
>>>>>>> implementing code that is supposed to work for all plugins. I suspect that
>>>>>>> plenty of the plugins will be implementing their own add/remove anyway.
>>>>>>> On Tue, Jun 25, 2019 at 12:56 PM David Davis <daviddavis at redhat.com>
>>>>>>> wrote:
>>>>>>>> I don't think this solution would work in the case of creating a
>>>>>>>> new repository version. Suppose for example you had two content units that
>>>>>>>> collide, one in a repo version and one older unit that a user explicitly
>>>>>>>> wants to add to the repo version. If the latter one is older, then what
>>>>>>>> would happen?
>>>>>>>> David
>>>>>>>> On Tue, Jun 25, 2019 at 12:48 PM Brian Bouterse <
>>>>>>>> bbouters at redhat.com> wrote:
>>>>>>>>> Having a way for units to express their uniqueness per repo sounds
>>>>>>>>> good because then more areas of Pulp's code could answer the question:
>>>>>>>>> "will I have a duplicate if I add content X to repo_version Y".
>>>>>>>>> Let's assume we know that situation is about to occur during sync
>>>>>>>>> for example, what do we do about it? In the errata case we know the "new"
>>>>>>>>> one should replace the existing one. Maybe we start to 'order' the units
>>>>>>>>> with colliding repo keys and keep the newest one always? Would this work
>>>>>>>>> for pulp_cookbook and pulp_rpm? Would it generalize? Is this what you
>>>>>>>>> imagined?
>>>>>>>>> On Tue, Jun 25, 2019 at 5:30 AM Tatiana Tereshchenko <
>>>>>>>>> ttereshc at redhat.com> wrote:
>>>>>>>>>> Do I understand correctly that it doesn't cover the sync case and
>>>>>>>>>> it's only about explicit repo version creation?
>>>>>>>>>> So the suggestion is to implement the same logic twice: for sync
>>>>>>>>>> case - RemoveDuplicates stage and/or maybe some custom stage (e.g. to
>>>>>>>>>> disallow overlapping paths), and for direct repo version creation - your
>>>>>>>>>> proposal.
>>>>>>>>>> On Mon, Jun 24, 2019 at 3:13 PM Austin Macdonald <
>>>>>>>>>> amacdona at redhat.com> wrote:
>>>>>>>>>>> I have a design in mind for solving this problem:
>>>>>>>>>>> 1. Remove POST to RepositoryVersion (no general add/remove
>>>>>>>>>>> endpoint).
>>>>>>>>>>> 2. Add an endpoint to kick off an add/remove task, namespaced by
>>>>>>>>>>> plugin. ie `POST pulp/api/v3/docker/add-remove/`
>>>>>>>>>>>    This view can be provided to all plugins by the plugin
>>>>>>>>>>> template, and will be based on the current RepositoryVersionCreate:
>>>>>>>>>>> https://github.com/pulp/pulpcore/blob/master/pulpcore/app/viewsets/repository.py#L221-L258
>>>>>>>>>>>    Note: the main purpose of this view is to kick off the
>>>>>>>>>>> general add/remove task, which will be unchanged:
>>>>>>>>>>> https://github.com/pulp/pulpcore/blob/master/pulpcore/app/tasks/repository.py#L70
>>>>>>>>>>> 3. Add an add/remove serializer to the plugin API.
>>>>>>>>>>> 3. Plugins needing further customization can provide their own
>>>>>>>>>>> task and subclassed serializer.
>>>>>>>>>>> This gives the plugin writer full control over the endpoint
>>>>>>>>>>> (customizable arguments and validation), and full control over the flow
>>>>>>>>>>> (extra logic, depsolving, enforced uniqueness). It only uses the existing
>>>>>>>>>>> patterns (and existing required knowledge), but requires no work (other
>>>>>>>>>>> than using the template) for the simple case.
>>>>>>>>>>> On Mon, Jun 3, 2019 at 2:56 PM Simon Baatz <gmbnomis at gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>> On Mon, Jun 03, 2019 at 09:11:07AM -0400, David Davis wrote:
>>>>>>>>>>>> >    @Simon I like the idea behind the repo_key solution you
>>>>>>>>>>>> came up with.
>>>>>>>>>>>> >    Can you be more specific around cases you think that it
>>>>>>>>>>>> couldn't
>>>>>>>>>>>> >    handle? I imagine that plugin writers could use properties
>>>>>>>>>>>> or
>>>>>>>>>>>> >    denormailzation (ie additional database columns) to solve
>>>>>>>>>>>> cases where
>>>>>>>>>>>> >    they need uniqueness across data that isn't in the
>>>>>>>>>>>> database. In a worst
>>>>>>>>>>>> >    case scenario, they can't use the pulpcore solution and
>>>>>>>>>>>> just have to
>>>>>>>>>>>> >    roll their own.
>>>>>>>>>>>> What I wrote probably sounded too pessimistic. You are right, in
>>>>>>>>>>>> most cases that should be doable.
>>>>>>>>>>>> I agree that we could have a simple default solution that just
>>>>>>>>>>>> requires to specify a couple of field names in the easiest
>>>>>>>>>>>> case.  As you
>>>>>>>>>>>> say, it should be possible use custom logic in a plugin if
>>>>>>>>>>>> required.
>>>>>>>>>>>> Here is the case I was thinking of that it can't handle:
>>>>>>>>>>>> In pulp_file, a uniqueness constraint on "relative_path" would
>>>>>>>>>>>> allow
>>>>>>>>>>>> content units "a" and "a/b" to be in a repo version.
>>>>>>>>>>>> However, we may want file repos to be representable on an
>>>>>>>>>>>> actual file
>>>>>>>>>>>> system (e.g. when exporting them as tar files).  For the repo
>>>>>>>>>>>> above,
>>>>>>>>>>>> this does not work, as "a" can't be a file and a directory at
>>>>>>>>>>>> the
>>>>>>>>>>>> same time on a standard Unix file system.
>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>> Pulp-dev mailing list
>>>>>>>>>>>> Pulp-dev at redhat.com
>>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/pulp-dev
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> Pulp-dev mailing list
>>>>>>>>>>> Pulp-dev at redhat.com
>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/pulp-dev
>>>>>>>>>> _______________________________________________
>>>>>>>>>> Pulp-dev mailing list
>>>>>>>>>> Pulp-dev at redhat.com
>>>>>>>>>> https://www.redhat.com/mailman/listinfo/pulp-dev
>>>>>>>>> _______________________________________________
>>>>>>>>> Pulp-dev mailing list
>>>>>>>>> Pulp-dev at redhat.com
>>>>>>>>> https://www.redhat.com/mailman/listinfo/pulp-dev
>>>>>>>> _______________________________________________
>>>>>>>> Pulp-dev mailing list
>>>>>>>> Pulp-dev at redhat.com
>>>>>>>> https://www.redhat.com/mailman/listinfo/pulp-dev
>>>>>>> _______________________________________________
>>>>>> Pulp-dev mailing list
>>>>>> Pulp-dev at redhat.com
>>>>>> https://www.redhat.com/mailman/listinfo/pulp-dev
>>>>> _______________________________________________
>>>> Pulp-dev mailing list
>>>> Pulp-dev at redhat.com
>>>> https://www.redhat.com/mailman/listinfo/pulp-dev
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/pulp-dev/attachments/20190730/cddfd805/attachment.htm>

More information about the Pulp-dev mailing list