[Pulp-dev] the "relative path" problem

Brian Bouterse bmbouter at redhat.com
Thu May 7 12:06:14 UTC 2020


I agree with that problem statement. pulp_file may want to have the same
Content at two different paths in different RepositoryVersions (or even the
same RepositoryVersion). Without this capability a user could never "move"
where content lives in a RepositoryVersion if it's already been placed in
any other RepositoryVersion.

Additionally pulp_maven may need to sync two repositories in the wild that
already contain the same content in two locations. I offer this as an example
not to pile-on, but because it's a multi-content artifact which I believe
we will need to consider also as we work towards a solution.

I've been spending time on developing a solution, but it needs more work so
it's not ready yet. Also other katello and galaxy_ng work continues to
pre-empt this, so it could take a while.

On Thu, May 7, 2020 at 3:39 AM Matthias Dellweg <mdellweg at redhat.com> wrote:

>> Users need to be able to store the same content unit at different
>> relative paths in different repository versions. This problem is not unique
>> to the RPM plugin. Do we agree about that?
> Yes, we agree. In pulp_deb relative_path is part of the content's
> natural_key to circumvent this problem. So this creates two content units
> that only differ in relative_path. At least they share the artifact.
>
> On Thu, May 7, 2020 at 2:06 AM Dennis Kliban <dkliban at redhat.com> wrote:
>
>> I'd like to provide a little bit more context for my previous email by
>> going back to the original problem statement:
>>
>> On Wed, Apr 1, 2020 at 9:23 AM Daniel Alley <dalley at redhat.com> wrote:
>>
>>> Problem:
>>>
>>> Currently, a relative_path is tied to content in Pulp. This means that
>>> if a content unit exists in two places within a repository or across
>>> repositories, it has to be stored as two separate content units. This
>>> creates redundant data and potential confusion for users.
>>>
>>> As a specific example, we need to support mirroring content in pulp_rpm
>>> <https://pulp.plan.io/issues/6353>. Currently, for each location at
>>> which a single package is stored, we’ll need to create a content unit. We
>>> could end up with several records representing a single package. Users may
>>> be confused about why they see multiple records for a package and they may
>>> have trouble for example deciding which content unit to copy.
>>>
>> Users need to be able to store the same content unit at different
>> relative paths in different repository versions. This problem is not unique
>> to the RPM plugin. Do we agree about that?
>>
>> I've been working on a potential solution that solves this problem in a
>> document[0]. It is a complicated change and the document does not fully
>> capture the plan yet. Feedback and help on the design is welcome.
>>
>> [0] https://hackmd.io/02KBjCD3Q0WP7p4ALwzhJw?edit
>>
>>
>> On Mon, May 4, 2020 at 4:11 PM Dennis Kliban <dkliban at redhat.com> wrote:
>>
>>> I've reached two conclusions while trying to formulate a solution:
>>>
>>> This problem needs to be solved at the repository version level.
>>> Repository membership needs to be tracked at the artifact level, and not
>>> content level as it is now.
>>>
>>> On Thu, Apr 30, 2020 at 1:11 PM Daniel Alley <dalley at redhat.com> wrote:
>>>
>>>> Cool, so the only difference is whether to try to store the
>>>> relationship in the DB, or leverage the fact that we already have the
>>>> metadata and just re-parse it.
>>>>
>>>> I know the latter approach has yet to be written up, but my concern
>>>> there is that adding another layer of indirection between "repository
>>>> version" and "content" is going to have an adverse impact on performance,
>>>> since it is already the most complex and demanding query we issue to the DB
>>>> and one of the most common and important.
>>>>
>>>> On Thu, Apr 30, 2020 at 12:50 PM David Davis <daviddavis at redhat.com>
>>>> wrote:
>>>>
>>>>> Yes but I was imagining the mapping would be stored not as Content but
>>>>> as a separate object. So we wouldn't use filename for the mapping (rather
>>>>> we'd use ContentArtifact pk) and  we wouldn't need to change
>>>>> ContentArtifact's relative_path at all. That said, I think your solution
>>>>> captures the idea and is better in some ways.
>>>>>
>>>>> Changing the RepositoryContent model to point to ContentArtifacts and
>>>>> store relative_paths is probably the best and most correct solution in
>>>>> theory. However, it's going to be painful to implement for both core and
>>>>> plugins.
>>>>>
>>>>> David
>>>>>
>>>>>
>>>>> On Thu, Apr 30, 2020 at 12:33 PM Daniel Alley <dalley at redhat.com>
>>>>> wrote:
>>>>>
>>>>>> @David Davis <daviddavis at redhat.com>  so this proposal would go
>>>>>> something like this, correct?:
>>>>>>
>>>>>> * For the signed metadata / exact mirror use-case we need to store
>>>>>> the repository metadata itself as a content unit inside the
>>>>>> RepositoryVersion anyway (because the hash must be equal)
>>>>>> * Because we have this metadata lying around, we can reference it at
>>>>>> publish time to discover the appropriate PublishedArtifact.relative_path
>>>>>>    * Create a map of "filename" -> "location_href" and look up the
>>>>>> filename of each RPM package to find the appropriate path
>>>>>>    * This should be pretty fast for the RPM plugin since createrepo_c
>>>>>> is doing all the hard work
>>>>>> * Data migration to ensure ContentArtifact.relative_path is only
>>>>>> storing the filename (and I would suggest we also change the name to
>>>>>> "filename")
>>>>>> * If metadata isn't present in the RepositoryVersion, then just tweak
>>>>>> the PublishedArtifact.relative_path so that it uses whichever our default
>>>>>> repo layout is
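
The lookup described in the bullets above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names, the "Packages/<first letter>/" fallback layout, and the plain list of location_href strings are all hypothetical stand-ins, not pulpcore or pulp_rpm APIs.

```python
import posixpath

# Hypothetical default layout directory; the thread does not say what the
# default repo layout actually is.
DEFAULT_LAYOUT_DIR = "Packages"


def build_location_map(location_hrefs):
    """Map each package filename to the location_href found in repo metadata."""
    location_map = {}
    for href in location_hrefs:
        filename = posixpath.basename(href)
        # Keep the first location seen for a filename (an arbitrary choice
        # made for this sketch).
        location_map.setdefault(filename, href)
    return location_map


def published_relative_path(filename, location_map=None):
    """Pick the relative path to publish a package at."""
    if location_map and filename in location_map:
        # Metadata is present: reuse the upstream location.
        return location_map[filename]
    # No metadata: fall back to the (assumed) default layout.
    return posixpath.join(DEFAULT_LAYOUT_DIR, filename[0].lower(), filename)
```

Note that a map keyed only by filename keeps a single location per filename, so a package listed at two locations in the same metadata would need a richer key than this sketch uses.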
>>>>>>
>>>>>> On Tue, Apr 28, 2020 at 11:41 AM David Davis <daviddavis at redhat.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Yes, that's correct. During our meeting we discussed two options:
>>>>>>> the first was to extend RepositoryContent to store relative path per
>>>>>>> ContentArtifact as storing a relative_path per Content won't work for
>>>>>>> multi-Artifact Content units.
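
To make the first option concrete, here is a minimal sketch of why the path has to hang off the ContentArtifact rather than the Content. It uses plain dataclasses as simplified stand-ins for the real models; the class RepositoryContentArtifact, its fields, and the helper function are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContentArtifact:
    """Simplified stand-in: links one artifact to one content unit."""
    pk: int
    content_pk: int
    artifact_sha256: str


@dataclass(frozen=True)
class RepositoryContentArtifact:
    """Hypothetical membership row: one relative_path per ContentArtifact
    per repository version."""
    repository_version: int
    content_artifact_pk: int
    relative_path: str


def paths_for_content(memberships, content_artifacts, repository_version, content_pk):
    """Return every relative path a content unit occupies in one repo version."""
    ca_pks = {ca.pk for ca in content_artifacts if ca.content_pk == content_pk}
    return sorted(
        m.relative_path
        for m in memberships
        if m.repository_version == repository_version
        and m.content_artifact_pk in ca_pks
    )
```

With a two-artifact content unit (say a .jar and a .pom), each repository version carries two rows, one per artifact, which a single relative_path per Content could not express.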
>>>>>>>
>>>>>>> An alternative that I pitched was to have plugins (or maybe even
>>>>>>> core someday) store this information outside RepositoryContent and then use
>>>>>>> this information during publishing to set relative_path on
>>>>>>> PublishedArtifacts. We'd have to modify the content app if we wanted to
>>>>>>> support pass through publications but I think asking plugins to use
>>>>>>> published artifacts in this case is warranted. That said, I don't think
>>>>>>> anyone else was keen on this idea.
>>>>>>>
>>>>>>> David
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Apr 28, 2020 at 10:30 AM Matthias Dellweg <
>>>>>>> mdellweg at redhat.com> wrote:
>>>>>>>
>>>>>>>> That is only used for passthrough publication afaik. If you publish
>>>>>>>> each content unit "by hand", you create a new relative path for each
>>>>>>>> published artifact. That is why it can be empty and the content can
>>>>>>>> still be published.
>>>>>>>>
>>>>>>>> On Tue, Apr 28, 2020 at 4:09 PM Daniel Alley <dalley at redhat.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> We realized in our discussion that the original proposal described
>>>>>>>>> in my email will not work, because "relative_path" ultimately describes the
>>>>>>>>> path of the published *artifacts* (not content), and for content
>>>>>>>>> types with multiple artifacts, storing this information in a field on
>>>>>>>>> RepositoryContent would not be possible.
>>>>>>>>>
>>>>>>>>> On Mon, Apr 27, 2020 at 6:08 PM Daniel Alley <dalley at redhat.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> There is a video call scheduled to discuss this issue tomorrow
>>>>>>>>>> (Tuesday April 28th) at 13:30 UTC (please convert to your local time).
>>>>>>>>>> https://meet.google.com/scy-csbx-qiu
>>>>>>>>>>
>>>>>>>>>> On Sat, Apr 25, 2020 at 7:02 AM David Davis <
>>>>>>>>>> daviddavis at redhat.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> I had a chance to think about this some more yesterday and
>>>>>>>>>>> wanted to email out my thoughts. I also think that this change sounds scary
>>>>>>>>>>> and will have a big impact on plugin writers so I thought of a couple
>>>>>>>>>>> alternatives:
>>>>>>>>>>>
>>>>>>>>>>> First, we could add a relative_path field to RepositoryContent
>>>>>>>>>>> instead of moving it there. This would be an optional field. It would be up
>>>>>>>>>>> to plugins to manage this field and they would still need to populate the
>>>>>>>>>>> relative_path field on ContentArtifact. But plugins could use this optional
>>>>>>>>>>> field to store relative paths per repository and then use this field when
>>>>>>>>>>> generating publications.
>>>>>>>>>>>
>>>>>>>>>>> The second alternative is one that is already laid out in the
>>>>>>>>>>> original email but to call it out again: it would be to not solve this in
>>>>>>>>>>> pulpcore. RPM would create its own object that would map content in a
>>>>>>>>>>> repository to relative_paths.
>>>>>>>>>>>
>>>>>>>>>>> David
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Apr 21, 2020 at 9:22 AM Quirin Pamp <pamp at atix.de>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I am not currently very well versed in the classes involved,
>>>>>>>>>>>> but moving relative_path around sounds slightly scary with the potential to
>>>>>>>>>>>> break things.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> As such, I would be interested to be kept in the loop as this
>>>>>>>>>>>> moves forward. (Mailing list once there is some movement is entirely
>>>>>>>>>>>> sufficient 😉)
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>
>>>>>>>>>>>> Quirin Pamp
>>>>>>>>>>>> ------------------------------
>>>>>>>>>>>> *From:* pulp-dev-bounces at redhat.com <
>>>>>>>>>>>> pulp-dev-bounces at redhat.com> on behalf of Ina Panova <
>>>>>>>>>>>> ipanova at redhat.com>
>>>>>>>>>>>> *Sent:* 21 April 2020 14:07:13
>>>>>>>>>>>> *To:* Daniel Alley <dalley at redhat.com>
>>>>>>>>>>>> *Cc:* Pulp-dev <pulp-dev at redhat.com>
>>>>>>>>>>>> *Subject:* Re: [Pulp-dev] the "relative path" problem
>>>>>>>>>>>>
>>>>>>>>>>>> Daniel,
>>>>>>>>>>>>
>>>>>>>>>>>> how about setting up a meeting and brainstorming the alternatives
>>>>>>>>>>>> and their pros/cons there?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --------
>>>>>>>>>>>> Regards,
>>>>>>>>>>>>
>>>>>>>>>>>> Ina Panova
>>>>>>>>>>>> Senior Software Engineer| Pulp| Red Hat Inc.
>>>>>>>>>>>>
>>>>>>>>>>>> "Do not go where the path may lead,
>>>>>>>>>>>>  go instead where there is no path and leave a trail."
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Apr 17, 2020 at 5:57 PM Daniel Alley <dalley at redhat.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Bump, this item needs to move forwards soon.  Does anyone have
>>>>>>>>>>>> any thoughts?
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Apr 1, 2020 at 9:40 AM Pavel Picka <ppicka at redhat.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>> I'd like to add one more question to this topic. Do you think
>>>>>>>>>>>> it is a blocker for PRs [0] & [1]? While testing [2] these features I
>>>>>>>>>>>> haven't run into a real-world example where two packages with exactly
>>>>>>>>>>>> the same name appear.
>>>>>>>>>>>> I think this is a 'must have' feature, but until we solve/decide
>>>>>>>>>>>> it we can have the two features working, maybe with a warning in the
>>>>>>>>>>>> docs for users that this can happen in some 'special' repositories.
>>>>>>>>>>>>
>>>>>>>>>>>> To address the topic directly: I like the proposed move to
>>>>>>>>>>>> 'RepositoryContent' and adding it to its uniqueness constraint (if I
>>>>>>>>>>>> understand correctly).
>>>>>>>>>>>>
>>>>>>>>>>>> [0] https://github.com/pulp/pulp_rpm/pull/1657
>>>>>>>>>>>> [1] https://github.com/pulp/pulp_rpm/pull/1642
>>>>>>>>>>>> [2] tested with centos 7, 8, opensuse and SLE repositories
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Apr 1, 2020 at 3:22 PM Daniel Alley <dalley at redhat.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> We'd like to start a discussion on the "relative path problem"
>>>>>>>>>>>> identified recently.
>>>>>>>>>>>> Problem:
>>>>>>>>>>>>
>>>>>>>>>>>> Currently, a relative_path is tied to content in Pulp. This
>>>>>>>>>>>> means that if a content unit exists in two places within a repository or
>>>>>>>>>>>> across repositories, it has to be stored as two separate content units.
>>>>>>>>>>>> This creates redundant data and potential confusion for users.
>>>>>>>>>>>>
>>>>>>>>>>>> As a specific example, we need to support mirroring content in
>>>>>>>>>>>> pulp_rpm <https://pulp.plan.io/issues/6353>. Currently, for
>>>>>>>>>>>> each location at which a single package is stored, we’ll need to create a
>>>>>>>>>>>> content unit. We could end up with several records representing a single
>>>>>>>>>>>> package. Users may be confused about why they see multiple records for a
>>>>>>>>>>>> package and they may have trouble for example deciding which content unit
>>>>>>>>>>>> to copy.
>>>>>>>>>>>> Proposed Solution:
>>>>>>>>>>>>
>>>>>>>>>>>> Move “relative_path” from its current location on
>>>>>>>>>>>> ContentArtifact, to RepositoryContent. This will require a sizable data
>>>>>>>>>>>> migration. It is possibly the case that in rare cases, repository versions
>>>>>>>>>>>> may change slightly due to deduplication.
>>>>>>>>>>>>
>>>>>>>>>>>> A repository-version-wide uniqueness constraint will be present
>>>>>>>>>>>> on “relative_path”, independently of any other repository uniqueness
>>>>>>>>>>>> constraints (repo_key_fields) defined by the plugin writer.
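
As a rough illustration of what a repository-version-wide uniqueness rule on relative_path would enforce, here is a small check written in plain Python. In practice this would presumably be a database constraint; the function name and the (relative_path, content_id) tuples are assumptions made for this sketch.

```python
from collections import defaultdict


def find_path_conflicts(entries):
    """Find relative paths claimed by more than one content unit.

    `entries` is an iterable of (relative_path, content_id) pairs describing
    a single repository version.
    """
    owners = defaultdict(set)
    for relative_path, content_id in entries:
        owners[relative_path].add(content_id)
    # A path is in conflict only when two different content units want it.
    return {path: sorted(ids) for path, ids in owners.items() if len(ids) > 1}
```

Under this rule the same content unit may appear at several paths in one repository version, but no path may be claimed by two different content units.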
>>>>>>>>>>>>
>>>>>>>>>>>> Modify the Stages API so that the relative_path can be
>>>>>>>>>>>> processed in the correct location – instead of “DeclarativeArtifact” it
>>>>>>>>>>>> will likely need to go on “DeclarativeContent”
>>>>>>>>>>>>
>>>>>>>>>>>> Remove “location_href” from the RPM Package content model – it
>>>>>>>>>>>> was never a true part of the RPM (file) metadata, it is derived from the
>>>>>>>>>>>> repository metadata. So storing it as a part of the Content unit doesn’t
>>>>>>>>>>>> entirely make sense.
>>>>>>>>>>>> Alternatives
>>>>>>>>>>>>
>>>>>>>>>>>> In most cases, a content unit will have a single relative path.
>>>>>>>>>>>> Creating a general solution to solve a one-off problem
>>>>>>>>>>>> is usually not a good idea. As an alternative, we could look at another
>>>>>>>>>>>> solution for mirroring content. One example might be to create a new object
>>>>>>>>>>>> (e.g. RpmRepoMirrorContentMapping) that maps content to specific paths
>>>>>>>>>>>> within a repo or repo version.
>>>>>>>>>>>> Questions
>>>>>>>>>>>>
>>>>>>>>>>>>    - How do we handle this in pulp_file? How are content units
>>>>>>>>>>>>    identified in pulp_file without relative_path?
>>>>>>>>>>>>       - Checksum?
>>>>>>>>>>>>       - How was this problem handled in Pulp 2?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Please weigh in if you have any input on potential problems
>>>>>>>>>>>> with the proposal, potential alternate solutions, or other insights or
>>>>>>>>>>>> questions!
>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>> Pulp-dev mailing list
>>>>>>>>>>>> Pulp-dev at redhat.com
>>>>>>>>>>>> https://www.redhat.com/mailman/listinfo/pulp-dev
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Pavel Picka
>>>>>>>>>>>> Red Hat
>>>>>>>>>>>>