[Pulp-dev] Uniqueness constraints on Content in Pulp 3

Brian Bouterse bbouters at redhat.com
Tue Nov 6 16:40:35 UTC 2018

These are great questions. I'll try to keep my responses short to promote
more discussion.

On Mon, Nov 5, 2018 at 3:21 PM Simon Baatz <gmbnomis at gmail.com> wrote:

> I apologize for the lengthy post, but I did not know where to file an
> issue for this. It is a generic problem affecting most Pulp 3 plugins.
> I am puzzled for some time now about the natural keys used for content in
> plugins. Examples are:
> pulp_python: 'filename'
> pulp_ansible: 'version', 'role' (for role: 'namespace', 'name')
> pulp_rpm (RPM package): 'name', 'epoch', 'version', 'release', 'arch',
> 'checksum_type', 'pkgId'
> pulp_cookbook:  'name', 'version'
> These look like keys that make sense for content in a single repo
> (version), but not necessarily for content in a per plugin pool of
> content. In an ideal world, these keys are globally unique, i.e. there is
> only a single "utils-0.9.0" Python module world-wide that refers to the
> same artifacts as the "utils-0.9.0" module on PyPi. But, as far as I know,
> the world is far from ideal, especially in an enterprise setting...
Agreed. This uniqueness is what allows Pulp to recognize and deduplicate
content in its database. On the filesystem, the content-addressable storage
stores identical assets only once, but if Pulp couldn't recognize
"utils-0.9.0" from one repo as the same as "utils-0.9.0" from another, every
sync/upload would create all-new content units.
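
To make the trade-off concrete, here is a minimal sketch of natural-key
deduplication. It is not Pulp's actual code: ContentPool, PythonPackage and
the digests are hypothetical stand-ins, and only the 'filename' key is taken
from the pulp_python example quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PythonPackage:
    filename: str  # natural key used by pulp_python, per the list above
    sha256: str    # digest of the associated artifact (made-up values below)


class ContentPool:
    """Hypothetical per-plugin pool that is unique on the natural key only."""

    def __init__(self):
        self._by_key = {}

    def get_or_create(self, unit):
        # Returns (stored_unit, created); the digest is never compared.
        existing = self._by_key.get(unit.filename)
        if existing is not None:
            return existing, False
        self._by_key[unit.filename] = unit
        return unit, True


pool = ContentPool()
first, created_first = pool.get_or_create(PythonPackage("utils-0.9.0.tar.gz", "aaa"))
# The same file seen again from another repo: recognized and deduplicated.
again, created_again = pool.get_or_create(PythonPackage("utils-0.9.0.tar.gz", "aaa"))
# An unrelated file that shares the name: silently mapped to the first unit.
other, created_other = pool.get_or_create(PythonPackage("utils-0.9.0.tar.gz", "bbb"))
```

The second call shows the deduplication benefit; the third shows the
collision the scenarios below are about.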

> With the current implementation, the following scenarios could happen if I
> got it right:
> 1. In Acme Corp, a team develops a Python module/Ansible role/Chef cookbook
>    called "acme_utils", which is part of a repo on a Pulp instance. Another
>    team using different repos happens to choose the same name for their
>    unrelated utility package. They may not be able to create a content unit
>    if they use e.g. the same version or file name.
I agree this is an issue

> 2. A team happens to choose a name that is already known in
>    PyPi/Galaxy/Supermarket. (Or, someone posts a new name on
>    PyPi/Galaxy/Supermarket that happens to be in use in the company for
>    years). Then, as above, the team may not be able to create content units
>    for their own artifacts.
I agree this is an issue

>    Additionally, *very ugly* things may happen during a sync. The current
>    QueryExistingContentUnits stage may decide that, based on the natural
>    key, completely unrelated content units are already present. The stage
>    just puts them into the new repo version.

I agree this is an issue

>    Example for pulp_python:
>    Somebody does something very stupid (or very sinister):
>    (The files "Django-1.11.16-py2.py3-none-any.whl" and
>    "Django-1.11.16.tar.gz" need to be in the current directory.)
> export ARTIFACT_HREF=$(http --form POST :8000/pulp/api/v3/artifacts/ file@./Django-1.11.16-py2.py3-none-any.whl | jq -r '._href')
> http POST :8000/pulp/api/v3/content/python/packages/ artifact=$ARTIFACT_HREF filename=Django-2.0-py3-none-any.whl
> export ARTIFACT_HREF=$(http --form POST :8000/pulp/api/v3/artifacts/ file@./Django-1.11.16.tar.gz | jq -r '._href')
> http POST :8000/pulp/api/v3/content/python/packages/ artifact=$ARTIFACT_HREF filename=Django-2.0.tar.gz
Yes, this is a problem, and here are some related thoughts. Core provides
these generic CRUD urls so that plugin writers could get away with never
writing a "one-shot" viewset that receives and parses a content unit via
upload in one call. Using "one-shot" uploaders stops receiving untrusted
metadata from the user (as in your example), but unless the units coming in
are also signed with a trusted key, the data of the file being uploaded
could have been altered. Also the same user likely configured that trusted

>    Somebody else wants to mirror Django 2.0 from PyPi
>    (version_specifier: "==2.0"):
I think you've gotten to the crux of the issue here ... "someone else".
Pulp is not currently able to handle real multi-tenancy. A multi-tenant
system would isolate each user's content or provide access to content via
RBAC. We have gotten requests for multi-tenancy from several users who list
it as a must-have. I see this "user-to-user" sharing problem as actually a
multi-tenancy problem.
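
As a rough illustration of the isolation idea only (nothing like this exists
in Pulp today; TenantScopedPool, the tenant names and the digests are all
hypothetical), uniqueness could be scoped to a tenant plus the natural key:

```python
from collections import defaultdict


class TenantScopedPool:
    """Hypothetical content pool whose uniqueness is (tenant, natural key)."""

    def __init__(self):
        self._units = defaultdict(dict)  # tenant -> {natural key: artifact digest}

    def add(self, tenant, natural_key, digest):
        # Keeps the first unit a tenant stored under a key; other tenants are unaffected.
        return self._units[tenant].setdefault(natural_key, digest)

    def lookup(self, tenant, natural_key):
        return self._units.get(tenant, {}).get(natural_key)


pool = TenantScopedPool()
# One user uploads the 1.11.16 wheel under a 2.0 file name (digests are made up).
pool.add("alice", "Django-2.0-py3-none-any.whl", "digest-of-1.11.16")
# Another user syncs the real 2.0 wheel into their own scope.
pool.add("bob", "Django-2.0-py3-none-any.whl", "digest-of-2.0")
```

With this scoping, the mislabeled upload stays visible only to the user who
made it. The cost is that identical content is no longer deduplicated across
tenants at the database level.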

> http POST :8000/pulp/api/v3/repositories/ name=foo
> export REPO_HREF=$(http :8000/pulp/api/v3/repositories/ | jq -r '.results[] | select(.name == "foo") | ._href')
> http -v POST :8000/pulp/api/v3/remotes/python/ name='bar' url='https://pypi.org/' 'includes:=[{"name": "django", "version_specifier":"==2.0"}]'
> export REMOTE_HREF=$(http :8000/pulp/api/v3/remotes/python/ | jq -r '.results[] | select(.name == "bar") | ._href')
> http POST :8000$REMOTE_HREF'sync/' repository=$REPO_HREF
>    Now the created repo version contains bogus content (Django 1.11.16
>    instead of 2.0):
> $ http :8000/pulp/api/v3/repositories/1/versions/1/content/ | jq '.["results"] | map(.version, .artifact)'
> [
>   "1.11.16",
>   "/pulp/api/v3/artifacts/1/",
>   "1.11.16",
>   "/pulp/api/v3/artifacts/2/"
> ]
>    A "not so dumb" version of this scenario may happen by error like this:
> export ARTIFACT_HREF=$(http --form POST :8000/pulp/api/v3/artifacts/ file@./Django-1.11.15-py2.py3-none-any.whl | jq -r '._href')
> http POST :8000/pulp/api/v3/content/python/packages/ artifact=$ARTIFACT_HREF filename=Django-1.11.15-py2.py3-none-any.whl
> #Forgot to do this: export ARTIFACT_HREF=$(http --form POST :8000/pulp/api/v3/artifacts/ file@./Django-1.11.16-py2.py3-none-any.whl | jq -r '._href')
> http POST :8000/pulp/api/v3/content/python/packages/ artifact=$ARTIFACT_HREF filename=Django-1.11.16-py2.py3-none-any.whl
>    From now on, no synced repo version on the same Pulp instance will have
>    a Django 1.11.16 wheel.
Similar observation here. If Pulp were a multi-tenant system, only that one
user would have the screwed-up content.

> 3. A team releases "module" version "2.0.0" by creating a new version of
>    the "release" repo. However, packaging went wrong and the release needs
>    to be rebuilt. Nobody wants to use version "2.0.1" for the new shiny
>    release, it must be "2.0.0" (the version hasn't been published to the
>    outside world yet).
>    How does the team publish a new repo version containing the re-released
>    module? (The best idea I have is: the team needs to create a new version
>    without the content unit first. Then, find _all_ repo versions that
>    still reference the content unit and delete them. Delete orphan content
>    units. Create the new content unit and add it to a new repo version).
Yes, this is the same process we imagined users would go through. If version
"2.0.0" is stored in multiple repos or repo versions, then to fully remove
the bad one it's unavoidable to unassociate it from all repos and then run
orphan cleanup. This process is also motivated by the use case I call "get
this unit out of here", which is a situation like shellshock where: "we know
this unit has a CVE in it, it's not safe to store in Pulp anymore". In this
area I can't think of a better way, since removing and republishing a unit in
a fully automated way could have significant unexpected consequences on
published content. It's probably doable, but we would need to be careful.
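
The manual process can be sketched on a toy model. This is an illustration
under stated assumptions, not Pulp's API: repo versions are modeled as plain
sets of unit names, and purge_unit is a hypothetical helper.

```python
def purge_unit(repo_versions, content, bad_unit):
    """Remove bad_unit everywhere, then drop orphaned units.

    repo_versions: dict of repo name -> list of sets of unit names (oldest first)
    content: set of all unit names stored on the instance
    """
    for versions in repo_versions.values():
        # Step 1: create a new latest version without the unit.
        if versions and bad_unit in versions[-1]:
            versions.append(versions[-1] - {bad_unit})
        # Step 2: delete every repo version that still references the unit.
        versions[:] = [v for v in versions if bad_unit not in v]
    # Step 3: orphan cleanup, i.e. drop units no repo version references.
    referenced = set().union(*(v for vs in repo_versions.values() for v in vs))
    content.intersection_update(referenced)
    return content


repo_versions = {
    "release": [{"module-1.0.0"}, {"module-1.0.0", "module-2.0.0"}],
    "staging": [{"module-2.0.0"}],
}
content = {"module-1.0.0", "module-2.0.0"}
purge_unit(repo_versions, content, "module-2.0.0")

# Step 4: upload the rebuilt unit under the same key and add it to a new version.
content.add("module-2.0.0")
repo_versions["release"].append(repo_versions["release"][-1] | {"module-2.0.0"})
```

Note that step 2 rewrites history in every affected repo, which is why doing
this automatically could surprise anyone relying on published content.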

> 4. A Pulp instance contains unsigned RPM content that will be signed for
>    release. It is not possible to store the signed RPMs on the same
>    instance. (Or alternatively, someone just forgot to sign the RPMs when
>    importing/syncing. They will remain unsigned on subsequent syncs even
>    if the remote repo has been fixed.)

I agree this is an issue, and we absolutely need to support the workflow.

> (I did not check the behavior in Pulp 2, but most content types have
> fields like checksum/commit/repo_id/digest in their unit key.)
> Before discussing implementation options (changing key, adapt sync), I
> have the following questions:
> - Is the assessment of the scenarios outlined above correct?
Yes. The thing to keep in mind through all this is that Pulp needs to
compose repos which, when presented to a client (e.g. pip, dnf, etc.), don't
contain the same package twice. So in many ways the uniqueness is about
playing that game up front during upload/sync and not on the backend at
publish time. If it's important to do, then I think doing it early is key.
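
For illustration, the "no package twice" rule for a single repo version could
be checked like this. It is a sketch on plain dicts, not Pulp code:
duplicate_keys is a hypothetical helper, the sha256 values are made up, and
the ('name', 'version') key is the pulp_cookbook one quoted earlier.

```python
from collections import Counter


def duplicate_keys(units, key_fields):
    """Return natural keys that occur more than once among the given units."""
    counts = Counter(tuple(unit[field] for field in key_fields) for unit in units)
    return sorted(key for key, count in counts.items() if count > 1)


# Hypothetical content of one repo version.
repo_version = [
    {"name": "acme_utils", "version": "1.0.0", "sha256": "aaa"},
    {"name": "acme_utils", "version": "1.0.0", "sha256": "bbb"},
    {"name": "acme_utils", "version": "1.1.0", "sha256": "ccc"},
]
conflicts = duplicate_keys(repo_version, ("name", "version"))
```

A global uniqueness constraint guarantees this list is always empty, at the
price of the collisions discussed above. A per-repo-version check would have
to run at upload/sync time to keep the same guarantee for clients.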

> - Do you think it makes sense to support (some of) these use cases?

> - If so, are there plans to do so that I am not aware of?
No, except that I believe we need to consider multi-tenancy as a possible
solution. There are no plans or discussion on that yet (until now!).
I hope the plugin API post-GA introduces some signing feature allowing
users to integrate Pulp with local and message-based signing options. This
is related to your RPM signing point above.
