[Pulp-dev] Uniqueness constraints on Content in Pulp 3

Simon Baatz gmbnomis at gmail.com
Mon Nov 5 20:11:31 UTC 2018

I apologize for the lengthy post, but I did not know where to file an issue for
this. It is a generic problem affecting most Pulp 3 plugins.

I have been puzzled for some time now by the natural keys used for content in
plugins. Examples are:

pulp_python: 'filename'
pulp_ansible: 'version', 'role' (for role: 'namespace', 'name')
pulp_rpm (RPM package): 'name', 'epoch', 'version', 'release', 'arch', 'checksum_type', 'pkgId'
pulp_cookbook:  'name', 'version'

These look like keys that make sense for content in a single repo (version), but
not necessarily for content in a per-plugin pool of content. In an ideal world,
these keys would be globally unique, i.e. there would be only a single
"utils-0.9.0" Python module world-wide, referring to the same artifacts as the
"utils-0.9.0" module on PyPI. But, as far as I know, the world is far from ideal,
especially in an enterprise setting...

With the current implementation, the following scenarios could happen if I got
it right:

1. In Acme Corp, a team develops a Python module/Ansible role/Chef cookbook
   called "acme_utils", which is part of a repo on a Pulp instance. Another team
   using different repos happens to choose the same name for their unrelated
   utility package. They may not be able to create a content unit if they use
   e.g. the same version or file name.

2. A team happens to choose a name that is already known in
   PyPI/Galaxy/Supermarket. (Or someone publishes a new name on
   PyPI/Galaxy/Supermarket that has been in use in the company for years.)
   Then, as above, the team may not be able to create content units for their
   own artifacts.

   Additionally, *very ugly* things may happen during a sync. The current
   QueryExistingContentUnits stage may decide that, based on the natural key,
   completely unrelated content units are already present. The stage just puts
   them into the new repo version.

   Example for pulp_python:

   Somebody does something very stupid (or very sinister):

   (The files "Django-1.11.16-py2.py3-none-any.whl" and "Django-1.11.16.tar.gz" need
   to be in the current directory.)

export ARTIFACT_HREF=$(http --form POST :8000/pulp/api/v3/artifacts/ file@./Django-1.11.16-py2.py3-none-any.whl | jq -r '._href')
http POST :8000/pulp/api/v3/content/python/packages/ artifact=$ARTIFACT_HREF filename=Django-2.0-py3-none-any.whl
export ARTIFACT_HREF=$(http --form POST :8000/pulp/api/v3/artifacts/ file@./Django-1.11.16.tar.gz | jq -r '._href')
http POST :8000/pulp/api/v3/content/python/packages/ artifact=$ARTIFACT_HREF filename=Django-2.0.tar.gz

   Somebody else wants to mirror Django 2.0 from PyPI (version_specifier: "==2.0"):

http POST :8000/pulp/api/v3/repositories/ name=foo
export REPO_HREF=$(http :8000/pulp/api/v3/repositories/ | jq -r '.results[] | select(.name == "foo") | ._href')
http -v POST :8000/pulp/api/v3/remotes/python/  name='bar'  url='https://pypi.org/' 'includes:=[{"name": "django", "version_specifier":"==2.0"}]'
export REMOTE_HREF=$(http :8000/pulp/api/v3/remotes/python/ | jq -r '.results[] | select(.name == "bar") | ._href')
http POST :8000$REMOTE_HREF'sync/' repository=$REPO_HREF

   Now the created repo version contains bogus content (Django 1.11.16 instead of 2.0):

http :8000/pulp/api/v3/repositories/1/versions/1/content/ | jq '.["results"] | map(.version, .artifact)'

   A "not so dumb" version of this scenario may happen by error like this:

export ARTIFACT_HREF=$(http --form POST :8000/pulp/api/v3/artifacts/ file@./Django-1.11.15-py2.py3-none-any.whl | jq -r '._href')
http POST :8000/pulp/api/v3/content/python/packages/ artifact=$ARTIFACT_HREF filename=Django-1.11.15-py2.py3-none-any.whl
#Forgot to do this: export ARTIFACT_HREF=$(http --form POST :8000/pulp/api/v3/artifacts/ file@./Django-1.11.16-py2.py3-none-any.whl | jq -r '._href')
http POST :8000/pulp/api/v3/content/python/packages/ artifact=$ARTIFACT_HREF filename=Django-1.11.16-py2.py3-none-any.whl

   From now on, no synced repo version on the same Pulp instance will have a
   Django 1.11.16 wheel.

3. A team releases "module" version "2.0.0" by creating a new version of the
   "release" repo. However, packaging went wrong and the release needs to be
   rebuilt. Nobody wants to use version "2.0.1" for the shiny new release; it
   must be "2.0.0" (the version hasn't been published to the outside world yet).
   How does the team publish a new repo version containing the re-released
   module? (The best idea I have is: the team first creates a new repo version
   without the content unit, then finds _all_ repo versions that still
   reference the content unit and deletes them, deletes the orphaned content
   units, and finally creates the new content unit and adds it to a new repo
   version.)

4. A Pulp instance contains unsigned RPM content that will be signed for
   release. It is not possible to store the signed RPMs on the same instance.
   (Or alternatively, someone just forgot to sign the RPMs when
   importing/syncing. They will remain unsigned on subsequent syncs even if the
   remote repo has been fixed.)
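
To make the sync problem from scenario 2 concrete, here is a minimal in-memory
sketch in Python. It is not Pulp code: the ContentUnit class, the
query_existing function and the digest strings are all made up for
illustration, and the only assumption is that the lookup compares the natural
key (here 'filename', as in pulp_python) and nothing else.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContentUnit:
    filename: str  # natural key, as in pulp_python
    sha256: str    # digest of the associated artifact (placeholder strings here)


def query_existing(pool, declared):
    """Hypothetical lookup: reuse a pooled unit if its natural key matches."""
    return pool.get(declared.filename, declared)


# Pool after the mislabeled upload: the 1.11.16 sdist stored under a 2.0 name.
pool = {
    "Django-2.0.tar.gz": ContentUnit("Django-2.0.tar.gz", "digest-of-1.11.16"),
}

# What the remote declares during the sync of "==2.0".
remote = ContentUnit("Django-2.0.tar.gz", "digest-of-2.0")

# The lookup never compares digests, so the pooled unit wins.
synced = query_existing(pool, remote)
print(synced.sha256)
```

In this model the new repo version ends up with the pooled unit, so the digest
of the remote artifact never enters the picture.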

(I did not check the behavior in Pulp 2, but most content types have fields like
checksum/commit/repo_id/digest in their unit key.)
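
As a rough illustration of what a digest in the key changes, the following
Python sketch models a content pool with a uniqueness constraint on a
configurable set of fields. The ContentPool class, its create method and the
'sha256' field are hypothetical and only serve this example; the
'name'/'version' key is the one listed above for pulp_cookbook.

```python
class ContentPool:
    """Toy content pool enforcing uniqueness on a tuple of key fields."""

    def __init__(self, key_fields):
        self.key_fields = key_fields
        self.units = {}

    def create(self, **fields):
        key = tuple(fields[name] for name in self.key_fields)
        if key in self.units:
            raise ValueError(f"duplicate key: {key}")
        self.units[key] = fields
        return key


# Key without a digest: the second, unrelated "acme_utils" is rejected.
by_name = ContentPool(("name", "version"))
by_name.create(name="acme_utils", version="1.0.0", sha256="aaa")
try:
    by_name.create(name="acme_utils", version="1.0.0", sha256="bbb")
    rejected = False
except ValueError:
    rejected = True

# Key including a digest: both units can live in the same pool.
by_digest = ContentPool(("name", "version", "sha256"))
by_digest.create(name="acme_utils", version="1.0.0", sha256="aaa")
by_digest.create(name="acme_utils", version="1.0.0", sha256="bbb")
```

With the digest in the key, the pool can no longer guarantee that a single
repo version contains only one "acme_utils 1.0.0", so that check would have to
happen somewhere else.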

Before discussing implementation options (changing the key, adapting sync), I
have the following questions:

- Is the assessment of the scenarios outlined above correct?
- Do you think it makes sense to support (some of) these use cases?
- If so, are there plans to do so that I am not aware of?
