[Pulp-dev] Uniqueness constraints on Content in Pulp 3

Simon Baatz gmbnomis at gmail.com
Thu Nov 8 20:49:56 UTC 2018

On Tue, Nov 06, 2018 at 11:40:35AM -0500, Brian Bouterse wrote:
>    These are great questions. I'll try to keep my responses short to
>    promote more discussion.
>    On Mon, Nov 5, 2018 at 3:21 PM Simon Baatz <gmbnomis at gmail.com>
>    wrote:
>      I apologize for the lengthy post, but I did not know where to file
>      an issue for
>      this. It is a generic problem affecting most Pulp 3 plugins.
>      I have been puzzled for some time now about the natural keys used for
>      content in
>      plugins. Examples are:
>      pulp_python: 'filename'
>      pulp_ansible: 'version', 'role' (for role: 'namespace', 'name')
>      pulp_rpm (RPM package): 'name', 'epoch', 'version', 'release',
>      'arch', 'checksum_type', 'pkgId'
>      pulp_cookbook:  'name', 'version'
>      These look like keys that make sense for content in a single repo
>      (version), but
>      not necessarily for content in a per plugin pool of content. In an
>      ideal world,
>      these keys are globally unique, i.e. there is only a single
>      "utils-0.9.0" Python
>      module world-wide that refers to the same artifacts as the
>      "utils-0.9.0" module on
>      PyPi. But, as far as I know, the world is far from ideal, especially
>      in an
>      enterprise setting...
>    Agreed. This uniqueness is what allows Pulp to recognize and
>    deduplicate content in its database. On the filesystem the content
>    addressable storage will store identical assets only once, but if Pulp
>    couldn't recognize "utils-0.9.0" from one repo as the same as
>    "utils-0.9.0" then each sync/upload makes all new content units each
>    time.
>      With the current implementation, the following scenarios could
>      happen if I got
>      it right:
>      1. In Acme Corp, a team develops a Python module/Ansible role/Chef
>      cookbook
>         called "acme_utils", which is part of a repo on a Pulp instance.
>      Another team
>         using different repos happens to choose the same name for their
>      unrelated
>         utility package. They may not be able to create a content unit if
>      they use
>         e.g. the same version or file name.
>    I agree this is an issue
>      2. A team happens to choose a name that is already known in
>         PyPi/Galaxy/Supermarket. (Or, someone posts a new name on
>         PyPi/Galaxy/Supermarket that happens to be in use in the company
>      for years).
>         Then, as above, the team may not be able to create content units
>      for their
>         own artifacts.
>    I agree this is an issue
>         Additionally, *very ugly* things may happen during a sync. The
>      current
>         QueryExistingContentUnits stage may decide that, based on the
>      natural key,
>         completely unrelated content units are already present. The stage
>      just puts
>         them into the new repo version.
>    I agree this is an issue
>         Example for pulp_python:
>         Somebody does something very stupid (or very sinister):
>         (The files "Django-1.11.16-py2.py3-none-any.whl" and
>      "Django-1.11.16.tar.gz" need
>         to be in the current directory.)
>      export ARTIFACT_HREF=$(http --form POST :8000/pulp/api/v3/artifacts/
>      file@./Django-1.11.16-py2.py3-none-any.whl | jq -r '._href')
>      http POST :8000/pulp/api/v3/content/python/packages/
>      artifact=$ARTIFACT_HREF filename=Django-2.0-py3-none-any.whl
>      export ARTIFACT_HREF=$(http --form POST :8000/pulp/api/v3/artifacts/
>      file@./Django-1.11.16.tar.gz | jq -r '._href')
>      http POST :8000/pulp/api/v3/content/python/packages/
>      artifact=$ARTIFACT_HREF filename=Django-2.0.tar.gz
>    Yes, this is a problem, and here are some related thoughts. Core
>    provides these generic CRUD urls so that plugin writers could get away
>    with never writing a "one-shot" viewset that receives and parses a
>    content unit via upload in one call. Using "one-shot" uploaders stops
>    receiving untrusted metadata from the user (as in your example), but
>    unless the units coming in are also signed with a trusted key, the data
>    of the file being uploaded could have been altered. Also the same user
>    likely configured that trusted key.
>         Somebody else wants to mirror Django 2.0 from PyPi
>      (version_specifier: "==2.0"):
>    I think you've gotten to the crux of the issue here ... "someone else".
>    Pulp is not currently able to handle real multi-tenancy. A
>    multi-tenancy system would isolate each user's content or provide access
>    to content via RBAC. We have gotten requests for multi-tenancy from
>    several users who list it as a must-have. I want to connect this
>    "user-to-user" sharing problem as actually a multi-tenancy problem.
>      http POST :8000/pulp/api/v3/repositories/ name=foo
>      export REPO_HREF=$(http :8000/pulp/api/v3/repositories/ | jq -r
>      '.results[] | select(.name == "foo") | ._href')
>      http -v POST :8000/pulp/api/v3/remotes/python/  name='bar'
>      url='https://pypi.org/' 'includes:=[{"name": "django",
>      "version_specifier":"==2.0"}]'
>      export REMOTE_HREF=$(http :8000/pulp/api/v3/remotes/python/ | jq -r
>      '.results[] | select(.name == "bar") | ._href')
>      http POST :8000$REMOTE_HREF'sync/' repository=$REPO_HREF
>         Now the created repo version contains bogus content (Django
>      1.11.16 instead of 2.0):
>      $ http :8000/pulp/api/v3/repositories/1/versions/1/content/ | jq
>      '.["results"] | map(.version, .artifact)'
>      [
>        "1.11.16",
>        "/pulp/api/v3/artifacts/1/",
>        "1.11.16",
>        "/pulp/api/v3/artifacts/2/"
>      ]
>         A "not so dumb" version of this scenario may happen by error like
>      this:
>      export ARTIFACT_HREF=$(http --form POST :8000/pulp/api/v3/artifacts/
>      file@./Django-1.11.15-py2.py3-none-any.whl | jq -r '._href')
>      http POST :8000/pulp/api/v3/content/python/packages/
>      artifact=$ARTIFACT_HREF filename=Django-1.11.15-py2.py3-none-any.whl
>      #Forgot to do this: export ARTIFACT_HREF=$(http --form POST
>      :8000/pulp/api/v3/artifacts/
>      file@./Django-1.11.16-py2.py3-none-any.whl | jq -r '._href')
>      http POST :8000/pulp/api/v3/content/python/packages/
>      artifact=$ARTIFACT_HREF filename=Django-1.11.16-py2.py3-none-any.whl
>         From now on, no synced repo version on the same Pulp instance
>      will have a
>         Django 1.11.16 wheel.
>    Similar observation here. If Pulp were a multi-tenant system, only that
>    1 user would have the screwed up content.
>      3. A team releases "module" version "2.0.0" by creating a new
>      version of the
>         "release" repo. However, packaging went wrong and the release
>      needs to be
>         rebuilt. Nobody wants to use version "2.0.1" for the new shiny
>      release, it
>         must be "2.0.0" (the version hasn't been published to the outside
>      world yet).
>         How does the team publish a new repo version containing the
>      re-released
>         module? (The best idea I have is: the team needs to create a new
>      version
>         without the content unit first. Then, find _all_ repo versions
>      that still
>         reference the content unit and delete them. Delete orphan content
>      units.
>         Create the new content unit and add it to a new repo version).
>    Yes this is the same process we imagined users would go through. If
>    version "2.0.0" is stored in multiple repos or repo versions, to fully
>    remove the bad one it's unavoidable to unassociate from all repos and
>    then orphan cleanup. This process is also motivated by the use case I
>    call "get this unit out of here" which is a situation like shellshock
>    where: "we know this unit has a CVE in it, it's not safe to store in
>    Pulp anymore". In this area I can't think of a better way since
>    removing-republishing a unit in a fully-automated way could have
>    significant unexpected consequences on published content. It's probably
>    do-able but we would need to be careful.
>      4. A Pulp instance contains unsigned RPM content that will be signed
>      for
>         release. It is not possible to store the signed RPMs on the same
>      instance.
>         (Or alternatively, someone just forgot to sign the RPMs when
>         importing/syncing. They will remain unsigned on subsequent syncs
>      even if the
>         remote repo has been fixed.)
>    I agree this is an issue, and we absolutely need to support the
>    workflow.
>      (I did not check the behavior in Pulp 2, but most content types have
>      fields like
>      checksum/commit/repo_id/digest in their unit key.)
>      Before discussing implementation options (changing key, adapt sync),
>      I have the
>      following questions:
>      - Is the assessment of the scenarios outlined above correct?
>    Yes. The thing to keep in mind through all this is that Pulp needs to
>    compose repos which when presented to a client, e.g. pip, dnf, etc
>    don't contain the same package twice. So in many ways the uniqueness is
>    about playing that game up front during upload/sync and not on the
>    backend during publish time. If it's important to do then doing it
>    early I think is key.

Yes, it is.  But the Pulp plugins mentioned above play this game with
tougher rules than actually required.  They enforce that any subset
of the entire content pool (for a plugin) plays by these rules,
not just the content I am syncing or putting into a repo version.

This makes it simpler for Pulp (plugins), as they do not need to
ensure constraints when building a new repo version (depending on the
content type there may be constraints outside of the data model that
need to be ensured).  But from the perspective of a user this may
lead to very surprising behavior across repositories and
repository versions (e.g. although repo A and D are perfectly
consistent on a per repo view, repo A does not sync anymore because
repo D happened to have a python module with the same filename in a
version from two months ago).
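
To make the cross-repo effect concrete, here is a minimal sketch in plain
Python.  It is not Pulp code: `Unit`, `ContentPool` and `get_or_create` are
hypothetical names, and the digests are placeholders.  It only assumes what
is described above, namely one content pool per plugin whose natural key
(here the filename, as in pulp_python) is unique pool-wide and is the only
thing consulted when looking for existing content.

```python
# Hypothetical model for illustration; not Pulp's actual implementation.
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    filename: str  # natural key, as in pulp_python
    sha256: str    # digest of the artifact behind the unit (placeholder)


class ContentPool:
    """One pool per plugin; the natural key is unique across the pool."""

    def __init__(self):
        self._by_key = {}

    def get_or_create(self, filename, sha256):
        # Looks at the natural key only and never compares digests.
        existing = self._by_key.get(filename)
        if existing is not None:
            return existing
        unit = Unit(filename, sha256)
        self._by_key[filename] = unit
        return unit


pool = ContentPool()

# Repo D received a unit with this filename two months ago.
repo_d = [pool.get_or_create("acme_utils-1.0.tar.gz", "aaaa")]

# Repo A syncs an unrelated upstream file that has the same filename.
repo_a = [pool.get_or_create("acme_utils-1.0.tar.gz", "bbbb")]

# Repo A ends up with repo D's artifact instead of its upstream's.
print(repo_a[0].sha256)
```

Both repos are consistent when looked at individually; the surprise comes
only from the pool-wide key.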

(Interestingly, pulp_file plays with relaxed rules that do not ensure that a
repo version can actually be published without clashing filenames.
OTOH, cross-repo effects cannot happen there.)

>      - Do you think it makes sense to support (some of) these use cases?
>    Yes
>      - If so, are there plans to do so that I am not aware of?
>    No, except that I believe we need to consider multi-tenancy as a
>    possible solution. There are no plans or discussion on that yet (until
>    now!).
>    I hope the plugin API post GA introduces some signing feature allowing
>    users to integrate pulp w/ local and message based signing options.
>    This is related to your RPM signing point above

Although the scenarios outlined above are partly set in a multi-tenancy
context, I don't think that missing support for multi-tenancy is at
the core of the problem.  You are right in saying that multi-tenancy
requires isolation at the content level.  But even without multi-tenancy
(i.e.  with full access to all content), I expect a repo manager like
Pulp to provide isolation at the repo level:

1. Unrelated content from other repos must not become visible in a repo

If I sync two repos from different remotes, I expect the local repo
versions to be mirrors of the respective upstreams.  I don't expect
to find content from repo 1 in the mirror of repo 2 just because it
resembles the actual content of repo 2 on a metadata level.

In particular, if the remotes provide cryptographic checksums in their
metadata, I can't find any good justification for Pulp to just
ignore them and add unrelated content to a synced repo.
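
As a rough sketch of what honoring the remote checksum could look like
(plain Python, not the actual QueryExistingContentUnits stage; the function
`find_existing`, the dict layout and the digest strings are made up for
illustration):

```python
def find_existing(pool, natural_key, remote_sha256):
    """Return a pool unit only if natural key AND digest match, else None."""
    unit = pool.get(natural_key)
    if unit is not None and unit["sha256"] == remote_sha256:
        return unit
    return None


# The pool already holds a mislabeled upload under this filename.
pool = {"Django-2.0.tar.gz": {"sha256": "digest-of-the-1.11.16-upload"}}

# The remote metadata advertises a different digest: no reuse.
miss = find_existing(pool, "Django-2.0.tar.gz", "digest-from-remote")

# Same key and same digest: safe to reuse the existing unit.
hit = find_existing(pool, "Django-2.0.tar.gz", "digest-of-the-1.11.16-upload")
```

On a miss the sync would have to create a new content unit, which a
pool-wide uniqueness constraint on the natural key alone does not allow.
So the check by itself only turns silent substitution into a visible
failure.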

2. Content of other repo versions does not impact my ability to
create a repo version (as long as the created repo version is consistent)

Basically, there are constraints on two levels:

1. Uniqueness constraints on overall content
2. Uniqueness constraints on content of a repo version

Pulp core currently has no direct support for 2 AFAIK.  Some plugins seem
to enforce these constraints on level 1, possibly affecting all repo
versions.  'pulp_file' avoids the latter by having more lenient
constraints on level 1, but it has no level 2 constraints and, thus,
does not ensure that a repo version is publishable.

Maybe we need support for repo version constraints?
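
A minimal sketch of what I mean by a repo version constraint, again in plain
Python with hypothetical names (`check_repo_version` is not an existing Pulp
API, and the digests are placeholders).  The assumption is that level 1
stays lenient, e.g. units are distinguished by digest in the pool, and the
stricter uniqueness rule is checked only when a repo version is built:

```python
from collections import Counter


def check_repo_version(units):
    """Level 2 check: filenames must be unique within ONE repo version."""
    counts = Counter(unit["filename"] for unit in units)
    clashes = sorted(name for name, n in counts.items() if n > 1)
    if clashes:
        raise ValueError("clashing filenames in repo version: %s" % clashes)
    return units


# Level 1 is lenient here: both units may live in the pool because
# their digests differ.
team_a = {"filename": "acme_utils-1.0.tar.gz", "sha256": "aaaa"}
team_b = {"filename": "acme_utils-1.0.tar.gz", "sha256": "bbbb"}

repo_a_version = check_repo_version([team_a])  # accepted
repo_b_version = check_repo_version([team_b])  # accepted

try:
    check_repo_version([team_a, team_b])  # both in one repo version
    clash_detected = False
except ValueError:
    clash_detected = True
```

With this split, the two teams no longer block each other, and a repo
version that could not be published is still rejected.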
