[Pulp-dev] Uniqueness constraints on Content in Pulp 3

David Davis daviddavis at redhat.com
Tue Nov 13 19:38:33 UTC 2018


I want to point out that the RPM example is not correct. RPMs are unique in
Pulp by checksum (aka pkgId in our code and createrepo_c):

https://github.com/pulp/pulp_rpm/blob/44f97560533379ad8680055edff9c3c5bd4e859f/pulp_rpm/app/models.py#L223

Therefore Pulp can store two packages with the same
name-epoch-version-release-arch (NEVRA), as you would in the case where
there is a signed and an unsigned RPM with the same NEVRA.
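To make that concrete, here is a minimal, self-contained sketch (not Pulp's
actual code; the field names just mirror the pulp_rpm uniqueness fields
linked above, and the package data is made up) showing how including pkgId
in the natural key lets a signed and an unsigned build of the same NEVRA
coexist:

```python
from hashlib import sha256

def natural_key(pkg):
    # NEVRA plus checksum, mirroring the pulp_rpm uniqueness fields
    return (pkg["name"], pkg["epoch"], pkg["version"], pkg["release"],
            pkg["arch"], pkg["checksum_type"], pkg["pkgId"])

unsigned = {"name": "acme-utils", "epoch": "0", "version": "1.0",
            "release": "1", "arch": "noarch", "checksum_type": "sha256",
            "pkgId": sha256(b"unsigned rpm bytes").hexdigest()}
# Signing changes the file bytes, hence the checksum, hence pkgId
signed = dict(unsigned, pkgId=sha256(b"signed rpm bytes").hexdigest())

store = {}  # stand-in for the content table: natural key -> unit
for pkg in (unsigned, signed):
    store.setdefault(natural_key(pkg), pkg)

# Same NEVRA, different pkgId: both units are stored
assert len(store) == 2
# On NEVRA alone the two builds would collide into one key
assert len({natural_key(p)[:5] for p in (unsigned, signed)}) == 1
```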

David


On Thu, Nov 8, 2018 at 4:16 PM Simon Baatz <gmbnomis at gmail.com> wrote:

> On Tue, Nov 06, 2018 at 11:40:35AM -0500, Brian Bouterse wrote:
> >    These are great questions. I'll try to keep my responses short to
> >    promote more discussion.
> >    On Mon, Nov 5, 2018 at 3:21 PM Simon Baatz <gmbnomis at gmail.com>
> >    wrote:
> >
> >      I apologize for the lengthy post, but I did not know where to file
> >      an issue for
> >      this. It is a generic problem affecting most Pulp 3 plugins.
> >      I am puzzled for some time now about the natural keys used for
> >      content in
> >      plugins. Examples are:
> >      pulp_python: 'filename'
> >      pulp_ansible: 'version', 'role' (for role: 'namespace', 'name')
> >      pulp_rpm (RPM package): 'name', 'epoch', 'version', 'release',
> >      'arch', 'checksum_type', 'pkgId'
> >      pulp_cookbook:  'name', 'version'
> >      These look like keys that make sense for content in a single repo
> >      (version), but
> >      not necessarily for content in a per plugin pool of content. In an
> >      ideal world,
> >      these keys are globally unique, i.e. there is only a single
> >      "utils-0.9.0" Python
> >      module world-wide that refers to the same artifacts as the
> >      "utils-0.9.0" module on
> >      PyPi. But, as far as I know, the world is far from ideal, especially
> >      in an
> >      enterprise setting...
> >
> >    Agreed. This uniqueness is what allows Pulp to recognize and
> >    deduplicate content in its database. On the filesystem the content
> >    addressable storage will store identical assets only once, but if Pulp
> >    couldn't recognize "utils-0.9.0" from one repo as the same as
> >    "utils-0.9.0" then each sync/upload makes all new content units each
> >    time.
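A toy model of that recognition step (not Pulp's implementation; it
shrinks the natural key to just name and version for brevity, and the
`get_or_create` helper is hypothetical):

```python
# On sync/upload, look up each incoming unit by its natural key and reuse
# the existing row instead of creating a duplicate.
existing = {("utils", "0.9.0"): {"id": 1, "name": "utils", "version": "0.9.0"}}

def get_or_create(unit, db):
    key = (unit["name"], unit["version"])
    if key in db:
        return db[key], False  # recognized: reuse the existing content unit
    db[key] = dict(unit, id=len(db) + 1)
    return db[key], True       # unknown: create a new one

unit, created = get_or_create({"name": "utils", "version": "0.9.0"}, existing)
assert not created and unit["id"] == 1  # "utils-0.9.0" was deduplicated
```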
> >
> >      With the current implementation, the following scenarios could
> >      happen if I got
> >      it right:
> >      1. In Acme Corp, a team develops a Python module/Ansible role/Chef
> >      cookbook
> >         called "acme_utils", which is part of a repo on a Pulp instance.
> >      Another team
> >         using different repos happens to choose the same name for their
> >      unrelated
> >         utility package. They may not be able to create a content unit if
> >      they use
> >         e.g. the same version or file name.
> >
> >    I agree this is an issue
> >
> >      2. A team happens to choose a name that is already known in
> >         PyPi/Galaxy/Supermarket. (Or, someone posts a new name on
> >         PyPi/Galaxy/Supermarket that happens to be in use in the company
> >      for years).
> >         Then, as above, the team may not be able to create content units
> >      for their
> >         own artifacts.
> >
> >    I agree this is an issue
> >
> >         Additionally, *very ugly* things may happen during a sync. The
> >      current
> >         QueryExistingContentUnits stage may decide that, based on the
> >      natural key,
> >         completely unrelated content units are already present. The stage
> >      just puts
> >         them into the new repo version.
> >
> >    I agree this is an issue
> >
> >         Example for pulp_python:
> >         Somebody does something very stupid (or very sinister):
> >         (The files "Django-1.11.16-py2.py3-none-any.whl" and
> >      "Django-1.11.16.tar.gz" need
> >         to be in the current directory.)
> >      export ARTIFACT_HREF=$(http --form POST :8000/pulp/api/v3/artifacts/
> >      file@./Django-1.11.16-py2.py3-none-any.whl | jq -r '._href')
> >      http POST :8000/pulp/api/v3/content/python/packages/
> >      artifact=$ARTIFACT_HREF filename=Django-2.0-py3-none-any.whl
> >      export ARTIFACT_HREF=$(http --form POST :8000/pulp/api/v3/artifacts/
> >      file@./Django-1.11.16.tar.gz | jq -r '._href')
> >      http POST :8000/pulp/api/v3/content/python/packages/
> >      artifact=$ARTIFACT_HREF filename=Django-2.0.tar.gz
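In miniature, the reason this attack works (toy code, not Pulp's): a lookup
keyed only on filename has no way to notice that the stored unit's artifact
is not what the filename claims to be (the digest strings below are
placeholders):

```python
# A unit uploaded earlier with a lying filename: it claims to be Django 2.0
# but its artifact is really the 1.11.16 tarball.
existing = {"Django-2.0.tar.gz": {"artifact": "/pulp/api/v3/artifacts/2/",
                                  "sha256": "digest-of-1.11.16"}}

def query_existing(filename, units):
    # Matching only on the natural key ('filename') means the checksum the
    # remote publishes in its metadata is never consulted.
    return units.get(filename)

unit = query_existing("Django-2.0.tar.gz", existing)
# The sync stage reuses the mislabeled artifact as if it were Django 2.0:
assert unit["sha256"] == "digest-of-1.11.16"
```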
> >
> >    Yes, this is a problem, and here are some related thoughts. Core
> >    provides these generic CRUD urls so that plugin writers could get away
> >    with never writing a "one-shot" viewset that receives and parses a
> >    content unit via upload in one call. Using "one-shot" uploaders stops
> >    Pulp from receiving untrusted metadata from the user (as in your
> >    example), but unless the units coming in are also signed with a
> >    trusted key, the data of the file being uploaded could have been
> >    altered. Also, the same user likely configured that trusted key.
> >
> >         Somebody else wants to mirror Django 2.0 from PyPi
> >      (version_specifier: "==2.0"):
> >
> >    I think you've gotten to the crux of the issue here ... "someone
> >    else". Pulp is not currently able to handle real multi-tenancy. A
> >    multi-tenancy system would isolate each user's content or provide
> >    access to content via RBAC. We have gotten requests for multi-tenancy
> >    from several users who list it as a must-have. I want to frame this
> >    "user-to-user" sharing problem as actually a multi-tenancy problem.
> >
> >      http POST :8000/pulp/api/v3/repositories/ name=foo
> >      export REPO_HREF=$(http :8000/pulp/api/v3/repositories/ | jq -r
> >      '.results[] | select(.name == "foo") | ._href')
> >      http -v POST :8000/pulp/api/v3/remotes/python/  name='bar'
> >      url='https://pypi.org/' 'includes:=[{"name": "django",
> >      "version_specifier":"==2.0"}]'
> >      export REMOTE_HREF=$(http :8000/pulp/api/v3/remotes/python/ | jq -r
> >      '.results[] | select(.name == "bar") | ._href')
> >      http POST :8000$REMOTE_HREF'sync/' repository=$REPO_HREF
> >         Now the created repo version contains bogus content (Django
> >      1.11.16 instead of 2.0):
> >      $ http :8000/pulp/api/v3/repositories/1/versions/1/content/ | jq
> >      '.["results"] | map(.version, .artifact)'
> >      [
> >        "1.11.16",
> >        "/pulp/api/v3/artifacts/1/",
> >        "1.11.16",
> >        "/pulp/api/v3/artifacts/2/"
> >      ]
> >         A "not so dumb" version of this scenario may happen by error like
> >      this:
> >      export ARTIFACT_HREF=$(http --form POST :8000/pulp/api/v3/artifacts/
> >      file@./Django-1.11.15-py2.py3-none-any.whl | jq -r '._href')
> >      http POST :8000/pulp/api/v3/content/python/packages/
> >      artifact=$ARTIFACT_HREF filename=Django-1.11.15-py2.py3-none-any.whl
> >      #Forgot to do this: export ARTIFACT_HREF=$(http --form POST
> >      :8000/pulp/api/v3/artifacts/
> >      file@./Django-1.11.16-py2.py3-none-any.whl | jq -r '._href')
> >      http POST :8000/pulp/api/v3/content/python/packages/
> >      artifact=$ARTIFACT_HREF filename=Django-1.11.16-py2.py3-none-any.whl
> >         From now on, no synced repo version on the same Pulp instance
> >      will have a
> >         Django 1.11.16 wheel.
> >
> >    Similar observation here. If Pulp were a multi-tenant system, only
> >    that 1 user would have the screwed up content.
> >
> >      3. A team releases "module" version "2.0.0" by creating a new
> >      version of the
> >         "release" repo. However, packaging went wrong and the release
> >      needs to be
> >         rebuilt. Nobody wants to use version "2.0.1" for the new shiny
> >      release, it
> >         must be "2.0.0" (the version hasn't been published to the outside
> >      world yet).
> >         How does the team publish a new repo version containing the
> >      re-released
> >         module? (The best idea I have is: the team needs to create a new
> >      version
> >         without the content unit first. Then, find _all_ repo versions
> >      that still
> >         reference the content unit and delete them. Delete orphan content
> >      units.
> >         Create the new content unit and add it to a new repo version).
> >
> >    Yes, this is the same process we imagined users would go through. If
> >    version "2.0.0" is stored in multiple repos or repo versions, then to
> >    fully remove the bad one it's unavoidable to unassociate it from all
> >    repos and then run orphan cleanup. This process is also motivated by
> >    the use case I call "get this unit out of here", which is a situation
> >    like shellshock where: "we know this unit has a CVE in it, it's not
> >    safe to store in Pulp anymore". In this area I can't think of a better
> >    way, since removing and republishing a unit in a fully-automated way
> >    could have significant unexpected consequences on published content.
> >    It's probably do-able, but we would need to be careful.
> >
> >      4. A Pulp instance contains unsigned RPM content that will be signed
> >      for
> >         release. It is not possible to store the signed RPMs on the same
> >      instance.
> >         (Or alternatively, someone just forgot to sign the RPMs when
> >         importing/syncing. They will remain unsigned on subsequent syncs
> >      even if the
> >         remote repo has been fixed.)
> >
> >    I agree this is an issue, and we absolutely need to support the
> >    workflow.
> >
> >      (I did not check the behavior in Pulp 2, but most content types have
> >      fields like
> >      checksum/commit/repo_id/digest in their unit key.)
> >      Before discussing implementation options (changing key, adapt sync),
> >      I have the
> >      following questions:
> >      - Is the assessment of the scenarios outlined above correct?
> >
> >    Yes. The thing to keep in mind through all this is that Pulp needs to
> >    compose repos which, when presented to a client (e.g. pip, dnf, etc.),
> >    don't contain the same package twice. So in many ways the uniqueness
> >    is about playing that game up front during upload/sync and not on the
> >    backend at publish time. If it's important to do, then doing it early
> >    I think is key.
>
> Yes, it is.  But the Pulp plugins mentioned above play this game with
> tougher rules than actually required.  They enforce these rules across
> the entire content pool (for a plugin), not just across the content I
> am syncing or putting into a repo version.
>
> This makes it simpler for Pulp (plugins), as they do not need to
> ensure constraints when building a new repo version (depending on the
> content type, there may be constraints outside of the data model that
> need to be ensured).  But from the perspective of a user, this may
> lead to very surprising behavior across repositories and
> repository versions (e.g. although repos A and D are perfectly
> consistent on a per-repo view, repo A does not sync anymore because
> repo D happened to have a Python module with the same filename in a
> version from two months ago).
>
> (Interestingly, pulp_file plays with relaxed rules that do not ensure
> that a repo version can actually be published without clashing
> filenames.  OTOH, cross-repo effects cannot happen there.)
>
> >
> >      - Do you think it make sense to support (some of) these use cases?
> >
> >    Yes
> >
> >      - If so, are there plans to do so that I am not aware of?
> >
> >    No, except that I believe we need to consider multi-tenancy as a
> >    possible solution. There are no plans or discussion on that yet (until
> >    now!).
> >    I hope the plugin API, post-GA, introduces some signing feature
> >    allowing users to integrate Pulp with local and message-based signing
> >    options. This is related to your RPM signing point above.
>
> Although the scenarios outlined above are partly in a multi-tenancy
> setting, I don't think that missing support for multi-tenancy is at
> the core of the problem.  You are right in saying that multi-tenancy
> requires isolation on the content level.  But even without multi-tenancy
> (i.e.  with full access to all content), I expect a repo manager like
> Pulp to provide isolation on the repo level:
>
> 1. Unrelated content from other repos must not become visible in a repo
>
> If I sync two repos from different remotes, I expect the local repo
> versions to be mirrors of the respective upstreams.  I don't expect
> to find content from repo 1 in the mirror of repo 2 just because it
> resembles the actual content of repo 2 on a metadata level.
>
> Especially if the remotes provide cryptographic checksums in their
> metadata, I can't find any good justification why Pulp should just
> decide to ignore them and add unrelated content to a synced repo
> version.
>
> 2. Content of other repo versions does not impact my ability to
> create a repo version (as long as the created repo version is consistent)
>
> Basically, there are constraints on two levels:
>
> 1. Uniqueness constraints on overall content
> 2. Uniqueness constraints on content of a repo version
>
> Pulp core currently has no direct support for 2 AFAIK.  Some plugins seem
> to enforce these constraints on level 1, possibly affecting all repo
> versions.  'pulp_file' avoids the latter by having more lenient
> constraints on level 1, but it has no level 2 constraints and, thus,
> does not ensure that a repo version is publishable.
>
> Maybe we need support for repo version constraints?
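Such a repo-version-level ("level 2") check could look roughly like this
(a sketch only, assuming a per-type repo-version key such as filename;
`validate_repo_version` is not an existing Pulp API):

```python
def validate_repo_version(units):
    """Reject a candidate repo version that maps the same per-type key
    (here: filename) to two different artifacts."""
    seen = {}
    for unit in units:
        key = unit["filename"]
        if key in seen and seen[key] != unit["digest"]:
            raise ValueError(f"duplicate {key} in repo version")
        seen[key] = unit["digest"]

# Different repos may each hold their own 'acme_utils-1.0.tar.gz' ...
validate_repo_version([{"filename": "acme_utils-1.0.tar.gz", "digest": "a"}])
validate_repo_version([{"filename": "acme_utils-1.0.tar.gz", "digest": "b"}])

# ... but a single repo version containing both is rejected.
duplicate_rejected = False
try:
    validate_repo_version([
        {"filename": "acme_utils-1.0.tar.gz", "digest": "a"},
        {"filename": "acme_utils-1.0.tar.gz", "digest": "b"},
    ])
except ValueError:
    duplicate_rejected = True
assert duplicate_rejected
```

With a check like this at version-creation time, global (level 1) uniqueness could be relaxed without producing unpublishable repo versions.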
>
>
> _______________________________________________
> Pulp-dev mailing list
> Pulp-dev at redhat.com
> https://www.redhat.com/mailman/listinfo/pulp-dev
>