<div dir="ltr"><div dir="ltr">I want to point out that the RPM example is not correct. RPMs are unique in Pulp by checksum (aka pkgId in our code and createrepo_c):<div><br></div><div><a href="https://github.com/pulp/pulp_rpm/blob/44f97560533379ad8680055edff9c3c5bd4e859f/pulp_rpm/app/models.py#L223">https://github.com/pulp/pulp_rpm/blob/44f97560533379ad8680055edff9c3c5bd4e859f/pulp_rpm/app/models.py#L223</a></div><div><br></div><div>Therefore Pulp can store two packages with the same name-epoch-version-arch (NEVRA) as you would in the case where there is a signed and unsigned RPM with the same NEVRA.<br clear="all"><div><div dir="ltr" class="gmail_signature"><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div><br></div><div>David<br></div></div></div></div></div></div></div></div><br></div></div></div><br><div class="gmail_quote"><div dir="ltr">On Thu, Nov 8, 2018 at 4:16 PM Simon Baatz <<a href="mailto:gmbnomis@gmail.com">gmbnomis@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">On Tue, Nov 06, 2018 at 11:40:35AM -0500, Brian Bouterse wrote:<br>
> These are great questions. I'll try to keep my responses short to<br>
> promote more discussion.<br>
> On Mon, Nov 5, 2018 at 3:21 PM Simon Baatz <[1]<a href="mailto:gmbnomis@gmail.com" target="_blank">gmbnomis@gmail.com</a>><br>
> wrote:<br>
> <br>
> I apologize for the lengthy post, but I did not know where to file<br>
> an issue for<br>
> this. It is a generic problem affecting most Pulp 3 plugins.<br>
> I have been puzzled for some time now about the natural keys used for<br>
> content in<br>
> plugins. Examples are:<br>
> pulp_python: 'filename'<br>
> pulp_ansible: 'version', 'role' (for role: 'namespace', 'name')<br>
> pulp_rpm (RPM package): 'name', 'epoch', 'version', 'release',<br>
> 'arch', 'checksum_type', 'pkgId'<br>
> pulp_cookbook: 'name', 'version'<br>
> These look like keys that make sense for content in a single repo<br>
> (version), but<br>
> not necessarily for content in a per plugin pool of content. In an<br>
> ideal world,<br>
> these keys are globally unique, i.e. there is only a single<br>
> "utils-0.9.0" Python<br>
> module world-wide that refers to the same artifacts as the<br>
> "utils-0.9.0" module on<br>
> PyPi. But, as far as I know, the world is far from ideal, especially<br>
> in an<br>
> enterprise setting...<br>
> <br>
> Agreed. This uniqueness is what allows Pulp to recognize and<br>
> deduplicate content in its database. On the filesystem the content<br>
> addressable storage will store identical assets only once, but if Pulp<br>
> couldn't recognize "utils-0.9.0" from one repo as the same as<br>
> "utils-0.9.0" then each sync/upload makes all new content units each<br>
> time.<br>
> <br>
> With the current implementation, the following scenarios could<br>
> happen if I got<br>
> it right:<br>
> 1. In Acme Corp, a team develops a Python module/Ansible role/Chef<br>
> cookbook<br>
> called "acme_utils", which is part of a repo on a Pulp instance.<br>
> Another team<br>
> using different repos happens to choose the same name for their<br>
> unrelated<br>
> utility package. They may not be able to create a content unit if<br>
> they use<br>
> e.g. the same version or file name.<br>
> <br>
> I agree this is an issue<br>
> <br>
> 2. A team happens to choose a name that is already known in<br>
> PyPi/Galaxy/Supermarket. (Or, someone posts a new name on<br>
> PyPi/Galaxy/Supermarket that happens to be in use in the company<br>
> for years).<br>
> Then, as above, the team may not be able to create content units<br>
> for their<br>
> own artifacts.<br>
> <br>
> I agree this is an issue<br>
> <br>
> Additionally, *very ugly* things may happen during a sync. The<br>
> current<br>
> QueryExistingContentUnits stage may decide that, based on the<br>
> natural key,<br>
> completely unrelated content units are already present. The stage<br>
> just puts<br>
> them into the new repo version.<br>
> <br>
> I agree this is an issue<br>
> <br>
> Example for pulp_python:<br>
> Somebody does something very stupid (or very sinister):<br>
> (The files "Django-1.11.16-py2.py3-none-any.whl" and<br>
> "Django-1.11.16.tar.gz" need<br>
> to be in the current directory.)<br>
> export ARTIFACT_HREF=$(http --form POST :8000/pulp/api/v3/artifacts/<br>
> file@./Django-1.11.16-py2.py3-none-any.whl | jq -r '._href')<br>
> http POST :8000/pulp/api/v3/content/python/packages/<br>
> artifact=$ARTIFACT_HREF filename=Django-2.0-py3-none-any.whl<br>
> export ARTIFACT_HREF=$(http --form POST :8000/pulp/api/v3/artifacts/<br>
> file@./Django-1.11.16.tar.gz | jq -r '._href')<br>
> http POST :8000/pulp/api/v3/content/python/packages/<br>
> artifact=$ARTIFACT_HREF filename=Django-2.0.tar.gz<br>
> <br>
> Yes, this is a problem, and here are some related thoughts. Core<br>
> provides these generic CRUD URLs so that plugin writers could get away<br>
> with never writing a "one-shot" viewset that receives and parses a<br>
> content unit via upload in one call. Using "one-shot" uploaders stops<br>
> Pulp from receiving untrusted metadata from the user (as in your example), but<br>
> unless the units coming in are also signed with a trusted key, the data<br>
> of the file being uploaded could have been altered. Also the same user<br>
> likely configured that trusted key.<br>
> <br>
> Somebody else wants to mirror Django 2.0 from PyPi<br>
> (version_specifier: "==2.0"):<br>
> <br>
> I think you've gotten to the crux of the issue here ... "someone else".<br>
> Pulp is not currently able to handle real multi-tenancy. A<br>
> multi-tenancy system would isolate each user's content or provide access<br>
> to content via RBAC. We have gotten requests for multi-tenancy from<br>
> several users who list it as a must-have. I want to frame this<br>
> "user-to-user" sharing problem as actually a multi-tenancy problem.<br>
> <br>
> http POST :8000/pulp/api/v3/repositories/ name=foo<br>
> export REPO_HREF=$(http :8000/pulp/api/v3/repositories/ | jq -r<br>
> '.results[] | select(.name == "foo") | ._href')<br>
> http -v POST :8000/pulp/api/v3/remotes/python/ name='bar'<br>
> url='[2]<a href="https://pypi.org/" rel="noreferrer" target="_blank">https://pypi.org/</a>' 'includes:=[{"name": "django",<br>
> "version_specifier":"==2.0"}]'<br>
> export REMOTE_HREF=$(http :8000/pulp/api/v3/remotes/python/ | jq -r<br>
> '.results[] | select(.name == "bar") | ._href')<br>
> http POST :8000$REMOTE_HREF'sync/' repository=$REPO_HREF<br>
> Now the created repo version contains bogus content (Django<br>
> 1.11.16 instead of 2.0):<br>
> $ http :8000/pulp/api/v3/repositories/1/versions/1/content/ | jq<br>
> '.["results"] | map(.version, .artifact)'<br>
> [<br>
> "1.11.16",<br>
> "/pulp/api/v3/artifacts/1/",<br>
> "1.11.16",<br>
> "/pulp/api/v3/artifacts/2/"<br>
> ]<br>
> A "not so dumb" version of this scenario may happen by error like<br>
> this:<br>
> export ARTIFACT_HREF=$(http --form POST :8000/pulp/api/v3/artifacts/<br>
> file@./Django-1.11.15-py2.py3-none-any.whl | jq -r '._href')<br>
> http POST :8000/pulp/api/v3/content/python/packages/<br>
> artifact=$ARTIFACT_HREF filename=Django-1.11.15-py2.py3-none-any.whl<br>
> #Forgot to do this: export ARTIFACT_HREF=$(http --form POST<br>
> :8000/pulp/api/v3/artifacts/<br>
> file@./Django-1.11.16-py2.py3-none-any.whl | jq -r '._href')<br>
> http POST :8000/pulp/api/v3/content/python/packages/<br>
> artifact=$ARTIFACT_HREF filename=Django-1.11.16-py2.py3-none-any.whl<br>
> From now on, no synced repo version on the same Pulp instance<br>
> will have a<br>
> Django 1.11.16 wheel.<br>
> <br>
> Similar observation here. If Pulp were a multi-tenant system, only that<br>
> one user would have the screwed-up content.<br>
> <br>
> 3. A team releases "module" version "2.0.0" by creating a new<br>
> version of the<br>
> "release" repo. However, packaging went wrong and the release<br>
> needs to be<br>
> rebuilt. Nobody wants to use version "2.0.1" for the new shiny<br>
> release, it<br>
> must be "2.0.0" (the version hasn't been published to the outside<br>
> world yet).<br>
> How does the team publish a new repo version containing the<br>
> re-released<br>
> module? (The best idea I have is: the team needs to create a new<br>
> version<br>
> without the content unit first. Then, find _all_ repo versions<br>
> that still<br>
> reference the content unit and delete them. Delete orphan content<br>
> units.<br>
> Create the new content unit and add it to a new repo version).<br>
> <br>
> Yes this is the same process we imagined users would go through. If<br>
> version "2.0.0" is stored in multiple repos or repo versions to fully<br>
> remove the bad one its unavoidable to unassociate from all repos and<br>
> then orphan cleanup. This process is also motivated by the use case I<br>
> call "get this unit out of here" which is a situation like shellshock<br>
> where: "we know this unit has a CVE in it, it's not safe to store in<br>
> Pulp anymore". In this area I can't think of a better way since<br>
> removing-republishing a unit in a fully-automated way could have<br>
> significant unexpected consequences on published content. It's probably<br>
> do-able but we would need to be careful.<br>
> <br>
> 4. A Pulp instance contains unsigned RPM content that will be signed<br>
> for<br>
> release. It is not possible to store the signed RPMs on the same<br>
> instance.<br>
> (Or alternatively, someone just forgot to sign the RPMs when<br>
> importing/syncing. They will remain unsigned on subsequent syncs<br>
> even if the<br>
> remote repo has been fixed.)<br>
> <br>
> I agree this is an issue, and we absolutely need to support the<br>
> workflow.<br>
> <br>
> (I did not check the behavior in Pulp 2, but most content types have<br>
> fields like<br>
> checksum/commit/repo_id/digest in their unit key.)<br>
> Before discussing implementation options (changing key, adapt sync),<br>
> I have the<br>
> following questions:<br>
> - Is the assessment of the scenarios outlined above correct?<br>
> <br>
> Yes. The thing to keep in mind through all this is that Pulp needs to<br>
> compose repos which, when presented to a client (e.g. pip, dnf, etc.),<br>
> don't contain the same package twice. So in many ways the uniqueness is<br>
> about playing that game up front during upload/sync and not on the<br>
> backend during publish time. If it's important to do then doing it<br>
> early I think is key.<br>
<br>
Yes, it is. But the Pulp plugins mentioned above play this game with<br>
tougher rules than actually required. They enforce that any subset<br>
of the entire content pool (for a plugin) plays by these rules,<br>
not just the content I am syncing or putting into a repo version.<br>
<br>
This makes it simpler for Pulp (plugins), as they do not need to<br>
ensure constraints when building a new repo version (depending on the<br>
content type there may be constraints outside of the data model that<br>
need to be ensured). But from the perspective of a user this may<br>
lead to very surprising behavior across repositories and<br>
repository versions (e.g. although repos A and D are perfectly<br>
consistent on a per-repo view, repo A does not sync anymore because<br>
repo D happened to have a Python module with the same filename in a<br>
version from two months ago).<br>
<br>
(Interestingly, pulp_file plays by relaxed rules that do not ensure that a<br>
repo version can actually be published without clashing filenames.<br>
OTOH, cross-repo effects cannot happen there.)<br>
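This cross-repo interaction can be made concrete with a toy model (plain Python with invented names; this is not actual Pulp code) of a plugin-wide content pool that is unique by filename only, as in the pulp_python example above:

```python
# Toy model of a plugin-wide content pool whose natural key is the filename
# alone.  All names here are invented for illustration.
content_pool = {}  # natural key (filename) -> artifact checksum

def create_content(filename, artifact_sha256):
    """Upload: the first writer wins for a given natural key."""
    return content_pool.setdefault(filename, artifact_sha256)

def sync(remote_metadata):
    """Sync: an existing unit with a matching key is reused; the remote
    checksum is never compared against the stored one."""
    version = []
    for filename, remote_sha256 in remote_metadata:
        version.append((filename, content_pool.setdefault(filename, remote_sha256)))
    return version

# A mislabeled upload claims the Django 2.0 filename for a 1.11.16 artifact...
create_content("Django-2.0.tar.gz", "sha256-of-1.11.16-tarball")
# ...and a later sync of the real Django 2.0 silently reuses the bogus unit:
repo_version = sync([("Django-2.0.tar.gz", "sha256-of-real-2.0-tarball")])
# repo_version -> [("Django-2.0.tar.gz", "sha256-of-1.11.16-tarball")]
```

The sync reuses whatever unit already owns the key, which is exactly how the Django 1.11.16 artifacts above end up in a repo version that was supposed to mirror 2.0.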
<br>
> <br>
> - Do you think it makes sense to support (some of) these use cases?<br>
> <br>
> Yes<br>
> <br>
> - If so, are there plans to do so that I am not aware of?<br>
> <br>
> No, except that I believe we need to consider multi-tenancy as a<br>
> possible solution. There are no plans or discussion on that yet (until<br>
> now!).<br>
> I hope the plugin API post-GA introduces some signing feature allowing<br>
> users to integrate Pulp w/ local and message-based signing options.<br>
> This is related to your RPM signing point above.<br>
<br>
Although the scenarios outlined above are partly set in a multi-tenancy<br>
setting, I don't think that missing support for multi-tenancy is at<br>
the core of the problem. You are right in saying that multi-tenancy<br>
requires isolation at the content level. But even without multi-tenancy<br>
(i.e. with full access to all content), I expect a repo manager like<br>
Pulp to provide isolation at the repo level:<br>
<br>
1. Unrelated content from other repos must not become visible in a repo<br>
<br>
If I sync two repos from different remotes, I expect the local repo<br>
versions to be mirrors of the respective upstreams. I don't expect<br>
to find content from repo 1 in the mirror of repo 2 just because it<br>
resembles the actual content of repo 2 at the metadata level.<br>
<br>
Especially, if the remotes provide cryptographic checksums in their<br>
metadata, I can't find any good justification why Pulp should just<br>
decide to ignore them and add unrelated content to a synced repo<br>
version.<br>
<br>
2. Content of other repo versions must not impact my ability to<br>
create a repo version (as long as the created repo version is consistent)<br>
<br>
Basically, there are constraints on two levels:<br>
<br>
1. Uniqueness constraints on overall content<br>
2. Uniqueness constraints on content of a repo version<br>
<br>
Pulp core currently has no direct support for 2 AFAIK. Some plugins seem<br>
to enforce these constraints at level 1, possibly affecting all repo<br>
versions. 'pulp_file' avoids the latter by having more lenient<br>
constraints at level 1, but it has no level-2 constraints and, thus,<br>
does not ensure that a repo version is publishable.<br>
<br>
Maybe we need support for repo version constraints?<br>
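For illustration, a minimal sketch of what such a level-2 (per-repo-version) check could look like; `validate_repo_version`, its arguments, and the unit dicts are hypothetical, not an existing Pulp API:

```python
# Hypothetical sketch: uniqueness is enforced only within the candidate repo
# version, so the plugin-wide pool may hold several units sharing a "publish
# key" (e.g. filename for pulp_python, NEVRA for pulp_rpm).
from collections import Counter

def validate_repo_version(units, publish_key):
    """Reject a candidate repo version whose units would clash on publish."""
    counts = Counter(publish_key(unit) for unit in units)
    clashes = sorted(key for key, count in counts.items() if count > 1)
    if clashes:
        raise ValueError("repo version not publishable, duplicate keys: %s" % clashes)
    return units

# Two RPMs with the same NEVRA (e.g. a signed and an unsigned build) can
# coexist in the pool (unique by pkgId), but not in one repo version:
units = [
    {"nevra": "acme-utils-0:1.0-1.x86_64", "pkgId": "sha256-signed"},
    {"nevra": "acme-utils-0:1.0-1.x86_64", "pkgId": "sha256-unsigned"},
]
try:
    validate_repo_version(units, publish_key=lambda unit: unit["nevra"])
    publishable = True
except ValueError:
    publishable = False
# publishable is False: the clash is caught at version-creation time,
# not discovered later at publish time.
```

The point of the sketch is the timing: the check runs when the version is created, so cross-repo state never blocks it, yet every version that exists is guaranteed publishable.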
<br>
<br>
_______________________________________________<br>
Pulp-dev mailing list<br>
<a href="mailto:Pulp-dev@redhat.com" target="_blank">Pulp-dev@redhat.com</a><br>
<a href="https://www.redhat.com/mailman/listinfo/pulp-dev" rel="noreferrer" target="_blank">https://www.redhat.com/mailman/listinfo/pulp-dev</a><br>
</blockquote></div>