[Pulp-dev] pulp_rpm and The Pitfalls of urljoin()

David Davis daviddavis at redhat.com
Mon Mar 15 20:41:30 UTC 2021


I've been bitten by this a couple times. I also noticed that ansible
actually defines its own version of urljoin:

https://github.com/ansible/ansible/blob/00bd0b893d5d21de040b53032c466707bacb3b93/lib/ansible/galaxy/api.py#L166-L167

David


On Mon, Mar 15, 2021 at 3:44 PM Grant Gainey <ggainey at redhat.com> wrote:

> Hey folks,
>
> I was looking at https://pulp.plan.io/issues/7995 with an eye to fixing
> it for 3.12. The underlying problem can be summed up as "urljoin doesn't do
> what you expect if 'base' doesn't end in a '/'"[0][1]. Using urljoin() as
> if all it does is "concatenate these two strings with a '/' between them"
> is a pretty common misuse that works 'most of the time', alas.
>
> Looking around to see if we do this anywhere else, I find that pulp_rpm
> uses urljoin() in a number of places that I think might be subject to the
> same unintended problem. However, the semantics of sync'ing RPM
> repositories is even *more* nuanced than urljoin()'s is!
>
> I am tempted to replace urljoin() with just straightforward path-creation.
> Here's the list of places in pulp_rpm non-test python that uses it
> currently:
>
> (pulp3) (master) ~/github/Pulp3/pulp_rpm $ find . -name \*.py | grep -v
> tests | xargs grep -n "urljoin("
> ./pulp_rpm/app/kickstart/treeinfo.py:23:
>  url=urljoin(remote_url, namespace),
> silence_errors_for_response_status_codes={403, 404}
> ./pulp_rpm/app/models/repository.py:355:                gpgkey_path =
> urllib.parse.urljoin(
> ./pulp_rpm/app/models/repository.py:358:                gpgkey_path =
> urllib.parse.urljoin(gpgkey_path, self.base_path, True)
> ./pulp_rpm/app/tasks/synchronizing.py:101:    downloader =
> remote.get_downloader(url=urljoin(url, "repodata/repomd.xml"))
> ./pulp_rpm/app/tasks/synchronizing.py:138:    downloader =
> remote.get_downloader(url=urljoin(remote_url, "repodata/repomd.xml"))
> ./pulp_rpm/app/tasks/synchronizing.py:243:            new_url =
> urljoin(remote_url, path)
> ./pulp_rpm/app/tasks/synchronizing.py:430:
>  url=urljoin(self.data.remote_url, "repodata/repomd.xml")
> ./pulp_rpm/app/tasks/synchronizing.py:460:
>  url=urljoin(self.data.remote_url, self.treeinfo["filename"]),
> ./pulp_rpm/app/tasks/synchronizing.py:470:
>  url=urljoin(self.data.remote_url, path),
> ./pulp_rpm/app/tasks/synchronizing.py:599:                url =
> urljoin(self.data.remote_url, package.location_href)
> ./pulp_rpm/app/tasks/synchronizing.py:722:        repodata_url =
> urljoin(self.data.remote_url, record.location_href)
> ./pulp_rpm/app/tasks/synchronizing.py:726:        self.data.updateinfo_url
> = urljoin(self.data.remote_url, record.location_href)
> ./pulp_rpm/app/tasks/synchronizing.py:731:        comps_url =
> urljoin(self.data.remote_url, record.location_href)
> ./pulp_rpm/app/tasks/synchronizing.py:735:        self.data.modules_url =
> urljoin(self.data.remote_url, record.location_href)
> ./pulp_rpm/app/tasks/synchronizing.py:743:
>  url=urljoin(self.data.remote_url, record.location_href),
> ./pulp_rpm/app/downloaders.py:84:            url = urljoin(self.url,
> auth_param)
> (pulp3) (master) ~/github/Pulp3/pulp_rpm $
>
> Any thoughts before I dive down this rabbithole? I'm afraid I don't even
> have a pocketwatch...
>
> G
>
> [0] https://docs.python.org/2/library/urlparse.html#urlparse.urljoin
> [1] https://stackoverflow.com/a/10893427
> --
> Grant Gainey
> Principal Software Engineer, Red Hat System Management Engineering
> _______________________________________________
> Pulp-dev mailing list
> Pulp-dev at redhat.com
> https://listman.redhat.com/mailman/listinfo/pulp-dev
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/pulp-dev/attachments/20210315/6871717a/attachment.htm>


More information about the Pulp-dev mailing list