[Pulp-dev] pulp_rpm and The Pitfalls of urljoin()

Grant Gainey ggainey at redhat.com
Thu Mar 18 11:26:52 UTC 2021

On Thu, Mar 18, 2021 at 4:32 AM Quirin Pamp <pamp at atix.de> wrote:

> Hi,
> it looks like there is now a commit on pulp_rpm that uses something named
> "urlpath_sanitize" instead of urljoin.
> This is causing a major clash with this PR:
> https://github.com/pulp/pulp_rpm/pull/1896 which has replaced a lot of
> urljoin's with os.path.join.
> My problem is that since I am not the original author of this PR (I merely
> adopted it), I am not sure what the exact background of the switch to
> "os.path.join" is.
> I presume, it was necessary (and perhaps trying to solve the same issue as
> the "urlpath_sanitize").
> I can of course rebase the path and test it out in either version (using
> urlpath_sanitize vs os.path.join), but if you have already put some recent
> thought into the intricacies of urljoin, I would appreciate your input
> before I do. 😉

Well, this raises a really good point.

Both of these PRs address the fact that the code was using 'urljoin' when
all it really wanted (in almost all cases) was to join a base-url (like,
say, "remote_url") and a relative-path fragment. Places where base-url
didn't end in '/' caused a variety of subtle problems.

os.path.join()  does, in fact, address this problem - *on Linux systems*.
If you're running python on a system that uses something other than '/' for
the path-separator (like, say, Windows' '\') - then os.path.join() can do
Bad Things to your urls.

I built url_sanitize() to get what was wanted from the current uses of
urljoin(), while avoiding the pitfalls it was causing. Replacing it with
os.path.join() fixes that set of problems as well, but it feels like just
setting a different landmine to blow us up in the future, if/when we
support running Pulp3 on something non-Linux.

And in addition, the entities we're dealing with here, generally *aren't
filesystem paths*. It's misleading the future developer to imply that's
what they are.

Anyway, those were my thoughts when I went the url_sanitize path. More
and/or better ones, always welcome!


(My only other thoughts here are that the semantics of 'urljoin()' are WAY
too subtle, and we are not alone in writing code that is eventually
inadvertently broken by them. I honestly cannot recall ever seeing
urljoin() used in a place where those semantics were actually desired. But
that's just me being grumpy...)

> Regards,
> Quirin (quba42)
> ------------------------------
> *From:* pulp-dev-bounces at redhat.com <pulp-dev-bounces at redhat.com> on
> behalf of Grant Gainey <ggainey at redhat.com>
> *Sent:* 15 March 2021 20:43
> *To:* Pulp-dev <pulp-dev at redhat.com>
> *Subject:* [Pulp-dev] pulp_rpm and The Pitfalls of urljoin()
> Hey folks,
> I was looking at https://pulp.plan.io/issues/7995 with an eye to fixing
> it for 3.12. The underlying problem can be summed up as "urljoin doesn't do
> what you expect if 'base' doesn't end in a '/'"[0][1]. Using urljoin() as
> if all it does is "concatenate these two strings with a '/' between them"
> is a pretty common misuse that works 'most of the time', alas.
> Looking around to see if we do this anywhere else, I find that pulp_rpm
> uses urljoin() in a number of places that I think might be subject to the
> same unintended problem. However, the semantics of sync'ing RPM
> repositories is even *more* nuanced than urljoin()'s is!
> I am tempted to replace urljoin() with just straightforward path-creation.
> Here's the list of places in pulp_rpm non-test python that uses it
> currently:
> (pulp3) (master) ~/github/Pulp3/pulp_rpm $ find . -name \*.py | grep -v
> tests | xargs grep -n "urljoin("
> ./pulp_rpm/app/kickstart/treeinfo.py:23:
>  url=urljoin(remote_url, namespace),
> silence_errors_for_response_status_codes={403, 404}
> ./pulp_rpm/app/models/repository.py:355:                gpgkey_path =
> urllib.parse.urljoin(
> ./pulp_rpm/app/models/repository.py:358:                gpgkey_path =
> urllib.parse.urljoin(gpgkey_path, self.base_path, True)
> ./pulp_rpm/app/tasks/synchronizing.py:101:    downloader =
> remote.get_downloader(url=urljoin(url, "repodata/repomd.xml"))
> ./pulp_rpm/app/tasks/synchronizing.py:138:    downloader =
> remote.get_downloader(url=urljoin(remote_url, "repodata/repomd.xml"))
> ./pulp_rpm/app/tasks/synchronizing.py:243:            new_url =
> urljoin(remote_url, path)
> ./pulp_rpm/app/tasks/synchronizing.py:430:
>  url=urljoin(self.data.remote_url, "repodata/repomd.xml")
> ./pulp_rpm/app/tasks/synchronizing.py:460:
>  url=urljoin(self.data.remote_url, self.treeinfo["filename"]),
> ./pulp_rpm/app/tasks/synchronizing.py:470:
>  url=urljoin(self.data.remote_url, path),
> ./pulp_rpm/app/tasks/synchronizing.py:599:                url =
> urljoin(self.data.remote_url, package.location_href)
> ./pulp_rpm/app/tasks/synchronizing.py:722:        repodata_url =
> urljoin(self.data.remote_url, record.location_href)
> ./pulp_rpm/app/tasks/synchronizing.py:726:        self.data.updateinfo_url
> = urljoin(self.data.remote_url, record.location_href)
> ./pulp_rpm/app/tasks/synchronizing.py:731:        comps_url =
> urljoin(self.data.remote_url, record.location_href)
> ./pulp_rpm/app/tasks/synchronizing.py:735:        self.data.modules_url =
> urljoin(self.data.remote_url, record.location_href)
> ./pulp_rpm/app/tasks/synchronizing.py:743:
>  url=urljoin(self.data.remote_url, record.location_href),
> ./pulp_rpm/app/downloaders.py:84:            url = urljoin(self.url,
> auth_param)
> (pulp3) (master) ~/github/Pulp3/pulp_rpm $
> Any thoughts before I dive down this rabbithole? I'm afraid I don't even
> have a pocketwatch...
> G
> [0] https://docs.python.org/2/library/urlparse.html#urlparse.urljoin
> [1] https://stackoverflow.com/a/10893427
> --
> Grant Gainey
> Principal Software Engineer, Red Hat System Management Engineering

Grant Gainey
Principal Software Engineer, Red Hat System Management Engineering
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/pulp-dev/attachments/20210318/f0122486/attachment.htm>

More information about the Pulp-dev mailing list