[Pulp-list] Pulp and package.io

Sat Jan 31 20:52:52 UTC 2015

David,

Thanks for asking about this. Pulp will happily sync any valid yum repository.

Using a variation of the link you provided [0], I was able to sync one of their repositories with pulp 2.5 (with one catch that I'll get to in a moment). However, they don't make it easy to figure out which repository URL to use. I had to hack up their "installer" script [1] to see what link it would generate.

FWIW, it seems that their "el/7/x86_64/" repository has no packages. You can download the XML file [2] that contains a package list and see that it has no entries. If you tried to sync this, there would be no errors, but also no packages retrieved. Perhaps that was a source of confusion?
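If you want to check for an empty repository without eyeballing the XML, here is a minimal sketch. It assumes you have already downloaded the primary.xml.gz bytes yourself, and it relies on the standard yum metadata layout (a `metadata` root with a `packages` attribute and one `package` child per rpm). The function name and the synthetic sample are mine, not anything from pulp or packagecloud.

```python
import gzip
import xml.etree.ElementTree as ET
from typing import Tuple

# XML namespace used by yum's primary.xml metadata.
COMMON_NS = "http://linux.duke.edu/metadata/common"


def count_packages(primary_xml_gz: bytes) -> Tuple[int, int]:
    """Return (declared, actual) package counts from primary.xml.gz bytes."""
    root = ET.fromstring(gzip.decompress(primary_xml_gz))
    # The root element advertises how many packages the repo claims to hold.
    declared = int(root.get("packages", "0"))
    # Count the <package> children that are actually present.
    actual = len(root.findall(f"{{{COMMON_NS}}}package"))
    return declared, actual


# Synthetic stand-in for an empty repository's metadata (not real data).
EMPTY_PRIMARY = gzip.compress(
    (
        '<?xml version="1.0" encoding="UTF-8"?>'
        f'<metadata xmlns="{COMMON_NS}" packages="0"></metadata>'
    ).encode("utf-8")
)

if __name__ == "__main__":
    print(count_packages(EMPTY_PRIMARY))  # (0, 0)
```

A result of (0, 0) means a sync would succeed but retrieve nothing.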

Just to clarify, their implementation detail of redirecting to S3 links is reasonable and should be completely transparent to the user of any HTTP client that follows redirects, yum and pulp included. For anyone who wants to understand what's going on under the hood, see the example [3] below. Accessing a file's URL returns a 302 redirect to a signed, time-limited S3 URL. I don't know why they're using signed/expiring URLs when the links on packagecloud.io are wide open, but there's certainly no harm.
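To put a number on how long the signed link lasts, here is a small sketch that works only from the Date and Location headers in [3]. It assumes the `Expires` query parameter is a Unix timestamp in seconds, which is how S3 query-string signing works as far as I know. The function name is made up for illustration.

```python
from email.utils import parsedate_to_datetime
from urllib.parse import parse_qs, urlsplit

# Values copied from the curl -I output in [3].
DATE_HEADER = "Sat, 31 Jan 2015 17:20:40 GMT"
LOCATION = (
    "https://packagecloud-repositories.s3.amazonaws.com/empty/rpm/primary.xml.gz"
    "?AWSAccessKeyId=AKIAI44QGWC7C5WEV4XA"
    "&Signature=Wq80Dw1MhI9kFe8OoB3puB6kJmw%3D"
    "&Expires=1422725140"
)


def signed_url_lifetime(date_header: str, location: str) -> int:
    """Seconds between the response Date and the URL's Expires timestamp."""
    issued = parsedate_to_datetime(date_header).timestamp()
    query = parse_qs(urlsplit(location).query)
    # Assumption: Expires is seconds since the Unix epoch.
    expires = int(query["Expires"][0])
    return int(expires - issued)


if __name__ == "__main__":
    print(signed_url_lifetime(DATE_HEADER, LOCATION))  # 300
```

So in this response the link was good for five minutes, which is plenty for a client that follows the redirect immediately.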

Now for the catch... some of these rpms are huge. Not just in bytes (~140MB), but in file count: many contain 50,000-60,000 files. It looks like there is practically an entire operating system bundled into one rpm. While this is technically possible, it's not a normal use of the rpm package format, and pulp is not able to catalog some of these rpms. The problem is that there is so much metadata (mostly the file list) that it literally won't fit into a single mongodb object. Unfortunately, we don't have a good solution right now for handling rpms that large. Ideas are welcome. In theory, we could compress the XML before saving it in the database, but I wonder what impact that would have on our publish performance.
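As a rough illustration of the compression idea, the sketch below builds a made-up file list and compares its raw and zlib-compressed sizes. Everything here is hypothetical: the paths are synthetic (and more repetitive than real ones, so they compress unusually well), the helper names are mine, and none of this is pulp code. The constant is MongoDB's documented 16 MB limit for a single BSON document. It says nothing about the decompression cost at publish time, which is the part I'm unsure about.

```python
import zlib

# MongoDB's documented maximum BSON document size (16 MB).
MAX_BSON_BYTES = 16 * 1024 * 1024


def synthetic_filelist_xml(n_files: int) -> bytes:
    """Build a made-up <package> file list with n_files entries."""
    entries = "".join(
        f"<file>/opt/example/embedded/lib/module_{i:05d}/file_{i:05d}.rb</file>"
        for i in range(n_files)
    )
    return f"<package>{entries}</package>".encode("utf-8")


def pack(xml: bytes) -> bytes:
    """Compress XML before it would be stored."""
    return zlib.compress(xml, 6)


def unpack(blob: bytes) -> bytes:
    """Restore the original XML, e.g. at publish time."""
    return zlib.decompress(blob)


if __name__ == "__main__":
    raw = synthetic_filelist_xml(60_000)
    packed = pack(raw)
    print(f"raw={len(raw)} bytes, compressed={len(packed)} bytes")
    print("fits in one document:", len(packed) < MAX_BSON_BYTES)
    print("round-trip ok:", unpack(packed) == raw)
```

Running the same comparison against the real metadata of one of the problem rpms would show whether compression alone is enough to get under the limit.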

In any case, I hope this is helpful. Let me know if you have any additional questions.

Michael

[0] https://packagecloud.io/chef/stable/el/6/x86_64/
[1] https://packagecloud.io/chef/stable/install
[2] https://packagecloud.io/chef/stable/el/7/x86_64/repodata/primary.xml.gz
[3] $ curl -I https://packagecloud.io/chef/stable/el/7/x86_64/repodata/primary.xml.gz
HTTP/1.1 302 Found
Server: nginx/1.1.19
Date: Sat, 31 Jan 2015 17:20:40 GMT
Content-Type: text/html;charset=utf-8
Content-Length: 0
Connection: keep-alive
Status: 302 Found
Location: https://packagecloud-repositories.s3.amazonaws.com/empty/rpm/primary.xml.gz?AWSAccessKeyId=AKIAI44QGWC7C5WEV4XA&Signature=Wq80Dw1MhI9kFe8OoB3puB6kJmw%3D&Expires=1422725140
X-XSS-Protection: 1; mode=block
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
Cache-Control: no-cache
X-Request-Id: a73f3650-74b6-4b3c-868e-3fbc4b345282
X-Runtime: 0.068724
Strict-Transport-Security: max-age=31536000
X-Frame-Options: DENY

----- Original Message -----
> From: "David Somers-Harris | David | OPS" <david.somers-harris at mail.rakuten.com>
> To: pulp-list at redhat.com
> Sent: Saturday, January 31, 2015 9:07:51 AM
> Subject: [Pulp-list] Pulp and package.io
> 
> 
> 
> Hello,
> 
> 
> 
> I’m trying to sync the Chef repository (hosted in package.io) into Foreman
> (theforeman.org) which uses Pulp but I’m not having any luck.
> 
> When I contacted support at package.io, it turns out that they store
> everything in S3 storage in the background and only expose over http what
> yum needs to be able to see the packages.
> 
> 
> 
> However, this is apparently not enough for pulp to be able to do
> repo-discovery and sync the repository.
> 
> What does pulp expect when it’s looking at a repository?
> 
> (e.g. it looks like pulp breaks if the actual URI of the rpm is not the same
> as the URI of the directory structure)
> 
> Are these expectations documented somewhere?
> 
> 
> 
> In short I want to give the guys at package.io a list of what pulp expects to
> see if there is anything they can do about supporting it.
> 
> 
> 
> 
> Thanks,
> 
> David
> 
> 
> 
> 
> 
> From: support.16458.940aef3ec148f754 at helpscout.net
> [mailto:support.16458.940aef3ec148f754 at helpscout.net] On Behalf Of
> packagecloud.io support
> Sent: Friday, January 30, 2015 1:45 AM
> To: Somers-Harris, David | David | OPS
> Subject: Re: Repo Syncing
> 
> 
> 
> 
> Joe (Jan 29 4:44pm):
> 
> Yes we are using S3. It's likely that pulp and similar tools would use the
> actual metadata found in the repository as opposed to traversing the
> directory structure itself.
> 
> Can you share some example URLs that work and I can show you similar URLs on
> packagecloud? In theory, pulp should simply need to know where to find the
> yum metadata and everything else will be taken care of itself.
> 
> --
> Joe Damato
> support at packagecloud.io
> 
> 
> David Somers-Harris (Jan 29 9:34am):
> 
> Hi Joe,
> 
> 
> 
> Thanks for the reply.
> 
> 
> 
> Foreman uses Pulp for handling its repositories.
> 
> http://www.pulpproject.org/
> 
> 
> 
> I think it basically does an http scrub with something similar to rsync.
> 
> We don't mind hosting large amount of data locally, it gives us more control
> and reduces our bandwidth.
> 
> 
> 
> I think Package Cloud would either need to simulate the full directory over
> http or Pulp would need to have a plugin to understand your API.
> 
> Do you use object storage compatible with S3?
> 
> 
> 
> 
> Regards,
> 
> David Somers-Harris
> 
> Global Operations Department
> 
> 
> 
> Joe (Jan 26 7:58am):
> 
> Hi David:
> 
> No, that's not possible because packagecloud doesn't work that way -- there
> are no actual directories mapped to a filesystem as you would get if you
> were using createrepo. I have no idea how Foreman works, but if you can
> provide more details on how Foreman's syncing/mirroring works, I can
> probably help you figure out what you need to do to accomplish this.
> packagecloud serves up files and metadata at URLs that yum and apt expect
> but those URLs are just an abstraction over how we store the data.
> 
> Keep in mind that packagecloud is actually able to retain all previous
> versions of uploaded packages, which means that if you are mirroring the
> entire Chef Stable Enterprise Linux repository for any individual version of
> Enterprise Linux, you will be consuming *considerable* disk space on your
> side.
> 
> --
> Joe Damato
> support at packagecloud.io
> 
> 
> David Somers-Harris (Jan 26 7:24am):
> 
> Hello,
> 
> 
> 
> 
> I would like to see directory listing under
> https://packagecloud.io/chef/stable/el so that I can sync to my local repo
> into Foreman .
> Is this possible?
> 
> 
> 
> 
> Thanks,
> David
> 
> 	
> 	
> 	
> 
> 
> 
> 
> 
> 
> 
> 
> _______________________________________________
> Pulp-list mailing list
> Pulp-list at redhat.com
> https://www.redhat.com/mailman/listinfo/pulp-list