[Pulp-dev] Handling RPM with long filelist in Pulp 2

Michael Hrivnak mhrivnak at redhat.com
Tue May 9 23:18:43 UTC 2017


Thanks for sending this nice summary. Thoughts in-line.

On Tue, May 9, 2017 at 6:03 PM, Tatiana Tereshchenko <ttereshc at redhat.com>
wrote:

> Currently Pulp is able to import an RPM with a filelist of up to ~14-15 MB,
> which probably covers most repositories but not all of them.
>
> Historically, for each RPM unit several potentially large data snippets
> are stored in db:
>  - XML snippets for RPM metadata
>  - parsed filelist
>  - parsed changelog
>
> XML snippets are compressed and so they require much less space than a
> huge parsed filelist or a changelog.
> Here is the issue [0] to track the effort of eliminating this limitation
> or at least increasing the size of filelist that Pulp can handle for each
> RPM.
>
> The question is what the best way to handle the issue is, keeping in mind
> that any substantial change or re-design adds risk and effort to the
> Pulp 2 line, while this won't be an issue in Pulp 3.
>

And to emphasize a certain perspective, this limitation has been present
for most or all of Pulp's existence.


>
> So far the options are:
>  1. Eliminate issue completely (e.g. by using GridFS)
>

If we were sticking with mongodb long-term, this would likely be the best
path. But it's a whole new area of uncertainty. I'm very hesitant to "rock
the boat" with Pulp 2 at this point.


>  2. Increase current limit for filelist by removing parsed version of it
> from db
>

Probably nobody would miss it, but it would be unfortunate to change this
in a Y (minor) release.


>  3. Do not solve it in Pulp2, wait for Pulp3 which won't have this issue
> at all
>

There doesn't seem to be any urgency from our users, so waiting for Pulp 3
should be acceptable.


>  4. Any other idea/option
>

We might be able to do this:
- try to save a new RPM
- if it fails with the document too large error, set the filelist and
changelog to None
- try to save it again

This would likely be low-risk, and it would mean only newly-added content
would be missing that particular data from the database. We suspect that
nobody uses those fields anyway, so it would probably be fine. Although I'd
hate to find out that it broke something. And schema divergence itself adds
complexity; we'd need to take this into account when migrating data to
Pulp3.
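
The retry idea above can be sketched roughly as follows. This is only an
illustration, not Pulp code: the "files" and "changelog" field names, the
save() helper, and the in-memory store are hypothetical, the exception class
is a stand-in for pymongo.errors.DocumentTooLarge, and the JSON-encoded size
is used as a rough approximation of the BSON size that MongoDB actually
checks against its 16 MB document limit.

```python
import json

# MongoDB rejects BSON documents larger than 16 MB.
MAX_DOC_BYTES = 16 * 1024 * 1024


class DocumentTooLarge(Exception):
    """Stand-in for pymongo.errors.DocumentTooLarge (assumption)."""


def save(unit, store):
    """Hypothetical save: refuse units whose encoded size exceeds the limit."""
    # JSON size is only an approximation of the real BSON size.
    size = len(json.dumps(unit).encode("utf-8"))
    if size > MAX_DOC_BYTES:
        raise DocumentTooLarge("unit is %d bytes" % size)
    store.append(unit)


def save_with_fallback(unit, store):
    """Try to save; on DocumentTooLarge, drop the parsed data and retry once.

    Returns True if the parsed filelist and changelog had to be dropped.
    """
    try:
        save(unit, store)
        return False
    except DocumentTooLarge:
        # Copy the unit with the two large parsed fields set to None.
        trimmed = dict(unit, files=None, changelog=None)
        # If this still fails, the error propagates to the caller.
        save(trimmed, store)
        return True


store = []
small = {"name": "small", "files": ["/usr/bin/a"], "changelog": ["init"]}
big = {"name": "big", "files": ["a" * (17 * 1024 * 1024)], "changelog": []}
dropped_small = save_with_fallback(small, store)
dropped_big = save_with_fallback(big, store)
```

Only units that actually hit the limit lose the parsed fields, which is why
the two kinds of documents would have to be told apart later on.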

On the whole, I think waiting to fix this in Pulp 3 is the best option
given what we know. This is just one of many limitations we'll be able to
break free of with Pulp 3, and that can't come soon enough.


> Some additional info:
>  - some thoughts and options [1] which were considered several months ago
>  - by removing the parsed filelist (and changelog?) from the db we will make
> room for really large RPM metadata. Pulp will be able to import any RPM
> with uncompressed metadata up to ~200MB (~14-15MB currently). Just for
> comparison, this is ~1.5 times bigger than the filelists.xml and other.xml
> of the whole EPEL7 repo combined.
>  - removing that data from the db will affect at least search endpoints like
> this [2], where all the data for a unit is returned in the response.
>
> [0] https://pulp.plan.io/issues/2747
> [1] https://etherpad.net/p/mongodb_DocumentTooLarge
> [2] http://docs.pulpproject.org/dev-guide/integration/rest-api/repo/content.html#advanced-unit-search
>

-- 

Michael Hrivnak

Principal Software Engineer, RHCE

Red Hat