[Pulp-dev] Repo version implementation
daviddavis at redhat.com
Tue Dec 12 21:46:10 UTC 2017
I think the use case you outline about adding/removing a small subset of
units is compelling, as I imagine a new version will almost always add or
remove only a small subset of units relative to the latest version.
The performance concerns around option 1 are worth a closer look. Katello
versions its content in a manner similar to the first option; it has been
dealing with hundreds of millions of associations between versioned repos
and content, and PostgreSQL has never been a problem (usually it’s
MongoDB). However, I’d like to run some benchmarks to confirm whether
it’ll be a problem. I talked to @dkliban and @bmbouter about
this and we came up with an outline of how to test managing one
billion association records:
I’m planning on coding this up tomorrow in a Django console script and
seeing how scaling up from 0 to 1 billion records affects Postgres’
On Tue, Dec 12, 2017 at 12:26 PM, Michael Hrivnak <mhrivnak at redhat.com> wrote:
> I expect both options to have equal ease of use for plugin writers.
> In both cases, I would expect the RepositoryVersion object to have a
> "content" attribute that returns a QuerySet. That's what the PR does
> currently, and the other approach could easily do the same.
> For adding and removing content, most plugins will let the core do that
> for them by using changesets. Any plugins that choose the DIY approach will
> do one of the following depending on which option is chosen:
> I don't think either places a burden on the plugin writer.
> If option 1 is chosen, some thought will be needed around where/when all
> the new relationships get made between a new version and its content. Would
> the core create an empty version and expect the plugin to fully populate it
> each time? Or would the core create a new version with the same content set
> as its predecessor, and then let the plugin add/remove as necessary?
> As for comparing versions, option 2 makes that very easy. Tracking the
> changes directly makes it easy to report on those changes quickly and
> For background, option 2 was created to accommodate the most common use
> case, and the one where our users have proven most performance-sensitive:
> after an initial large sync of a repo, additional content trickles in as a
> series of small changes (think bug fixes on a RHEL release). The changes
> need to be fast to write (during sync) and fast to read (incremental
> publish, incremental applicability calculation, etc). Either approach will
> likely work fine on a lightly-loaded system. But in a heavily-loaded
> environment similar to where we see Pulp 2 often running, you likely would
> see a meaningful difference between 10 inserts and 10,000 inserts.
> The other motivation was the issue of scale. PostgreSQL is a great
> database, but lots of data is lots of data. Consider a user with 10 repos,
> 10k content units in each, and 10 versions of each. That's a very small use
> case, and already would be 1M associations under option 1. As any of those
> numbers increase, you quickly get to hundreds of millions of associations
> for even a medium-sized deployment, which can have real impact on query
> performance, index size (you want your index in RAM when possible), index
> updates, not to mention the time it takes for a database backup (or
> restore!). So if you want to go with option 1, I encourage seeking
> realistic performance expectations first.
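The arithmetic in the quoted paragraph can be checked with a short sketch. It is only an illustration under stated assumptions: the function names are hypothetical, option 2 is assumed to store one row per unit from the initial sync plus one row per unit added later, and the figure of 10 changes per version is borrowed from the "10 inserts" example in the message rather than from any measurement.

```python
def option1_rows(repos, units_per_repo, versions):
    """Option 1: every version stores one association row per content unit.

    Assumes each version holds roughly the same number of units.
    """
    return repos * units_per_repo * versions


def option2_rows(repos, units_per_repo, versions, adds_per_version):
    """Option 2 (assumed model): rows record when a unit was added/removed.

    The initial sync writes one row per unit; each later version writes one
    new row per added unit. Removals update an existing row, so they add none.
    """
    per_repo = units_per_repo + (versions - 1) * adds_per_version
    return repos * per_repo


if __name__ == "__main__":
    # 10 repos, 10k units each, 10 versions each, 10 additions per version.
    print(option1_rows(10, 10_000, 10))
    print(option2_rows(10, 10_000, 10, 10))
```

Under these assumptions option 1 reaches the 1M associations mentioned in the message, while option 2 stays close to the size of the initial sync. Real row counts depend on how much content actually changes between versions.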
> I'm happy to make the last few updates to the PR for option 2, but I
> suppose I should wait for this discussion to come to a conclusion first.
> Keep me posted if you want to green-light option 2.
> On Tue, Dec 12, 2017 at 9:28 AM, Jeremy Audet <jaudet at redhat.com> wrote:
>> Gotcha. So, if I want to see whether or not some given piece of content
>> is in a repository, then I need to iterate through every RepositoryContent
>> related to a given RepositoryVersion, and check to see if any have a
>> non-null version_added and a null version_removed, right?
> That's already done and isolated in one place. You would just access
> myversion.content() to get a queryset, and use it like any other. There
> should be no need for a plugin writer to see or understand the join logic,
> regardless of what that logic is.
> Michael Hrivnak
> Principal Software Engineer, RHCE
> Red Hat
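For readers following the version_added / version_removed exchange, here is a minimal sketch of the kind of membership test under discussion. It uses plain Python instead of the Django ORM, and it is not the code in the PR. The integer version numbers, the content_id field, the helper name content_in_version, and the rule that a unit removed in version N is absent from N onward are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass(frozen=True)
class RepositoryContent:
    """Hypothetical stand-in for an association row under option 2."""
    content_id: str
    version_added: int
    version_removed: Optional[int] = None  # None means never removed


def content_in_version(rows: Iterable[RepositoryContent], version: int) -> Set[str]:
    """Return the ids of content present in the given version number.

    Assumed rule: a unit is present if it was added at or before this
    version and has not been removed at or before this version.
    """
    return {
        row.content_id
        for row in rows
        if row.version_added <= version
        and (row.version_removed is None or row.version_removed > version)
    }


if __name__ == "__main__":
    rows = [
        RepositoryContent("a", version_added=1),
        RepositoryContent("b", version_added=1, version_removed=3),
        RepositoryContent("c", version_added=2),
    ]
    for number in (1, 2, 3):
        print(number, sorted(content_in_version(rows, number)))
```

As the reply above notes, a plugin writer would not normally write this filter by hand; the point of the sketch is only to show that the check is a single filter over the association rows and does not require iterating version by version.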