[Pulp-dev] Repo version implementation

Tue Dec 12 17:26:41 UTC 2017

I expect both options to have equal ease of use for plugin writers.

In both cases, I would expect the RepositoryVersion object to have a
"content" attribute that returns a QuerySet. That's what the PR does
currently, and the other approach could easily do the same.

For adding and removing content, most plugins will let the core do that for
them by using changesets. Any plugins that choose the DIY approach will do
one of the following depending on which option is chosen:

    my_version.content.add(piece_of_content)
    my_version.add_content(piece_of_content)

I don't think either places a burden on the plugin writer.

If option 1 is chosen, some thought will be needed around where/when all
the new relationships get made between a new version and its content. Would
the core create an empty version and expect the plugin to fully populate it
each time? Or would the core create a new version with the same content set
as its predecessor, and then let the plugin add/remove as necessary?

As for comparing versions, option 2 makes that very easy. Tracking the
changes directly makes it easy to report on those changes quickly and
efficiently.
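
To illustrate why change tracking makes comparison cheap, here is a minimal in-memory sketch. It is not the code in the PR: the Association class, the integer version numbers, and the convention that version_removed names the first version that no longer contains the content are all assumptions made for the example. Only the version_added and version_removed field names come from this thread.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set, Tuple


@dataclass(frozen=True)
class Association:
    """Hypothetical stand-in for one content-to-repository row under option 2."""
    content: str
    version_added: int
    version_removed: Optional[int] = None  # None means "never removed"


def changes_in_version(rows: Iterable[Association], number: int) -> Tuple[Set[str], Set[str]]:
    """Return (added, removed) content names for one version.

    Only rows stamped with this version number are selected, so the
    work scales with the size of the change, not the size of the repo.
    """
    rows = list(rows)
    added = {r.content for r in rows if r.version_added == number}
    removed = {r.content for r in rows if r.version_removed == number}
    return added, removed


rows = [
    Association("bash-4.2", version_added=1, version_removed=2),
    Association("zlib-1.2", version_added=1),
    Association("bash-4.3", version_added=2),
]
added, removed = changes_in_version(rows, 2)
print(sorted(added), sorted(removed))
```

In a real database the two set comprehensions would presumably become two indexed filters on the version columns.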

For background, option 2 was created to accommodate the most common use
case, and the one where our users have proven most performance-sensitive:
after an initial large sync of a repo, additional content trickles in as a
series of small changes (think bug fixes on a RHEL release). The changes
need to be fast to write (during sync) and fast to read (incremental
publish, incremental applicability calculation, etc). Either approach will
likely work fine on a lightly loaded system. But in a heavily loaded
environment, similar to where we often see Pulp 2 running, you would likely
see a meaningful difference between 10 inserts and 10,000 inserts.

The other motivation was the issue of scale. PostgreSQL is a great
database, but lots of data is lots of data. Consider a user with 10 repos,
10k content units in each, and 10 versions of each. That's a very small use
case, and already would be 1M associations under option 1. As any of those
numbers increase, you quickly get to hundreds of millions of associations
for even a medium-sized deployment, which can have real impact on query
performance, index size (you want your index in RAM when possible), index
updates, not to mention the time it takes for a database backup (or
restore!). So if you want to go with option 1, I encourage testing its
performance against realistic data sizes first.
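
The arithmetic above can be sketched as a rough row-count model. The function names and the adds_per_version parameter are made up for illustration, and the option 2 formula assumes one row per added unit, with removals updating an existing row instead of inserting a new one. Treat the numbers as a back-of-the-envelope estimate, not a benchmark.

```python
def option1_rows(repos: int, units: int, versions: int, adds_per_version: int = 0) -> int:
    """Rows if every version is associated with every unit it contains."""
    per_repo = sum(units + i * adds_per_version for i in range(versions))
    return repos * per_repo


def option2_rows(repos: int, units: int, versions: int, adds_per_version: int = 0) -> int:
    """Rows if only additions are recorded (assumed: one row per added unit)."""
    per_repo = units + (versions - 1) * adds_per_version
    return repos * per_repo


# The example from this thread: 10 repos, 10k units each, 10 versions each.
print(option1_rows(10, 10_000, 10))       # 1000000
# Same deployment under option 2, assuming 10 new units per later version.
print(option2_rows(10, 10_000, 10, 10))   # 100900
```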

I'm happy to make the last few updates to the PR for option 2, but I
suppose I should wait for this discussion to come to a conclusion first.
Keep me posted if you want to green-light option 2.

On Tue, Dec 12, 2017 at 9:28 AM, Jeremy Audet <jaudet at redhat.com> wrote:

> Gotcha. So, if I want to see whether or not some given piece of content is
> in a repository, then I need to iterate through every RepositoryContent
> related to a given RepositoryVersion, and check to see if any have a
> non-null version_added and a null version_removed, right?
>
>
That's already done and isolated in one place. You would just access
myversion.content() to get a queryset, and use it like any other. There
should be no need for a plugin writer to see or understand the join logic,
regardless of what that logic is.
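
For anyone curious what that isolated logic might look like, here is a plain-Python sketch of the membership test. It is an assumption-laden illustration, not the PR's implementation: RepoContentRow and content_of are hypothetical names, version numbers are modeled as integers, and version_removed is taken to mean the first version that no longer contains the content.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass(frozen=True)
class RepoContentRow:
    """Hypothetical row linking one piece of content to a repository."""
    content: str
    version_added: int
    version_removed: Optional[int] = None  # None means "never removed"


def content_of(rows: Iterable[RepoContentRow], number: int) -> Set[str]:
    """Content present in the given version.

    A row counts if it was added at or before this version and has not
    been removed at or before this version.
    """
    return {
        r.content
        for r in rows
        if r.version_added <= number
        and (r.version_removed is None or r.version_removed > number)
    }


rows = [
    RepoContentRow("bash-4.2", version_added=1, version_removed=2),
    RepoContentRow("zlib-1.2", version_added=1),
    RepoContentRow("bash-4.3", version_added=2),
]
print(sorted(content_of(rows, 1)))
print(sorted(content_of(rows, 2)))
```

The point is that this filter lives in one place, so a plugin writer only ever sees the resulting set.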

-- 

Michael Hrivnak

Principal Software Engineer, RHCE

Red Hat