[Pulp-dev] Single-Table Content API Changes, Performance Discussion

Daniel Alley dalley at redhat.com
Mon Nov 19 23:03:17 UTC 2018

TL;DR: the Pulp 3 single-table-content performance work may actually be
slower than pulp3 master already is, so we're not going to merge it until
we're sure it's faster. Either way, some of the API usability changes it
requires make sense on their own. They are described at the bottom, along
with a call for comment.

A few months back, jsherrill from the Katello project performed some
performance tests on Pulp 3 and found that (at the time), Pulp 3 was slower
than Pulp 2 by a significant margin.  To address this, we determined there
were three major improvements that could be made:

   - To make the RepositoryVersion add_content() and remove_content()
   methods take querysets instead of single content units

   - To use Django's "bulk_create" method with Artifact model objects,
   which inserts multiple models into the database in a single query

   - To remove multi-table model inheritance from the Content model so that
   it can also be used with bulk_create
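
As a rough illustration of why batching inserts helps (this is not Pulp or
Django code), the sketch below uses the standard-library sqlite3 module as a
stand-in. The "artifact" table, its columns, and the batch size are
assumptions made up for the example. It only counts how many INSERT
statements each approach issues for the same rows.

```python
import sqlite3


def make_connection():
    """Create an in-memory database with a hypothetical artifact table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE artifact (id INTEGER PRIMARY KEY, sha256 TEXT, size INTEGER)"
    )
    return conn


def insert_one_by_one(conn, rows):
    """Issue one INSERT per row, similar to saving each model individually."""
    statements = 0
    for row in rows:
        conn.execute("INSERT INTO artifact (sha256, size) VALUES (?, ?)", row)
        statements += 1
    return statements


def insert_in_bulk(conn, rows, batch_size=100):
    """Issue one multi-row INSERT per batch, the idea behind a bulk insert."""
    statements = 0
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        placeholders = ", ".join(["(?, ?)"] * len(batch))
        flat_values = [value for row in batch for value in row]
        conn.execute(
            f"INSERT INTO artifact (sha256, size) VALUES {placeholders}",
            flat_values,
        )
        statements += 1
    return statements


rows = [(f"{i:064x}", i) for i in range(1000)]  # fake digests and sizes

slow = make_connection()
fast = make_connection()
slow_statements = insert_one_by_one(slow, rows)
fast_statements = insert_in_bulk(fast, rows)

print(slow_statements, fast_statements)
```

Both connections end up with the same rows. The difference is the number of
statements sent to the database, which is where the savings come from when
each statement is a separate round trip.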

The first two changes were fairly trivial, and were implemented within a
week. The last change, making the Content models bulk_create compatible, is
very much not trivial and involves a number of API changes all throughout
Pulp.  Some of these are clear usability wins, some are more neutral, and
some make doing things in pulpcore much more difficult.  Brian and I
started this work about 5 weeks ago, and recently I have been testing its
performance against both Pulp 2 and standard (master branch) Pulp 3.

In the course of that performance testing, I've found that the performance
gap between Pulp 3 (master branch) and Pulp 2 has already disappeared, and
that we're either tying or beating Pulp 2 for all of the benchmarks I tried.

You can see some of these results here: https://pulp.plan.io/issues/37

I also found that, unfortunately, the single-table content branch of Pulp
is not currently faster, and in fact is slower by about 30-40%.   Results
here:  https://paste.fedoraproject.org/paste/4PG5l7fRonioBSNBOFnw9Q

 I've attached cProfile-generated call graphs but, in summary, despite
spending much less time inside the database relative to master branch (due
to using bulk_create for inserts), it seems that with very large
repositories, the queries that are generated are so large and complex that
the time spent generating and compiling the SQL takes longer than the
queries themselves.  This is likely due to the use of GenericForeignKey,
which many people in the Django community seem to take a dim view of
(however, it's really the only reasonable alternative to multi-table
inheritance in our specific use case).
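
For readers who want to reproduce this kind of measurement, here is a
minimal sketch of collecting a cProfile report sorted by cumulative time.
The functions build_sql(), run_query() and sync() are hypothetical stand-ins
invented for the example, not Pulp functions. The point is only to show how
time spent building a very large query shows up separately from time spent
executing it.

```python
import cProfile
import io
import pstats


def build_sql(ids):
    """Stand-in for query compilation: builds one very large IN (...) clause."""
    return "SELECT * FROM content WHERE id IN (" + ", ".join(str(i) for i in ids) + ")"


def run_query(sql):
    """Stand-in for the database round trip; does no real work here."""
    return len(sql)


def sync(unit_count):
    """Hypothetical sync step: compile a query, then execute it."""
    sql = build_sql(range(unit_count))
    return run_query(sql)


profiler = cProfile.Profile()
profiler.enable()
sync(70_000)
profiler.disable()

# Write the report into a string so it can be inspected or saved.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(10)
report = stream.getvalue()
print(report)
```

In a report like this, a compile-heavy workload shows most of its
cumulative time under the query-building function rather than under the
function that talks to the database.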

Notably, this only applies to very large quantities of content. As the
results show, a sync of 20k files is roughly as fast on either branch,
while a sync of 70k files is slower on the single-table branch.

Brian has volunteered to take a deep look at the single-table content code
to see if the performance issues are resolvable or if they are intrinsic to
the way GenericForeignKey works.  In the meantime, there are various things
we should take into consideration as we think about how to move forward if
single-table content is not merged because it can't be made faster.

Some of the API changes that are required by single-table-content would be
beneficial even if we didn't go forwards with the modelling changes.  For
instance, currently we have single endpoints for each of
repository_version/.../content/,  .../added_content/, and
.../removed_content/ which mix content of all types together.  This makes
it impossible for clients to expect the data returned to follow any
particular schema.  What the single-table-content branch does is provide
separate query urls for each content type present in the repository
version, which I believe is a usability win for us, and it's something we
could implement without using any of the modelling changes.
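
A minimal sketch of the idea, under stated assumptions: the function name,
the "type" key on each unit, the second content type, and the URL layout
below are all hypothetical and are not taken from the branch. It only shows
how a mixed list of content could be turned into one query URL per content
type present.

```python
def content_urls_by_type(version_href, content_units):
    """Map each content type present to its own (hypothetical) query URL."""
    present_types = sorted({unit["type"] for unit in content_units})
    return {
        content_type: f"{version_href}content/{content_type}/"
        for content_type in present_types
    }


# Mixed content, as a single combined endpoint would return it today.
units = [
    {"type": "pulp_file.filecontent", "relative_path": "a.iso"},
    {"type": "example_plugin.examplecontent", "name": "demo"},  # made-up type
    {"type": "pulp_file.filecontent", "relative_path": "b.iso"},
]

urls = content_urls_by_type("/example/repository_version/1/", units)
for content_type, url in urls.items():
    print(content_type, url)
```

With one URL per type, every result behind a given URL shares a single
schema, so a client can pick the serializer it needs up front.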

Besides being a general update, I'd like to start a discussion to
understand:  is changing the Pulp 3 API so that it's organized around
content type URLs OK with everyone? This resolves the usability issues of
returning mixed types. Are there any downsides with this approach?

To clarify what I mean on that last point -- by "content type URLs" I mean
that where you currently get back the url
"/pulp/api/v3/repository_version/.../content/" under the "_content" field
on a repoversion, you would instead get back something like

{ "pulp_file.filecontent":
-------------- next part --------------
A non-text attachment was scrubbed...
Name: benchmarking_results.zip
Type: application/zip
Size: 4800161 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/pulp-dev/attachments/20181119/64fb4569/attachment.zip>
