<div dir="ltr"><div class="gmail-"><div id="gmail-:2m9" class="gmail-ii gmail-gt"><div id="gmail-:2m8" class="gmail-a3s gmail-aXjCH gmail-m1672e318cd88f1ba" tabindex="-1"><div dir="ltr"><div id="gmail-m_-4136219612285385541m_6663064853145957797gmail-magicdomid5"><span>TL;DR:
the Pulp 3 single-table-content performance work may actually be slower
than Pulp 3 master as it stands, so we're not going to merge it until we're
sure it's faster. Regardless of how that turns out, there are some API
usability changes that we think make sense to do on their own. A call for
comment is at the bottom.</span></div><div id="gmail-m_-4136219612285385541m_6663064853145957797gmail-magicdomid6"></div><div id="gmail-m_-4136219612285385541m_6663064853145957797gmail-magicdomid7"><br></div><div id="gmail-m_-4136219612285385541m_6663064853145957797gmail-magicdomid8"><span>A
few months back, jsherrill from the Katello project performed some
performance tests on Pulp 3 and found that, at the time, Pulp 3 was
slower than Pulp 2 by a significant margin. To address this, we
determined there were three major improvements that could be made:</span></div><div id="gmail-m_-4136219612285385541m_6663064853145957797gmail-magicdomid9"><br></div><div id="gmail-m_-4136219612285385541m_6663064853145957797gmail-magicdomid10"><ul><li><span>To make the RepositoryVersion add_content() and remove_content() methods take querysets instead of single content units</span></li></ul></div><div id="gmail-m_-4136219612285385541m_6663064853145957797gmail-magicdomid11"><ul><li><span>To
use Django's "bulk_create" method with Artifact model objects, which
inserts multiple models into the database in a single query</span></li></ul></div><div id="gmail-m_-4136219612285385541m_6663064853145957797gmail-magicdomid12"><ul><li><span>To remove multi-table model inheritance from the Content model so that it can also be used with bulk_create</span></li></ul></div><div id="gmail-m_-4136219612285385541m_6663064853145957797gmail-magicdomid13"><span> The
first two changes were fairly trivial, and were implemented within a
week. The last change, making the Content models bulk_create compatible,
is very much not trivial and involves a number of API changes all
throughout Pulp. Some of these are clear usability wins, some are more
neutral, and some make doing things in pulpcore much more difficult.
Brian and I started this work about 5 weeks ago, and recently I have
been testing its performance against both Pulp 2 and standard (master
branch) Pulp 3.</span></div><div id="gmail-m_-4136219612285385541m_6663064853145957797gmail-magicdomid14"><br></div><div id="gmail-m_-4136219612285385541m_6663064853145957797gmail-magicdomid15"><span>In
the course of that performance testing, I've found that the performance
gap between Pulp 3 (master branch) and Pulp 2 has already disappeared,
and that we're either tying or beating Pulp 2 for all of the benchmarks I
tried.</span></div><div id="gmail-m_-4136219612285385541m_6663064853145957797gmail-magicdomid16"><br></div><div id="gmail-m_-4136219612285385541m_6663064853145957797gmail-magicdomid17"><span>You can see some of these results here: </span><span><a href="https://pulp.plan.io/issues/3770#note-15" target="_blank">https://pulp.plan.io/issues/37<wbr>70#note-15</a></span></div><div id="gmail-m_-4136219612285385541m_6663064853145957797gmail-magicdomid18"><br></div><div id="gmail-m_-4136219612285385541m_6663064853145957797gmail-magicdomid19"><span>I
also found that, unfortunately, the single-table content branch of Pulp
is not currently faster, and in fact is slower by about 30-40%.
Results here: </span><span><a href="https://paste.fedoraproject.org/paste/4PG5l7fRonioBSNBOFnw9Q" target="_blank">https://paste.fedoraproject.or<wbr>g/paste/4PG5l7fRonioBSNBOFnw9Q</a></span></div><div id="gmail-m_-4136219612285385541m_6663064853145957797gmail-magicdomid20"><br></div><div id="gmail-m_-4136219612285385541m_6663064853145957797gmail-magicdomid21"><span> I've
attached cProfile-generated call graphs but, in summary, despite
spending much less time inside the database relative to master branch
(due to using bulk_create for inserts), it seems that with very large
repositories, the queries that are generated are so large and complex
that the time spent generating and compiling the SQL takes longer than
the queries themselves. This is likely due to the use of
GenericForeignKey, which many people in the Django community seem to
take a dim view of (however, it's really the only reasonable alternative
to multi-table inheritance in our specific use case).</span></div><div id="gmail-m_-4136219612285385541m_6663064853145957797gmail-magicdomid22"><br></div><div id="gmail-m_-4136219612285385541m_6663064853145957797gmail-magicdomid23"><span>Notably,
this only applies to very large quantities of content: a sync of 20k files
is roughly as fast on either branch, while a sync of 70k files is
slower on the single-table branch, as seen in the results.</span></div><div id="gmail-m_-4136219612285385541m_6663064853145957797gmail-magicdomid24"><br></div><div id="gmail-m_-4136219612285385541m_6663064853145957797gmail-magicdomid25"><span>Brian
has volunteered to take a deep look at the single-table content code to
see if the performance issues are resolvable or if they are intrinsic
to the way GenericForeignKey works. In the meantime, there are various
things we should take into consideration as we think about how to move
forward</span><span> if single-table-content is not merged because it can't be made faster</span><span>.</span></div><div id="gmail-m_-4136219612285385541m_6663064853145957797gmail-magicdomid26"><br></div><div id="gmail-m_-4136219612285385541m_6663064853145957797gmail-magicdomid27"><span>Some
of the API changes that are required by single-table-content would be
beneficial even if we didn't go forwards with the modelling changes.
For instance, currently we have single endpoints for each of
repository_version/.../content<wbr>/, .../added_content/, and
.../removed_content/ which mix content of all types together. This
makes it impossible for clients to expect the data returned to follow
any particular schema. What the single-table-content branch does is provide
separate query urls for each content type present in the repository
version, which I believe is a usability win for us, and it's something
we could implement without using any of the modelling changes.</span></div><div id="gmail-m_-4136219612285385541m_6663064853145957797gmail-magicdomid28"><br></div><div id="gmail-m_-4136219612285385541m_6663064853145957797gmail-magicdomid29"><span>Besides being a general update, </span><span>I'd like to start a discussion</span><span>
to understand: is changing the Pulp 3 API so that it's organized
around content type URLs OK with everyone? This resolves the usability
issues of returning mixed types. Are there any downsides with this
approach?</span></div><div><span><br></span></div><div><span>To clarify what I mean on that last point -- by "content type URLs" I mean that where you currently get back the URL "<span>/pulp/api/v3/repository_versi<wbr>on/.../content/</span>" under the "_content" field on a repository version, you would instead get back something like:<br></span></div><div><span><br></span></div><div><span>{ "pulp_file.filecontent": "/pulp/api/v3/content/file/fil<wbr>es/?repository_version=.." }</span></div></div></div></div></div><br></div>
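<div><span>A minimal Python sketch of the "content type URLs" idea above, for illustration only. The function name content_urls, the CONTENT_ENDPOINTS table, and the REPO_VERSION_HREF placeholder are hypothetical and are not part of pulpcore; only the "pulp_file.filecontent" key and the "/pulp/api/v3/content/file/files/" path are taken from the example above.</span></div>
<pre>
```python
from urllib.parse import urlencode

# Hypothetical lookup table: content type name -> list endpoint for that type.
# Only this one entry appears in the email; a real plugin registry is assumed.
CONTENT_ENDPOINTS = {
    "pulp_file.filecontent": "/pulp/api/v3/content/file/files/",
}


def content_urls(repository_version_href, content_types):
    """Build one filtered query URL per content type in a repository version.

    Each URL points at the type-specific endpoint, so a client can rely on
    every result from that URL sharing a single schema.
    """
    urls = {}
    for content_type in content_types:
        base = CONTENT_ENDPOINTS[content_type]
        query = urlencode({"repository_version": repository_version_href})
        urls[content_type] = f"{base}?{query}"
    return urls


if __name__ == "__main__":
    # REPO_VERSION_HREF is a placeholder, not a real href.
    print(content_urls("REPO_VERSION_HREF", ["pulp_file.filecontent"]))
```
</pre>
<div><span>The point of the design is that the mixed-type .../content/ URL is replaced by a mapping keyed on content type, so each URL returns objects of exactly one type.</span></div>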