[Pulp-dev] Performance testing results, autoincrement ID vs UUID primary keys

Tue Mar 5 22:13:37 UTC 2019

+1 to switching back to UUIDs for the reasons Brian gave.

On 3/1/19 2:23 PM, Brian Bouterse wrote:
> I've finally gotten to read through the numbers and this thread. It is 
> a tradeoff but I am +1 for switching to UUIDs. I focus on the 
> PostgreSQL UUID vs int case because that is our default database. I 
> don't think too much about how things perform on MariaDB because they 
> can improve their own performance to catch up to PostgreSQL which 
> regularly is performing better afaict. I agree with the assessment of 
> 30% ish slowdown in the large unit cases for PostgreSQL. Still, I 
> believe the advantages of switching to UUIDs are worth it. Two main 
> reasons stick out in my mind.
>
> 1. Our core code and all plugin code will always be compatible with 
> common db backends even when using bulk_create()
> 2. We get database sharding with postgresql which you can only do with 
> UUID pks. I was advised this years ago by jcline.
>
> Performance and compatibility are a pretty classic trade-off. Overall 
> I've found that initial releases launch with less performance and 
> improve (often significantly) overtime. Consider the interpreter pypy 
> (not pypi).  It started "roughly 2000x slower [at initial launch] than 
> CPython, to roughly 7x faster [now]" [0]. Launching Pulp 3.0 that is 
> 30% slower in the worst-case but runs everywhere with zero 
> "db-behavior surprises" I think is worth it. Also conversely, if we 
> don't adopt UUIDs, how will we address item 1 pre RC?
>
> @dawalker for the "can we have both" option, we probably can have some 
> db-specific codepaths, but I don't think doing an application wide PK 
> type change as a setting is feasible to support. The db specific 
> codepaths are one way performance improves over time. For the initial 
> release, to keep things simple I hope we don't have conditional 
> database codepaths (for now).
>
> More discussion on this change is encouraged. Thanks @dalley so much 
> for all the detailed investigation!
>
> [0]: https://morepypy.blogspot.com/2018/09/the-first-15-years-of-pypy.html
>
> Thank you,
> Brian
>
> On Fri, Mar 1, 2019 at 2:51 PM Dana Walker <dawalker at redhat.com 
> <mailto:dawalker at redhat.com>> wrote:
>
>     As I brought up on irc, I don't know how feasible the
>     complications to maintenance would be going forward, but I would
>     prefer if we could use some sort of settings in order to choose
>     uuid or id based on MariaDB or PostgreSQL.  I want us to work
>     everywhere, but I'm really concerned at the impact to our users of
>     a 30-40% efficiency drop in speed and storage.
>
>     David wrote up a quick Proof of Concept after I brought this up
>     but wasn't necessarily advocating it himself.  I think Daniel and
>     Dennis expressed some concerns.  I'd like to see more people
>     discussing it here with reasoning/examples on how doable something
>     like this could be?
>
>     If it's not on the table, I understand, but want to make sure
>     we've considered all reasonable options, and that might not be a
>     simple binary of either/or.
>
>     Thanks,
>
>     --Dana
>
>     Dana Walker
>
>     Associate Software Engineer
>
>     Red Hat
>
>     <https://www.redhat.com>
>
>     <https://red.ht/sig>
>
>
>
>     On Fri, Mar 1, 2019 at 9:15 AM David Davis <daviddavis at redhat.com
>     <mailto:daviddavis at redhat.com>> wrote:
>
>         I just want to bump this thread. If we hope to make the Pulp 3
>         RC date, we need feedback today.
>
>         David
>
>
>         On Wed, Feb 27, 2019 at 5:09 PM Matt Pusateri
>         <mpusater at redhat.com <mailto:mpusater at redhat.com>> wrote:
>
>             Not sure if https://www.webyog.com/ Monyog will give a
>             free opensource project license.  But that might help
>             diagnose the MariaDB performance.  Monyog is really nice,
>             I wish it supported Postgres.
>
>             Matt P.
>
>             On Tue, Feb 26, 2019 at 7:23 PM Daniel Alley
>             <dalley at redhat.com <mailto:dalley at redhat.com>> wrote:
>
>                 Hello all,
>
>                 We've had an ongoing discussion about whether Pulp
>                 would be able to perform acceptably if we switched
>                 back to UUID primary keys.  I've finished doing the
>                 performance testing and I *think* the answer is yes. 
>                 Although to be honest, I'm not sure that I understand
>                 why, in the case of MariaDB.
>
>                 I linked my testing methodology and results here:
>                 https://pulp.plan.io/issues/4290#note-18
>
>                 To summarize, I tested the following:
>
>                 * How long it takes to perform subsequent large (lazy)
>                 syncs, with lots of content in the database (100-400k
>                 content units)
>                 * How long it takes to perform various small but
>                 important database queries
>
>                 The results were weirdly in contrast in some cases.
>
>                 The first four syncs (202,000 content total) behaved
>                 mostly the same on PostgreSQL whether it used an
>                 autoincrement or UUID primary key. Subsequent syncs
>                 had a performance drop of between 30-40%.  Likewise,
>                 the code snippets performed 30+% worse.  Sync time
>                 scaled linearly"ish" with the amont of content in the
>                 repository in both cases, which was a bit surprising
>                 to me.  The size of the database at the end was 30-40%
>                 larger with UUID primary keys, 736 MB vs 521 MB.  The
>                 gap would be smaller in typical usage when you
>                 consider that most content types have more metadata
>                 than FileContent (what I was testing).
>
>                 Autoincrement PostgreSQL (left) vs. UUID PostgreSQL
>                 (right) in diff form
>                 https://www.diffchecker.com/40AF8vvM
>
>                 With MariaDB the first sync was almost 80% slower than
>                 the first sync w/ PostgreSQL, but every subsequent
>                 sync was as fast or faster, despite the tests of
>                 specific queries performing multiple times worse. 
>                 Additionally the sync performance did not decrease as
>                 rapidly as it did under PostgreSQL.  With MariaDB, one
>                 of my test queries that worked fine when backed by
>                 PostgreSQL ended up hanging endlessly and I had to cut
>                 it off after 25 or so minutes. [0] I would consider
>                 that a blocker to claiming we support MariaDB / MySQL.
>
>                 But overall I'm not sure how to interpret the fact
>                 that on one hand the real-usage performance is equal
>                 or better better, and on the performance of some of
>                 the underlying queries is noticably worse.  Maybe
>                 there's some weird caching going on in the backend, or
>                 the generated indexes are different?
>
>                 UUID PostgreSQL (left) vs. UUID MariaDB (right) in
>                 diff form
>                 https://www.diffchecker.com/W1nnIQgj
>
>                 I'd like to invite some discussion on this, but
>                 nothing I've mentioned seems like it would be a
>                 problem for going forwards with using UUID primary
>                 keys in a general sense.  If we're all in agreement
>                 about that engineering decision then we can move
>                 forwards with that work.
>
>                 [0] for *some* but not all repository versions.  No
>                 idea what's up there.
>
>
>
>
>
>
>
>                 _______________________________________________
>                 Pulp-dev mailing list
>                 Pulp-dev at redhat.com <mailto:Pulp-dev at redhat.com>
>                 https://www.redhat.com/mailman/listinfo/pulp-dev
>
>             _______________________________________________
>             Pulp-dev mailing list
>             Pulp-dev at redhat.com <mailto:Pulp-dev at redhat.com>
>             https://www.redhat.com/mailman/listinfo/pulp-dev
>
>         _______________________________________________
>         Pulp-dev mailing list
>         Pulp-dev at redhat.com <mailto:Pulp-dev at redhat.com>
>         https://www.redhat.com/mailman/listinfo/pulp-dev
>
>     _______________________________________________
>     Pulp-dev mailing list
>     Pulp-dev at redhat.com <mailto:Pulp-dev at redhat.com>
>     https://www.redhat.com/mailman/listinfo/pulp-dev
>
>
> _______________________________________________
> Pulp-dev mailing list
> Pulp-dev at redhat.com
> https://www.redhat.com/mailman/listinfo/pulp-dev

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/pulp-dev/attachments/20190305/6cfe5187/attachment.htm>