[Pulp-dev] Performance testing results, autoincrement ID vs UUID primary keys

Brian Bouterse bbouters at redhat.com
Fri Mar 1 20:23:10 UTC 2019

I've finally gotten to read through the numbers and this thread. It is a
tradeoff but I am +1 for switching to UUIDs. I focus on the PostgreSQL UUID
vs int case because that is our default database. I don't think too much
about how things perform on MariaDB because they can improve their own
performance to catch up to PostgreSQL which regularly is performing better
afaict. I agree with the assessment of 30% ish slowdown in the large unit
cases for PostgreSQL. Still, I believe the advantages of switching to UUIDs
are worth it. Two main reasons stick out in my mind.

1. Our core code and all plugin code will always be compatible with common
db backends even when using bulk_create()
2. We get database sharding with postgresql which you can only do with UUID
pks. I was advised this years ago by jcline.

Performance and compatibility are a pretty classic trade-off. Overall I've
found that initial releases launch with less performance and improve (often
significantly) overtime. Consider the interpreter pypy (not pypi).  It
started "roughly 2000x slower [at initial launch] than CPython, to roughly
7x faster [now]" [0]. Launching Pulp 3.0 that is 30% slower in the
worst-case but runs everywhere with zero "db-behavior surprises" I think is
worth it. Also conversely, if we don't adopt UUIDs, how will we address
item 1 pre RC?

@dawalker for the "can we have both" option, we probably can have some
db-specific codepaths, but I don't think doing an application wide PK type
change as a setting is feasible to support. The db specific codepaths are
one way performance improves over time. For the initial release, to keep
things simple I hope we don't have conditional database codepaths (for now).

More discussion on this change is encouraged. Thanks @dalley so much for
all the detailed investigation!

[0]: https://morepypy.blogspot.com/2018/09/the-first-15-years-of-pypy.html

Thank you,

On Fri, Mar 1, 2019 at 2:51 PM Dana Walker <dawalker at redhat.com> wrote:

> As I brought up on irc, I don't know how feasible the complications to
> maintenance would be going forward, but I would prefer if we could use some
> sort of settings in order to choose uuid or id based on MariaDB or
> PostgreSQL.  I want us to work everywhere, but I'm really concerned at the
> impact to our users of a 30-40% efficiency drop in speed and storage.
> David wrote up a quick Proof of Concept after I brought this up but wasn't
> necessarily advocating it himself.  I think Daniel and Dennis expressed
> some concerns.  I'd like to see more people discussing it here with
> reasoning/examples on how doable something like this could be?
> If it's not on the table, I understand, but want to make sure we've
> considered all reasonable options, and that might not be a simple binary of
> either/or.
> Thanks,
> --Dana
> Dana Walker
> Associate Software Engineer
> Red Hat
> <https://www.redhat.com>
> <https://red.ht/sig>
> On Fri, Mar 1, 2019 at 9:15 AM David Davis <daviddavis at redhat.com> wrote:
>> I just want to bump this thread. If we hope to make the Pulp 3 RC date,
>> we need feedback today.
>> David
>> On Wed, Feb 27, 2019 at 5:09 PM Matt Pusateri <mpusater at redhat.com>
>> wrote:
>>> Not sure if https://www.webyog.com/ Monyog will give a free opensource
>>> project license.  But that might help diagnose the MariaDB performance.
>>> Monyog is really nice, I wish it supported Postgres.
>>> Matt P.
>>> On Tue, Feb 26, 2019 at 7:23 PM Daniel Alley <dalley at redhat.com> wrote:
>>>> Hello all,
>>>> We've had an ongoing discussion about whether Pulp would be able to
>>>> perform acceptably if we switched back to UUID primary keys.  I've finished
>>>> doing the performance testing and I *think* the answer is yes.  Although to
>>>> be honest, I'm not sure that I understand why, in the case of MariaDB.
>>>> I linked my testing methodology and results here:
>>>> https://pulp.plan.io/issues/4290#note-18
>>>> To summarize, I tested the following:
>>>> * How long it takes to perform subsequent large (lazy) syncs, with lots
>>>> of content in the database (100-400k content units)
>>>> * How long it takes to perform various small but important database
>>>> queries
>>>> The results were weirdly in contrast in some cases.
>>>> The first four syncs (202,000 content total) behaved mostly the same on
>>>> PostgreSQL whether it used an autoincrement or UUID primary key.
>>>> Subsequent syncs had a performance drop of between 30-40%.  Likewise, the
>>>> code snippets performed 30+% worse.  Sync time scaled linearly"ish" with
>>>> the amont of content in the repository in both cases, which was a bit
>>>> surprising to me.  The size of the database at the end was 30-40% larger
>>>> with UUID primary keys, 736 MB vs 521 MB.  The gap would be smaller in
>>>> typical usage when you consider that most content types have more metadata
>>>> than FileContent (what I was testing).
>>>> Autoincrement PostgreSQL (left) vs. UUID PostgreSQL (right) in diff form
>>>> https://www.diffchecker.com/40AF8vvM
>>>> With MariaDB the first sync was almost 80% slower than the first sync
>>>> w/ PostgreSQL, but every subsequent sync was as fast or faster, despite the
>>>> tests of specific queries performing multiple times worse.  Additionally
>>>> the sync performance did not decrease as rapidly as it did under
>>>> PostgreSQL.  With MariaDB, one of my test queries that worked fine when
>>>> backed by PostgreSQL ended up hanging endlessly and I had to cut it off
>>>> after 25 or so minutes. [0]  I would consider that a blocker to claiming we
>>>> support MariaDB / MySQL.
>>>> But overall I'm not sure how to interpret the fact that on one hand the
>>>> real-usage performance is equal or better better, and on the performance of
>>>> some of the underlying queries is noticably worse.  Maybe there's some
>>>> weird caching going on in the backend, or the generated indexes are
>>>> different?
>>>> UUID PostgreSQL (left) vs. UUID MariaDB (right) in diff form
>>>> https://www.diffchecker.com/W1nnIQgj
>>>> I'd like to invite some discussion on this, but nothing I've mentioned
>>>> seems like it would be a problem for going forwards with using UUID primary
>>>> keys in a general sense.  If we're all in agreement about that engineering
>>>> decision then we can move forwards with that work.
>>>> [0] for *some* but not all repository versions.  No idea what's up
>>>> there.
>>>> _______________________________________________
>>>> Pulp-dev mailing list
>>>> Pulp-dev at redhat.com
>>>> https://www.redhat.com/mailman/listinfo/pulp-dev
>>> _______________________________________________
>>> Pulp-dev mailing list
>>> Pulp-dev at redhat.com
>>> https://www.redhat.com/mailman/listinfo/pulp-dev
>> _______________________________________________
>> Pulp-dev mailing list
>> Pulp-dev at redhat.com
>> https://www.redhat.com/mailman/listinfo/pulp-dev
> _______________________________________________
> Pulp-dev mailing list
> Pulp-dev at redhat.com
> https://www.redhat.com/mailman/listinfo/pulp-dev
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/pulp-dev/attachments/20190301/e25a0716/attachment.htm>

More information about the Pulp-dev mailing list