[Pulp-dev] Performance testing results, autoincrement ID vs UUID primary keys

Wed Feb 27 00:22:39 UTC 2019

Hello all,

We've had an ongoing discussion about whether Pulp would be able to perform
acceptably if we switched back to UUID primary keys.  I've finished doing
the performance testing and I *think* the answer is yes.  Although to be
honest, I'm not sure that I understand why, in the case of MariaDB.

I linked my testing methodology and results here:
https://pulp.plan.io/issues/4290#note-18

To summarize, I tested the following:

* How long it takes to perform subsequent large (lazy) syncs, with lots of
content in the database (100-400k content units)
* How long it takes to perform various small but important database queries

The results were weirdly in contrast in some cases.

The first four syncs (202,000 content total) behaved mostly the same on
PostgreSQL whether it used an autoincrement or UUID primary key.
Subsequent syncs had a performance drop of between 30-40%.  Likewise, the
code snippets performed 30+% worse.  Sync time scaled linearly"ish" with
the amont of content in the repository in both cases, which was a bit
surprising to me.  The size of the database at the end was 30-40% larger
with UUID primary keys, 736 MB vs 521 MB.  The gap would be smaller in
typical usage when you consider that most content types have more metadata
than FileContent (what I was testing).

Autoincrement PostgreSQL (left) vs. UUID PostgreSQL (right) in diff form
https://www.diffchecker.com/40AF8vvM

With MariaDB the first sync was almost 80% slower than the first sync w/
PostgreSQL, but every subsequent sync was as fast or faster, despite the
tests of specific queries performing multiple times worse.  Additionally
the sync performance did not decrease as rapidly as it did under
PostgreSQL.  With MariaDB, one of my test queries that worked fine when
backed by PostgreSQL ended up hanging endlessly and I had to cut it off
after 25 or so minutes. [0]  I would consider that a blocker to claiming we
support MariaDB / MySQL.

But overall I'm not sure how to interpret the fact that on one hand the
real-usage performance is equal or better better, and on the performance of
some of the underlying queries is noticably worse.  Maybe there's some
weird caching going on in the backend, or the generated indexes are
different?

UUID PostgreSQL (left) vs. UUID MariaDB (right) in diff form
https://www.diffchecker.com/W1nnIQgj

I'd like to invite some discussion on this, but nothing I've mentioned
seems like it would be a problem for going forwards with using UUID primary
keys in a general sense.  If we're all in agreement about that engineering
decision then we can move forwards with that work.

[0] for *some* but not all repository versions.  No idea what's up there.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/pulp-dev/attachments/20190226/675d186c/attachment.htm>