[Pulp-dev] Performance testing results, autoincrement ID vs UUID primary keys

Fri Mar 1 21:52:28 UTC 2019

I'm less concerned with the difference between autoincrement vs. UUID
speed, and more concerned with how quickly performance was getting worse
with database size on PostgreSQL in both cases (and not on MariaDB
strangely).  There's probably a *lot* that can be done to improve
performance that we just haven't even looked into yet, and a few weeks of
effort on that front would make a much larger difference (probably) than
the type of PK.  Not that we should disregard it entirely.  The PK decision
has to be made soon though and the work I mentioned will have to wait a bit.

But I do think that If we *really* wanted to support MySQL/MariaDB while
retaining autoincrement PKs, the best option would be a small
MySQL-specific reimplementation of "bulk_create()" that would just call
.save() on all the objects in a loop.  It would probably be *much* slower
for MySQL, but it would be fairly simple (only a couple of lines), it would
work for both without compromising PostgreSQL performance, it would avoid
making the docs more confusing to users and it would be a lot less risky.

@Brian  Would sharding actually be valuable? Have any Pulp users approached
the sort of scale where it would be the right thing to do.  From what I've
heard, a single PostgreSQL installation is capable of handling 20 Terabytes
without tremendous issue. I can't imagine Pulp's database growing so large
that it would be more economical to manage a second database server than it
would be to add more storage to the server you have. I can be convinced
otherwise though.

I think a more compelling point 2 would be that in the multi-tenant use
case, UUIDs would make it vastly more difficult for one API user to gather
information on another user than autoincrement PKs.  Which, even though
we're not going to handle multi-tenant out of the gate, is a reasonable
thing to think about and possibly a good reason to go in that direction.

On Fri, Mar 1, 2019 at 3:24 PM Brian Bouterse <bbouters at redhat.com> wrote:

> I've finally gotten to read through the numbers and this thread. It is a
> tradeoff but I am +1 for switching to UUIDs. I focus on the PostgreSQL UUID
> vs int case because that is our default database. I don't think too much
> about how things perform on MariaDB because they can improve their own
> performance to catch up to PostgreSQL which regularly is performing better
> afaict. I agree with the assessment of 30% ish slowdown in the large unit
> cases for PostgreSQL. Still, I believe the advantages of switching to UUIDs
> are worth it. Two main reasons stick out in my mind.
>
> 1. Our core code and all plugin code will always be compatible with common
> db backends even when using bulk_create()
> 2. We get database sharding with postgresql which you can only do with
> UUID pks. I was advised this years ago by jcline.
>
> Performance and compatibility are a pretty classic trade-off. Overall I've
> found that initial releases launch with less performance and improve (often
> significantly) overtime. Consider the interpreter pypy (not pypi).  It
> started "roughly 2000x slower [at initial launch] than CPython, to roughly
> 7x faster [now]" [0]. Launching Pulp 3.0 that is 30% slower in the
> worst-case but runs everywhere with zero "db-behavior surprises" I think is
> worth it. Also conversely, if we don't adopt UUIDs, how will we address
> item 1 pre RC?
>
> @dawalker for the "can we have both" option, we probably can have some
> db-specific codepaths, but I don't think doing an application wide PK type
> change as a setting is feasible to support. The db specific codepaths are
> one way performance improves over time. For the initial release, to keep
> things simple I hope we don't have conditional database codepaths (for now).
>
> More discussion on this change is encouraged. Thanks @dalley so much for
> all the detailed investigation!
>
> [0]: https://morepypy.blogspot.com/2018/09/the-first-15-years-of-pypy.html
>
> Thank you,
> Brian
>
> On Fri, Mar 1, 2019 at 2:51 PM Dana Walker <dawalker at redhat.com> wrote:
>
>> As I brought up on irc, I don't know how feasible the complications to
>> maintenance would be going forward, but I would prefer if we could use some
>> sort of settings in order to choose uuid or id based on MariaDB or
>> PostgreSQL.  I want us to work everywhere, but I'm really concerned at the
>> impact to our users of a 30-40% efficiency drop in speed and storage.
>>
>> David wrote up a quick Proof of Concept after I brought this up but
>> wasn't necessarily advocating it himself.  I think Daniel and Dennis
>> expressed some concerns.  I'd like to see more people discussing it here
>> with reasoning/examples on how doable something like this could be?
>>
>> If it's not on the table, I understand, but want to make sure we've
>> considered all reasonable options, and that might not be a simple binary of
>> either/or.
>>
>> Thanks,
>>
>> --Dana
>>
>> Dana Walker
>>
>> Associate Software Engineer
>>
>> Red Hat
>>
>> <https://www.redhat.com>
>> <https://red.ht/sig>
>>
>>
>> On Fri, Mar 1, 2019 at 9:15 AM David Davis <daviddavis at redhat.com> wrote:
>>
>>> I just want to bump this thread. If we hope to make the Pulp 3 RC date,
>>> we need feedback today.
>>>
>>> David
>>>
>>>
>>> On Wed, Feb 27, 2019 at 5:09 PM Matt Pusateri <mpusater at redhat.com>
>>> wrote:
>>>
>>>> Not sure if https://www.webyog.com/ Monyog will give a free opensource
>>>> project license.  But that might help diagnose the MariaDB performance.
>>>> Monyog is really nice, I wish it supported Postgres.
>>>>
>>>> Matt P.
>>>>
>>>> On Tue, Feb 26, 2019 at 7:23 PM Daniel Alley <dalley at redhat.com> wrote:
>>>>
>>>>> Hello all,
>>>>>
>>>>> We've had an ongoing discussion about whether Pulp would be able to
>>>>> perform acceptably if we switched back to UUID primary keys.  I've finished
>>>>> doing the performance testing and I *think* the answer is yes.  Although to
>>>>> be honest, I'm not sure that I understand why, in the case of MariaDB.
>>>>>
>>>>> I linked my testing methodology and results here:
>>>>> https://pulp.plan.io/issues/4290#note-18
>>>>>
>>>>> To summarize, I tested the following:
>>>>>
>>>>> * How long it takes to perform subsequent large (lazy) syncs, with
>>>>> lots of content in the database (100-400k content units)
>>>>> * How long it takes to perform various small but important database
>>>>> queries
>>>>>
>>>>> The results were weirdly in contrast in some cases.
>>>>>
>>>>> The first four syncs (202,000 content total) behaved mostly the same
>>>>> on PostgreSQL whether it used an autoincrement or UUID primary key.
>>>>> Subsequent syncs had a performance drop of between 30-40%.  Likewise, the
>>>>> code snippets performed 30+% worse.  Sync time scaled linearly"ish" with
>>>>> the amont of content in the repository in both cases, which was a bit
>>>>> surprising to me.  The size of the database at the end was 30-40% larger
>>>>> with UUID primary keys, 736 MB vs 521 MB.  The gap would be smaller in
>>>>> typical usage when you consider that most content types have more metadata
>>>>> than FileContent (what I was testing).
>>>>>
>>>>> Autoincrement PostgreSQL (left) vs. UUID PostgreSQL (right) in diff
>>>>> form
>>>>> https://www.diffchecker.com/40AF8vvM
>>>>>
>>>>> With MariaDB the first sync was almost 80% slower than the first sync
>>>>> w/ PostgreSQL, but every subsequent sync was as fast or faster, despite the
>>>>> tests of specific queries performing multiple times worse.  Additionally
>>>>> the sync performance did not decrease as rapidly as it did under
>>>>> PostgreSQL.  With MariaDB, one of my test queries that worked fine when
>>>>> backed by PostgreSQL ended up hanging endlessly and I had to cut it off
>>>>> after 25 or so minutes. [0]  I would consider that a blocker to claiming we
>>>>> support MariaDB / MySQL.
>>>>>
>>>>> But overall I'm not sure how to interpret the fact that on one hand
>>>>> the real-usage performance is equal or better better, and on the
>>>>> performance of some of the underlying queries is noticably worse.  Maybe
>>>>> there's some weird caching going on in the backend, or the generated
>>>>> indexes are different?
>>>>>
>>>>> UUID PostgreSQL (left) vs. UUID MariaDB (right) in diff form
>>>>> https://www.diffchecker.com/W1nnIQgj
>>>>>
>>>>> I'd like to invite some discussion on this, but nothing I've mentioned
>>>>> seems like it would be a problem for going forwards with using UUID primary
>>>>> keys in a general sense.  If we're all in agreement about that engineering
>>>>> decision then we can move forwards with that work.
>>>>>
>>>>> [0] for *some* but not all repository versions.  No idea what's up
>>>>> there.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Pulp-dev mailing list
>>>>> Pulp-dev at redhat.com
>>>>> https://www.redhat.com/mailman/listinfo/pulp-dev
>>>>>
>>>> _______________________________________________
>>>> Pulp-dev mailing list
>>>> Pulp-dev at redhat.com
>>>> https://www.redhat.com/mailman/listinfo/pulp-dev
>>>>
>>> _______________________________________________
>>> Pulp-dev mailing list
>>> Pulp-dev at redhat.com
>>> https://www.redhat.com/mailman/listinfo/pulp-dev
>>>
>> _______________________________________________
>> Pulp-dev mailing list
>> Pulp-dev at redhat.com
>> https://www.redhat.com/mailman/listinfo/pulp-dev
>>
> _______________________________________________
> Pulp-dev mailing list
> Pulp-dev at redhat.com
> https://www.redhat.com/mailman/listinfo/pulp-dev
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/pulp-dev/attachments/20190301/9288ebeb/attachment.htm>