[Pulp-dev] Concerns about bulk_create and PostgreSQL

Wed Dec 5 14:27:12 UTC 2018

It looks like the chart was generated using MySQL 5.0.45 which was released
at least 10 years ago[0]. I don't think we can rely on such old results.

[0] https://en.wikipedia.org/wiki/MySQL#Milestones

On Wed, Dec 5, 2018 at 9:18 AM Daniel Alley <dalley at redhat.com> wrote:

> I just want to point out that using UUID PKs works perfectly fine on
> PostgreSQL but is considered a Bad Idea™ on MySQL for performance
> reasons.
>
> http://kccoder.com/mysql/uuid-vs-int-insert-performance/
>
> [image: image.png]
>
> It's hard to notice at first, but the blue and red lines (representing
> integer PKs) are tracking near the bottom.
>
> I did my testing with PostgreSQL, and I would completely agree that the
> tiny performance hit we noticed there would take a backseat to the
> functional benefits Brian is pointing out.  But if we really, truly want to
> be database agnostic, we should put more thought into this change (and
> others going forwards).
>
> Another factor that makes this a more complicated decision is that the
> limitations on using bulk_create() with multi-table models are more of a
> "simplification" on the Django side than a fundamental limitation.
> According to this comment [0] in the Django source code, and this issue [1]
> it's likely possible on PostgreSQL as-is, if we were willing to mess around
> inside the ORM a bit.  And it could be possible on MySQL also *if* we used
> UUID PKs.  And maybe the performance benefits of being able to use
> bulk_create() would override or reduce the performance downsides of using
> UUID with MySQL.  I don't know about that though... that's quite a chart,
> and without some experimentation this is all speculation.
>
> TL;DR If we want to stay DB agnostic it needs to be worked into our
> decision making process and not be an afterthought
>
> [0] https://github.com/django/django/blob/master/django/db/models/query.py#L438
> [1] https://code.djangoproject.com/ticket/28821
>
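
A minimal sketch of the idea above: when UUID PKs are generated by the
application, the parent keys are known before anything is written, so both
tables of a parent/child pair can be bulk-inserted on any backend. This uses
Python's sqlite3 module directly, not Django's ORM, and the parent/child
tables and column names are hypothetical stand-ins for a multi-table model.
It says nothing about performance.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
# Hypothetical schema: "child" shares its primary key with "parent",
# similar in shape to multi-table inheritance.
conn.execute("CREATE TABLE parent (id TEXT PRIMARY KEY, kind TEXT)")
conn.execute(
    "CREATE TABLE child ("
    "parent_id TEXT PRIMARY KEY REFERENCES parent(id), name TEXT)"
)

names = ["a", "b", "c"]
# Keys are generated client-side, before any INSERT is issued.
pks = [str(uuid.uuid4()) for _ in names]

# Bulk insert the parent rows, then the child rows that reference them.
conn.executemany(
    "INSERT INTO parent (id, kind) VALUES (?, ?)",
    [(pk, "child") for pk in pks],
)
conn.executemany(
    "INSERT INTO child (parent_id, name) VALUES (?, ?)",
    list(zip(pks, names)),
)
conn.commit()

# Every child row joins back to its parent without a round trip for ids.
joined = [
    row[0]
    for row in conn.execute(
        "SELECT c.name FROM child c JOIN parent p ON p.id = c.parent_id "
        "ORDER BY c.name"
    )
]
```

With auto-increment integer PKs the second executemany would need the ids
assigned by the first one, which is the part that differs between databases.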
>
> On Tue, Nov 20, 2018 at 10:00 AM Patrick Creech <pcreech at redhat.com>
> wrote:
>
>> On Mon, 2018-11-19 at 17:08 -0500, Brian Bouterse wrote:
>> > When we switched from UUID to integers for the PK we lost the ability to
>> > have the primary key set during bulk_create with databases other than
>> > PostgreSQL [0].
>> >
>> > With a goal of database agnosticism for Pulp3, if plugin writers plan to
>> > use bulk_create with any object inherited from one of ours, they will get
>> > different behaviors on different databases and they won't have PKs that
>> > they may require. bulk_create is a normal Django thing, so plugin writers
>> > making a Django plugin should be able to use it. This concerned me
>> > already, but today it was also brought up by non-RH plugin writers [1] in
>> > a PR.
>> >
>> > The tradeoffs between UUIDs and integer PKs are pretty well summed up in
>> > our ticket where we discussed that change [2]. Note, we did not consider
>> > this bulk_create downside at that time, which I think is the most
>> > significant downside to consider.
>> >
>> > Having bulk_create effectively not available for plugin writers (since we
>> > can't rely on its pks being returned) I think is a non-starter for me. I
>> > love how short the UUIDs made our URLs so that's the tradeoff mainly in my
>> > mind. Those balanced against each other, I think we should switch back.
>> >
>> > Another option is to become PostgreSQL only which (though I love psql) I
>> > think would be the wrong choice for Pulp from what I've heard from its
>> > users.
>> >
>> > What do you think? What should we do?
>>
>> So, my mind immediately goes to this question, which might be useful for
>> others to help make decisions, so I'll ask:
>>
>> When you say:
>>
>> "we lost the ability to have the primary key set during bulk_create"
>>
>> Can you clarify what you mean by this?
>>
>> My mind immediately goes to this chain of events:
>>
>>         When you use bulk_create, the existing in-memory model objects
>>         representing the data to create do not get updated with the
>>         primary key values that are created in the database.
>>
>>         Upon a subsequent query of the database, for the exact same set of
>>         objects just added, those objects _will_ have the primary key
>>         populated.
>>
>> In other words,
>>
>>         The database records themselves get the auto-increment IDs added,
>>         they just don't get reported back in that query to the ORM layer,
>>         therefore it takes a subsequent query to get those ids out.
>>
>> Does that about sum it up?
>>
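
The chain of events described above can be sketched without Django at all.
This is only an illustration using Python's sqlite3 module: the "content"
table and the dict-based "objects" are hypothetical stand-ins for a model and
its in-memory instances, and executemany stands in for a bulk insert that
reports no per-row ids back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table with a database-assigned auto-increment primary key.
conn.execute(
    "CREATE TABLE content (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT)"
)

# In-memory "model objects": the id is unknown until the database assigns it.
objs = [{"id": None, "name": n} for n in ("a", "b", "c")]

# Bulk insert in one call; nothing is written back onto the objects.
conn.executemany(
    "INSERT INTO content (name) VALUES (?)",
    [(o["name"],) for o in objs],
)
conn.commit()

# The in-memory objects still have no primary key.
ids_after_insert = [o["id"] for o in objs]

# A subsequent query is what actually returns the generated ids.
rows = conn.execute("SELECT id, name FROM content ORDER BY id").fetchall()
ids_from_query = [r[0] for r in rows]
names_from_query = [r[1] for r in rows]
```

Here ids_after_insert stays all None while ids_from_query holds the values
the database generated, which is the extra round trip in question.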
>>
>> >
>> > [0]: https://docs.djangoproject.com/en/2.1/ref/models/querysets/#bulk-create
>> > [1]: https://github.com/pulp/pulp/pull/3764#discussion_r234780702
>> > [2]: https://pulp.plan.io/issues/3848
>> > _______________________________________________
>> > Pulp-dev mailing list
>> > Pulp-dev at redhat.com
>> > https://www.redhat.com/mailman/listinfo/pulp-dev
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image.png
Type: image/png
Size: 241033 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/pulp-dev/attachments/20181205/3e6009fa/attachment.png>