[Pulp-dev] Concerns about bulk_create and PostgreSQL

Thu Jan 3 18:02:57 UTC 2019

I don't think that using integer ids with bulk_create and supporting
mysql/mariadb are necessarily mutually exclusive. I think there might be a
way to find the records created using bulk_create if we know the natural
key. It might be more performant than using UUIDs as well.

David

On Thu, Jan 3, 2019 at 11:04 AM Dennis Kliban <dkliban at redhat.com> wrote:

> Thank you Daniel for the explanation and for filing an issue[0] to do
> performance analysis of UUIDs.
>
> I really hope that we can switch back to using UUIDs so we can bring back
> MariaDB support for Pulp 3.
>
> [0] https://pulp.plan.io/issues/4290
>
> On Wed, Dec 5, 2018 at 1:35 PM Daniel Alley <dalley at redhat.com> wrote:
>
>> To rephrase the problem a little bit:
>>
>> We need to bulk_create() a bunch of objects, and then after we do that we
>> want to immediately be able to relate them with other objects, which means
>> we need their PKs of the objects that were just created.
>>
>> In the case of auto-increment integer PKs, we can't know that PK value
>> before it gets saved into the database.  Luckily, PostgreSQL (and Oracle)
>> support a "RETURNING" keyword that does provides this information.  The raw
>> SQL would look something like this:
>>
>>> INSERT INTO items (name) values ('bear') RETURNING id;
>>>
>>> Django uses this feature to set the PK field on the model objects it
>> returns when you call bulk_create() on a list of unsaved model objects.
>>
>> Unfortunately, MySQL doesn't support this, so there's no way to figure
>> out what the PKs of the objects you just saved were, so the ORM can't set
>> that information on the returned model objects.
>>
>> UUID PKs circumvent this because the PK gets created outside of the
>> database, prior to being saved in the database, and so Django *can* know
>> what the PK will be when it gets saved.
>>
>> On Wed, Dec 5, 2018 at 12:11 PM Brian Bouterse <bbouters at redhat.com>
>> wrote:
>>
>>> +1 to experimentation and also making sure that we understand the
>>> performance implications of the decision. I'm replying to this earlier note
>>> to restate my observations of the problem a bit more.
>>>
>>> More ideas and thoughts are welcome. This is a decision with a lot of
>>> aspects to consider.
>>>
>>>
>>> On Tue, Nov 20, 2018 at 10:00 AM Patrick Creech <pcreech at redhat.com>
>>> wrote:
>>>
>>>> On Mon, 2018-11-19 at 17:08 -0500, Brian Bouterse wrote:
>>>> > When we switched from UUID to integers for the PK
>>>> > with databases other than PostgreSQL [0].
>>>> >
>>>> > With a goal of database agnosticism for Pulp3, if plugin writers plan
>>>> to use bulk_create with any object inherited
>>>> > from one of ours, they can't will get different behaviors on
>>>> different databases and they won't have PKs that they may
>>>> > require. bulk_create is a normal django thing, so plugin writers
>>>> making a django plugin should be able to use it. This
>>>> > concerned me already, but today it was also brought up by non-RH
>>>> plugin writers also [1] in a PR.
>>>> >
>>>> > The tradeoffs bteween UUIDs versus PKs are pretty well summed up in
>>>> our ticket where we discussed that change [2].
>>>> > Note, we did not consider this bulk_create downside at that time,
>>>> which I think is the most significant downside to
>>>> > consider.
>>>> >
>>>> > Having bulk_create effectively not available for plugin writers
>>>> (since we can't rely on its pks being returned) I
>>>> > think is a non-starter for me. I love how short the UUIDs made our
>>>> URLs so that's the tradeoff mainly in my mind.
>>>> > Those balanced against each other, I think we should switch back.
>>>> >
>>>> > Another option is to become PostgreSQL only which (though I love
>>>> psql) I think would be the wrong choice for Pulp from
>>>> > what I've heard from its users.
>>>> >
>>>> > What do you think? What should we do?
>>>>
>>>> So, my mind immediately goes to this question, which might be usefull
>>>> for others to help make decisions, so I'll ask:
>>>>
>>>> When you say:
>>>>
>>>> "we lost the ability to have the primary key set during bulk_create"
>>>>
>>>> Can you clarify what you mean by this?
>>>>
>>>> My mind immediately goes to this chain of events:
>>>>
>>>>         When you use bulk_create, the existing in-memory model objects
>>>> representing the data to create do not get
>>>> updated with the primary key values that are created in the database.
>>>>
>>>>         Upon a subsequent query of the database, for the exact same set
>>>> of objects just added, those objects _will_ have
>>>> the primary key populated.
>>>>
>>>> In other words,
>>>>
>>>>         The database records themselves get the auto-increment IDs
>>>> added, they just don't get reported back in that
>>>> query to the ORM layer, therefore it takes a subsequent query to get
>>>> those ids out.
>>>>
>>>> Does that about sum it up?
>>>>
>>>
>>> Yes this describes the situation, but there is a bit more to tell. Since
>>> PostgreSQL does return the ids the subsequent query that could be done to
>>> get the ids isn't written in code today. We didn't need to because we
>>> developed it against PostgreSQL. I'm pretty sure that if you configure Pulp
>>> against MySQL Pulp won't work, which I think is a problem. So I'm observing
>>> two things here. 1) This is a hazard that causes code to unexpectedly be
>>> only compliant with PostgreSQL. 2) Pulp itself fell into this hazard and we
>>> need to fix that too
>>>
>>> Do you also see these two issues? What should be done about these?
>>>
>>>
>>>>
>>>> >
>>>> > [0]:
>>>> https://docs.djangoproject.com/en/2.1/ref/models/querysets/#bulk-create
>>>> > [1]: https://github.com/pulp/pulp/pull/3764#discussion_r234780702
>>>> > [2]: https://pulp.plan.io/issues/3848
>>>> > _______________________________________________
>>>> > Pulp-dev mailing list
>>>> > Pulp-dev at redhat.com
>>>> > https://www.redhat.com/mailman/listinfo/pulp-dev
>>>>
>>>> _______________________________________________
>>>> Pulp-dev mailing list
>>>> Pulp-dev at redhat.com
>>>> https://www.redhat.com/mailman/listinfo/pulp-dev
>>>>
>>> _______________________________________________
>>> Pulp-dev mailing list
>>> Pulp-dev at redhat.com
>>> https://www.redhat.com/mailman/listinfo/pulp-dev
>>>
>> _______________________________________________
>> Pulp-dev mailing list
>> Pulp-dev at redhat.com
>> https://www.redhat.com/mailman/listinfo/pulp-dev
>>
> _______________________________________________
> Pulp-dev mailing list
> Pulp-dev at redhat.com
> https://www.redhat.com/mailman/listinfo/pulp-dev
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/pulp-dev/attachments/20190103/a6c74422/attachment.htm>