[Pulp-dev] Concerns about bulk_create and PostgreSQL

Jeff Ortel jortel at redhat.com
Tue Jan 8 16:16:12 UTC 2019

On 1/3/19 1:28 PM, Simon Baatz wrote:
> On Thu, Jan 03, 2019 at 01:02:57PM -0500, David Davis wrote:
>>     I don't think that using integer ids with bulk_create and supporting
>>     mysql/mariadb are necessarily mutually exclusive. I think there might
>>     be a way to find the records created using bulk_create if we know the
>>     natural key. It might be more performant than using UUIDs as well.
> This assumes that there is a natural key.  For content types with no
> digest information in the meta data, there may be a natural key
> for content within a repo version only, but no natural key for the
> overall content.  (If we want to support non-immediate modes for such
> content.  In immediate mode, a digest can be computed from the
> associated artifact(s)).

Can you give some examples of Content without a natural key?

> Of course, there are ways around that (use a UUID as the "natural" key,
> or add a UUID to the repo version key fields), but I would like to
> avoid that.
>>     On Thu, Jan 3, 2019 at 11:04 AM Dennis Kliban <[1]dkliban at redhat.com>
>>     wrote:
>>     Thank you Daniel for the explanation and for filing an issue[0] to do
>>     performance analysis of UUIDs.
>>     I really hope that we can switch back to using UUIDs so we can bring
>>     back MariaDB support for Pulp 3.
>>     [0] [2]https://pulp.plan.io/issues/4290
>>     On Wed, Dec 5, 2018 at 1:35 PM Daniel Alley <[3]dalley at redhat.com>
>>     wrote:
>>     To rephrase the problem a little bit:
>>     We need to bulk_create() a bunch of objects, and then after we do that
>>     we want to immediately be able to relate them with other objects, which
>>     means we need their PKs of the objects that were just created.
>>     In the case of auto-increment integer PKs, we can't know that PK value
>>     before it gets saved into the database.  Luckily, PostgreSQL (and
>>     Oracle) support a "RETURNING" keyword that does provides this
>>     information.  The raw SQL would look something like this:
>> INSERT INTO items (name) values ('bear') RETURNING id;
>>     Django uses this feature to set the PK field on the model objects it
>>     returns when you call bulk_create() on a list of unsaved model objects.
>>     Unfortunately, MySQL doesn't support this, so there's no way to figure
>>     out what the PKs of the objects you just saved were, so the ORM can't
>>     set that information on the returned model objects.
>>     UUID PKs circumvent this because the PK gets created outside of the
>>     database, prior to being saved in the database, and so Django *can*
>>     know what the PK will be when it gets saved.
>>     On Wed, Dec 5, 2018 at 12:11 PM Brian Bouterse <[4]bbouters at redhat.com>
>>     wrote:
>>     +1 to experimentation and also making sure that we understand the
>>     performance implications of the decision. I'm replying to this earlier
>>     note to restate my observations of the problem a bit more.
>>     More ideas and thoughts are welcome. This is a decision with a lot of
>>     aspects to consider.
>>     On Tue, Nov 20, 2018 at 10:00 AM Patrick Creech <[5]pcreech at redhat.com>
>>     wrote:
>>       On Mon, 2018-11-19 at 17:08 -0500, Brian Bouterse wrote:
>>       > When we switched from UUID to integers for the PK
>>       > with databases other than PostgreSQL [0].
>>       >
>>       > With a goal of database agnosticism for Pulp3, if plugin writers
>>       plan to use bulk_create with any object inherited
>>       > from one of ours, they can't will get different behaviors on
>>       different databases and they won't have PKs that they may
>>       > require. bulk_create is a normal django thing, so plugin writers
>>       making a django plugin should be able to use it. This
>>       > concerned me already, but today it was also brought up by non-RH
>>       plugin writers also [1] in a PR.
>>       >
>>       > The tradeoffs bteween UUIDs versus PKs are pretty well summed up
>>       in our ticket where we discussed that change [2].
>>       > Note, we did not consider this bulk_create downside at that time,
>>       which I think is the most significant downside to
>>       > consider.
>>       >
>>       > Having bulk_create effectively not available for plugin writers
>>       (since we can't rely on its pks being returned) I
>>       > think is a non-starter for me. I love how short the UUIDs made our
>>       URLs so that's the tradeoff mainly in my mind.
>>       > Those balanced against each other, I think we should switch back.
>>       >
>>       > Another option is to become PostgreSQL only which (though I love
>>       psql) I think would be the wrong choice for Pulp from
>>       > what I've heard from its users.
>>       >
>>       > What do you think? What should we do?
>>       So, my mind immediately goes to this question, which might be
>>       usefull for others to help make decisions, so I'll ask:
>>       When you say:
>>       "we lost the ability to have the primary key set during bulk_create"
>>       Can you clarify what you mean by this?
>>       My mind immediately goes to this chain of events:
>>               When you use bulk_create, the existing in-memory model
>>       objects representing the data to create do not get
>>       updated with the primary key values that are created in the
>>       database.
>>               Upon a subsequent query of the database, for the exact same
>>       set of objects just added, those objects _will_ have
>>       the primary key populated.
>>       In other words,
>>               The database records themselves get the auto-increment IDs
>>       added, they just don't get reported back in that
>>       query to the ORM layer, therefore it takes a subsequent query to get
>>       those ids out.
>>       Does that about sum it up?
>>     Yes this describes the situation, but there is a bit more to tell.
>>     Since PostgreSQL does return the ids the subsequent query that could be
>>     done to get the ids isn't written in code today. We didn't need to
>>     because we developed it against PostgreSQL. I'm pretty sure that if you
>>     configure Pulp against MySQL Pulp won't work, which I think is a
>>     problem. So I'm observing two things here. 1) This is a hazard that
>>     causes code to unexpectedly be only compliant with PostgreSQL. 2) Pulp
>>     itself fell into this hazard and we need to fix that too
>>     Do you also see these two issues? What should be done about these?
>>       >
>>       > [0]:
>>       [6]https://docs.djangoproject.com/en/2.1/ref/models/querysets/#bulk-
>>       create
>>       > [1]:
>>       [7]https://github.com/pulp/pulp/pull/3764#discussion_r234780702
>>       > [2]: [8]https://pulp.plan.io/issues/3848
>>       > _______________________________________________
>>       > Pulp-dev mailing list
>>       > [9]Pulp-dev at redhat.com
>>       > [10]https://www.redhat.com/mailman/listinfo/pulp-dev
>>       _______________________________________________
>>       Pulp-dev mailing list
>>       [11]Pulp-dev at redhat.com
>>       [12]https://www.redhat.com/mailman/listinfo/pulp-dev
>>       _______________________________________________
>>       Pulp-dev mailing list
>>       [13]Pulp-dev at redhat.com
>>       [14]https://www.redhat.com/mailman/listinfo/pulp-dev
>>       _______________________________________________
>>       Pulp-dev mailing list
>>       [15]Pulp-dev at redhat.com
>>       [16]https://www.redhat.com/mailman/listinfo/pulp-dev
>>       _______________________________________________
>>       Pulp-dev mailing list
>>       [17]Pulp-dev at redhat.com
>>       [18]https://www.redhat.com/mailman/listinfo/pulp-dev
>> References
>>     1. mailto:dkliban at redhat.com
>>     2. https://pulp.plan.io/issues/4290
>>     3. mailto:dalley at redhat.com
>>     4. mailto:bbouters at redhat.com
>>     5. mailto:pcreech at redhat.com
>>     6. https://docs.djangoproject.com/en/2.1/ref/models/querysets/#bulk-create
>>     7. https://github.com/pulp/pulp/pull/3764#discussion_r234780702
>>     8. https://pulp.plan.io/issues/3848
>>     9. mailto:Pulp-dev at redhat.com
>>    10. https://www.redhat.com/mailman/listinfo/pulp-dev
>>    11. mailto:Pulp-dev at redhat.com
>>    12. https://www.redhat.com/mailman/listinfo/pulp-dev
>>    13. mailto:Pulp-dev at redhat.com
>>    14. https://www.redhat.com/mailman/listinfo/pulp-dev
>>    15. mailto:Pulp-dev at redhat.com
>>    16. https://www.redhat.com/mailman/listinfo/pulp-dev
>>    17. mailto:Pulp-dev at redhat.com
>>    18. https://www.redhat.com/mailman/listinfo/pulp-dev
>> _______________________________________________
>> Pulp-dev mailing list
>> Pulp-dev at redhat.com
>> https://www.redhat.com/mailman/listinfo/pulp-dev

More information about the Pulp-dev mailing list