[Pulp-dev] Concerns about bulk_create and PostgreSQL

Thu Jan 3 19:48:34 UTC 2019

I think adding a separate UUID column seems like a feasible workaround for
Content without a natural key. But I can understand the desire not to do so
in order to create an artificial natural key.

I guess we can weigh the performance results of using UUID PKs against how
often we expect plugin Content to have no natural key.

David

On Thu, Jan 3, 2019 at 2:28 PM Simon Baatz <gmbnomis at gmail.com> wrote:

> On Thu, Jan 03, 2019 at 01:02:57PM -0500, David Davis wrote:
> >    I don't think that using integer ids with bulk_create and supporting
> >    mysql/mariadb are necessarily mutually exclusive. I think there might
> >    be a way to find the records created using bulk_create if we know the
> >    natural key. It might be more performant than using UUIDs as well.
>
> This assumes that there is a natural key.  For content types with no
> digest information in the meta data, there may be a natural key
> for content within a repo version only, but no natural key for the
> overall content.  (If we want to support non-immediate modes for such
> content.  In immediate mode, a digest can be computed from the
> associated artifact(s)).
>
> Of course, there are ways around that (use a UUID as the "natural" key,
> or add a UUID to the repo version key fields), but I would like to
> avoid that.
>
> >    On Thu, Jan 3, 2019 at 11:04 AM Dennis Kliban <[1]dkliban at redhat.com>
> >    wrote:
> >
> >    Thank you Daniel for the explanation and for filing an issue[0] to do
> >    performance analysis of UUIDs.
> >    I really hope that we can switch back to using UUIDs so we can bring
> >    back MariaDB support for Pulp 3.
> >    [0] [2]https://pulp.plan.io/issues/4290
> >
> >    On Wed, Dec 5, 2018 at 1:35 PM Daniel Alley <[3]dalley at redhat.com>
> >    wrote:
> >
> >    To rephrase the problem a little bit:
> >    We need to bulk_create() a bunch of objects, and then after we do that
> >    we want to immediately be able to relate them with other objects,
> which
> >    means we need their PKs of the objects that were just created.
> >    In the case of auto-increment integer PKs, we can't know that PK value
> >    before it gets saved into the database.  Luckily, PostgreSQL (and
> >    Oracle) support a "RETURNING" keyword that does provides this
> >    information.  The raw SQL would look something like this:
> >
> > INSERT INTO items (name) values ('bear') RETURNING id;
> >
> >    Django uses this feature to set the PK field on the model objects it
> >    returns when you call bulk_create() on a list of unsaved model
> objects.
> >    Unfortunately, MySQL doesn't support this, so there's no way to figure
> >    out what the PKs of the objects you just saved were, so the ORM can't
> >    set that information on the returned model objects.
> >    UUID PKs circumvent this because the PK gets created outside of the
> >    database, prior to being saved in the database, and so Django *can*
> >    know what the PK will be when it gets saved.
> >
> >    On Wed, Dec 5, 2018 at 12:11 PM Brian Bouterse <[4]
> bbouters at redhat.com>
> >    wrote:
> >
> >    +1 to experimentation and also making sure that we understand the
> >    performance implications of the decision. I'm replying to this earlier
> >    note to restate my observations of the problem a bit more.
> >    More ideas and thoughts are welcome. This is a decision with a lot of
> >    aspects to consider.
> >    On Tue, Nov 20, 2018 at 10:00 AM Patrick Creech <[5]
> pcreech at redhat.com>
> >    wrote:
> >
> >      On Mon, 2018-11-19 at 17:08 -0500, Brian Bouterse wrote:
> >      > When we switched from UUID to integers for the PK
> >      > with databases other than PostgreSQL [0].
> >      >
> >      > With a goal of database agnosticism for Pulp3, if plugin writers
> >      plan to use bulk_create with any object inherited
> >      > from one of ours, they can't will get different behaviors on
> >      different databases and they won't have PKs that they may
> >      > require. bulk_create is a normal django thing, so plugin writers
> >      making a django plugin should be able to use it. This
> >      > concerned me already, but today it was also brought up by non-RH
> >      plugin writers also [1] in a PR.
> >      >
> >      > The tradeoffs bteween UUIDs versus PKs are pretty well summed up
> >      in our ticket where we discussed that change [2].
> >      > Note, we did not consider this bulk_create downside at that time,
> >      which I think is the most significant downside to
> >      > consider.
> >      >
> >      > Having bulk_create effectively not available for plugin writers
> >      (since we can't rely on its pks being returned) I
> >      > think is a non-starter for me. I love how short the UUIDs made our
> >      URLs so that's the tradeoff mainly in my mind.
> >      > Those balanced against each other, I think we should switch back.
> >      >
> >      > Another option is to become PostgreSQL only which (though I love
> >      psql) I think would be the wrong choice for Pulp from
> >      > what I've heard from its users.
> >      >
> >      > What do you think? What should we do?
> >      So, my mind immediately goes to this question, which might be
> >      usefull for others to help make decisions, so I'll ask:
> >      When you say:
> >      "we lost the ability to have the primary key set during bulk_create"
> >      Can you clarify what you mean by this?
> >      My mind immediately goes to this chain of events:
> >              When you use bulk_create, the existing in-memory model
> >      objects representing the data to create do not get
> >      updated with the primary key values that are created in the
> >      database.
> >              Upon a subsequent query of the database, for the exact same
> >      set of objects just added, those objects _will_ have
> >      the primary key populated.
> >      In other words,
> >              The database records themselves get the auto-increment IDs
> >      added, they just don't get reported back in that
> >      query to the ORM layer, therefore it takes a subsequent query to get
> >      those ids out.
> >      Does that about sum it up?
> >
> >    Yes this describes the situation, but there is a bit more to tell.
> >    Since PostgreSQL does return the ids the subsequent query that could
> be
> >    done to get the ids isn't written in code today. We didn't need to
> >    because we developed it against PostgreSQL. I'm pretty sure that if
> you
> >    configure Pulp against MySQL Pulp won't work, which I think is a
> >    problem. So I'm observing two things here. 1) This is a hazard that
> >    causes code to unexpectedly be only compliant with PostgreSQL. 2) Pulp
> >    itself fell into this hazard and we need to fix that too
> >    Do you also see these two issues? What should be done about these?
> >
> >      >
> >      > [0]:
> >      [6]
> https://docs.djangoproject.com/en/2.1/ref/models/querysets/#bulk-
> >      create
> >      > [1]:
> >      [7]https://github.com/pulp/pulp/pull/3764#discussion_r234780702
> >      > [2]: [8]https://pulp.plan.io/issues/3848
> >      > _______________________________________________
> >      > Pulp-dev mailing list
> >      > [9]Pulp-dev at redhat.com
> >      > [10]https://www.redhat.com/mailman/listinfo/pulp-dev
> >      _______________________________________________
> >      Pulp-dev mailing list
> >      [11]Pulp-dev at redhat.com
> >      [12]https://www.redhat.com/mailman/listinfo/pulp-dev
> >
> >      _______________________________________________
> >      Pulp-dev mailing list
> >      [13]Pulp-dev at redhat.com
> >      [14]https://www.redhat.com/mailman/listinfo/pulp-dev
> >
> >      _______________________________________________
> >      Pulp-dev mailing list
> >      [15]Pulp-dev at redhat.com
> >      [16]https://www.redhat.com/mailman/listinfo/pulp-dev
> >
> >      _______________________________________________
> >      Pulp-dev mailing list
> >      [17]Pulp-dev at redhat.com
> >      [18]https://www.redhat.com/mailman/listinfo/pulp-dev
> >
> > References
> >
> >    1. mailto:dkliban at redhat.com
> >    2. https://pulp.plan.io/issues/4290
> >    3. mailto:dalley at redhat.com
> >    4. mailto:bbouters at redhat.com
> >    5. mailto:pcreech at redhat.com
> >    6.
> https://docs.djangoproject.com/en/2.1/ref/models/querysets/#bulk-create
> >    7. https://github.com/pulp/pulp/pull/3764#discussion_r234780702
> >    8. https://pulp.plan.io/issues/3848
> >    9. mailto:Pulp-dev at redhat.com
> >   10. https://www.redhat.com/mailman/listinfo/pulp-dev
> >   11. mailto:Pulp-dev at redhat.com
> >   12. https://www.redhat.com/mailman/listinfo/pulp-dev
> >   13. mailto:Pulp-dev at redhat.com
> >   14. https://www.redhat.com/mailman/listinfo/pulp-dev
> >   15. mailto:Pulp-dev at redhat.com
> >   16. https://www.redhat.com/mailman/listinfo/pulp-dev
> >   17. mailto:Pulp-dev at redhat.com
> >   18. https://www.redhat.com/mailman/listinfo/pulp-dev
>
> > _______________________________________________
> > Pulp-dev mailing list
> > Pulp-dev at redhat.com
> > https://www.redhat.com/mailman/listinfo/pulp-dev
>
>
> --
> Simon Baatz <gmbnomis at gmail.com>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/pulp-dev/attachments/20190103/0b417b98/attachment.htm>