[Pulp-dev] Integer IDs in Pulp 3

David Davis daviddavis at redhat.com
Wed Jul 18 13:25:17 UTC 2018


The change to use int ids[0] has been merged. When you pull from pulp/pulp,
you’ll have to redo your migrations and recreate your database. The first
option is to recreate your dev environment if you’re running Vagrant.

Alternatively, you can remove all the migrations from pulp (rm -rf
pulpcore/pulpcore/app/migrations/*) and your plugins (rm -rf
pulp_file/pulp_file/app/migrations/*). Then regenerate and rerun them:

  pulp-manager makemigrations pulp_file
  pulp-manager makemigrations pulp_app
  pulp-manager makemigrations auth

  pulp-manager migrate auth
  pulp-manager migrate

[0] https://github.com/pulp/pulp/pull/3549

David


On Thu, Jul 12, 2018 at 8:46 AM Brian Bouterse <bbouters at redhat.com> wrote:

> I'm +1 on grooming that ticket and sprint nominating it. I commented on a
> question there about how to handle RQ.
>
> On Wed, Jul 11, 2018 at 4:53 PM, Dennis Kliban <dkliban at redhat.com> wrote:
>
>> Thanks David. I am in favor of this change.
>>
>> On Wed, Jul 11, 2018 at 4:39 PM, David Davis <daviddavis at redhat.com>
>> wrote:
>>
>>> There is now:
>>>
>>> https://pulp.plan.io/issues/3848
>>>
>>> David
>>>
>>>
>>> On Wed, Jul 11, 2018 at 4:23 PM Brian Bouterse <bbouters at redhat.com>
>>> wrote:
>>>
>>>> A 30% improvement I think is a good case for integers over uuids.
>>>>
>>>> Is there a ticket tracking that change?
>>>>
>>>> On Wed, Jul 11, 2018 at 3:55 PM, Daniel Alley <dalley at redhat.com>
>>>> wrote:
>>>>
>>>>> w/ creating 400,000 units, the non-uuid PK is 30% faster at 42.22
>>>>> seconds vs. 55.98 seconds.
>>>>>
>>>>> w/ searching through the same 400,000 units, performance is still
>>>>> about 30% faster.  Doing a filter for file content units that have a
>>>>> relative_path__startswith={some random letter} (I put UUIDs in all the
>>>>> fields) takes about 0.44 seconds if the model has a UUID pk and about 0.33
>>>>> seconds if the model has a default Django auto-incrementing PK.
>>>>>
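For anyone who wants to try a comparison like this locally, here is a minimal sketch. It is not the benchmark described above: it assumes SQLite from the Python standard library rather than Pulp, Django, or Postgres, and the `content` table, the `time_inserts` helper, and the row counts are all hypothetical. It only covers inserts, and the absolute numbers will not be comparable to the ones reported in this thread.

```python
import sqlite3
import time
import uuid


def time_inserts(rows, use_uuid):
    """Insert `rows` rows into an in-memory SQLite table and time the insert.

    use_uuid=True  -> 16-byte random UUID primary key (stored as a BLOB)
    use_uuid=False -> auto-assigned integer primary key
    Returns (elapsed_seconds, row_count).
    """
    conn = sqlite3.connect(":memory:")
    if use_uuid:
        conn.execute(
            "CREATE TABLE content (id BLOB PRIMARY KEY, relative_path TEXT)"
        )
        data = [(uuid.uuid4().bytes, str(uuid.uuid4())) for _ in range(rows)]
        sql = "INSERT INTO content (id, relative_path) VALUES (?, ?)"
    else:
        conn.execute(
            "CREATE TABLE content (id INTEGER PRIMARY KEY, relative_path TEXT)"
        )
        data = [(str(uuid.uuid4()),) for _ in range(rows)]
        sql = "INSERT INTO content (relative_path) VALUES (?)"

    # Only the insert itself is timed; generating the test data is not.
    start = time.perf_counter()
    with conn:  # commits the transaction on success
        conn.executemany(sql, data)
    elapsed = time.perf_counter() - start

    count = conn.execute("SELECT COUNT(*) FROM content").fetchone()[0]
    conn.close()
    return elapsed, count


if __name__ == "__main__":
    for label, use_uuid in (("integer pk", False), ("uuid pk", True)):
        elapsed, count = time_inserts(100_000, use_uuid)
        print(f"{label}: {count} rows in {elapsed:.2f}s")
```

Running each variant several times and comparing the spread would give a rough idea of whether any difference is outside the margin of error.
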
>>>>> On Wed, Jul 11, 2018 at 11:03 AM, Daniel Alley <dalley at redhat.com>
>>>>> wrote:
>>>>>
>>>>>> So, since I've already been working on some Pulp 3 benchmarking I
>>>>>> decided to go ahead and benchmark this to get some actual data.
>>>>>>
>>>>>> Disclaimer:  The following data is using bulk_create() with a
>>>>>> modified, flat, non-inheriting content model, not the multi-table
>>>>>> inherited content model we're currently using.  Pulp 3 also does not
>>>>>> use bulk_create() yet, but it likely will end up using it
>>>>>> eventually.
>>>>>>
>>>>>> Using normal IDs instead of UUIDs was between 13% and 25% faster with
>>>>>> 15,000 units.  15,000 units isn't really a sufficient value to actually
>>>>>> test index performance, so I'm rerunning it with a few hundred thousand
>>>>>> units, but that will take a substantial amount of time to run.  I'll follow
>>>>>> up later.
>>>>>>
>>>>>> As far as search/update performance goes, that probably has better
>>>>>> margins than just insert performance, but I'll need to write new code to
>>>>>> benchmark that properly.
>>>>>>
>>>>>> On Thu, May 24, 2018 at 11:52 AM, David Davis <daviddavis at redhat.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Agreed on performance. Some more Googling turns up mixed opinions
>>>>>>> on whether UUID performance is worse or not. If this is a
>>>>>>> significant reason to switch, I agree we should test out the performance.
>>>>>>>
>>>>>>> Regarding the disk size, I think the cost of using UUIDs is
>>>>>>> cumulative. Larger PKs mean bigger index sizes, bigger FKs, etc. I
>>>>>>> agree that it’s probably not a major concern but I wouldn’t say it’s
>>>>>>> trivial.
>>>>>>>
>>>>>>> David
>>>>>>>
>>>>>>> On Thu, May 24, 2018 at 11:27 AM, Sean Myers <sean.myers at redhat.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Responses inline.
>>>>>>>>
>>>>>>>> On 05/23/2018 02:26 PM, David Davis wrote:
>>>>>>>> > Before the release of Pulp 3.0 GA, I think it’s worth just
>>>>>>>> checking in to
>>>>>>>> > make sure we want to use UUIDs over integer based IDs. Changing
>>>>>>>> from UUIDs
>>>>>>>> > to ints would be a very easy change at this point  (1-2 lines of
>>>>>>>> code) but
>>>>>>>> > after GA ships, it would be hard if not impossible to switch.
>>>>>>>> >
>>>>>>>> > I think there are a number of reasons why we might want to
>>>>>>>> consider integer
>>>>>>>> > IDs:
>>>>>>>> >
>>>>>>>> > - Better performance all around for inserts[0], searches,
>>>>>>>> indexing, etc
>>>>>>>>
>>>>>>>> I don't really care either way, but it's worth pointing out that
>>>>>>>> UUIDs are
>>>>>>>> integers (in the sense that the entire internet can be reduced to a
>>>>>>>> single
>>>>>>>> integer since it's all just bits). To the best of my knowledge they
>>>>>>>> are equally
>>>>>>>> performant to integers and stored in similar ways in Postgres.
>>>>>>>>
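To make the "UUIDs are integers" point concrete, here is a small sketch using Python's standard `uuid` module. The UUID value is made up for illustration, and this says nothing by itself about how the two key types perform in Postgres.

```python
import uuid

# A hypothetical UUID value, chosen only for illustration.
example = uuid.UUID("12345678-1234-5678-1234-567812345678")

as_int = example.int      # the same 128 bits viewed as a single integer
as_bytes = example.bytes  # the same 128 bits as 16 raw bytes

# Converting back from the integer gives the original UUID.
round_trip = uuid.UUID(int=as_int)

print(len(as_bytes), "bytes")
print(round_trip == example)
```
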
>>>>>>>> You linked a MySQL experiment, done using a version of MySQL that
>>>>>>>> is nearly 10
>>>>>>>> years old. If there are concerns about the performance of UUID PKs
>>>>>>>> vs. int PKs
>>>>>>>> in Pulp, we should compare apples to apples and profile Pulp using
>>>>>>>> UUID PKs,
>>>>>>>> profile Pulp using integer PKs, and then compare the two.
>>>>>>>>
>>>>>>>> In my small-scale testing (100,000 randomly generated content rows
>>>>>>>> of a
>>>>>>>> proto-RPM content model, 1000 repositories randomly related to
>>>>>>>> each, no db funny
>>>>>>>> business beyond enforced uniqueness constraints), there was either
>>>>>>>> no
>>>>>>>> difference, or what difference there was fell into the margin of
>>>>>>>> error.
>>>>>>>>
>>>>>>>> > - Less storage required (4 bytes for int vs 16 bytes for UUIDs)
>>>>>>>>
>>>>>>>> Well, okay...UUIDs are *huge* integers. But it's the length of an
>>>>>>>> IPv6 address
>>>>>>>> vs. the length of an IPv4 address. While it's true that 4 < 16,
>>>>>>>> both are still
>>>>>>>> pretty small. Trivially so, I think.
>>>>>>>>
>>>>>>>> Without taking relations into account, a table with a million rows
>>>>>>>> should be a
>>>>>>>> little less than twelve mega(mebi)bytes larger. Even at scale, the
>>>>>>>> size
>>>>>>>> difference is negligible, especially when compared to the size on
>>>>>>>> disk of the
>>>>>>>> actual content you'd need to be storing that those million rows
>>>>>>>> represent.
>>>>>>>>
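The "a little less than twelve mebibytes" estimate can be checked with a few lines of arithmetic. This is only a sketch of the raw key column sizes quoted in this thread (4 bytes vs 16 bytes); it ignores indexes, foreign keys, and any per-row overhead or padding in the database, and the names below are hypothetical.

```python
INT_PK_BYTES = 4    # 32-bit integer key, as quoted in the thread
UUID_PK_BYTES = 16  # 128-bit UUID key
ROWS = 1_000_000

# Extra bytes needed for the primary key column alone.
extra = (UUID_PK_BYTES - INT_PK_BYTES) * ROWS
extra_mib = extra / 2**20  # 1 MiB = 1,048,576 bytes

print(f"{extra} bytes = {extra_mib:.2f} MiB")
```

That works out to 12,000,000 bytes, or about 11.44 MiB, for the key column of a million-row table.
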
>>>>>>>> > - Hrefs would be shorter (e.g. /pulp/api/v3/repositories/1/)
>>>>>>>> > - In line with other apps like Katello
>>>>>>>>
>>>>>>>> I think these two are definitely worth considering, though.
>>>>>>>>
>>>>>>>> > There are some downsides to consider though:
>>>>>>>> >
>>>>>>>> > - Integer ids expose info like how many records there are
>>>>>>>>
>>>>>>>> This was the main intent, if I recall correctly. UUID PKs are not:
>>>>>>>> - monotonically increasing
>>>>>>>> - variably sized (string length, not bit length)
>>>>>>>>
>>>>>>>> So an object's PK doesn't give you any indication of how many other
>>>>>>>> objects may
>>>>>>>> be in the same collection, and while the Hrefs are long, for any
>>>>>>>> given resource
>>>>>>>> they will always be a predictable size.
>>>>>>>>
>>>>>>>> The major downside is really that they're a pain in the butt to
>>>>>>>> type out when
>>>>>>>> compared to int PKs, so if users are in a situation where they do
>>>>>>>> have to type
>>>>>>>> these things out, I think something has gone wrong.
>>>>>>>>
>>>>>>>> If users typing in PKs can't be avoided, UUIDs probably should be
>>>>>>>> avoided. I
>>>>>>>> recognize that this is effectively a restatement of "Hrefs would be
>>>>>>>> shorter" in
>>>>>>>> the context of how that impacts the user.
>>>>>>>>
>>>>>>>> > - Can’t support sharding or multiple dbs (are we ever going to
>>>>>>>> need this?)
>>>>>>>>
>>>>>>>> A very good question. To the best of my recollection this was never
>>>>>>>> stated as a
>>>>>>>> hard requirement; it was only ever mentioned like it is here, as a
>>>>>>>> potential
>>>>>>>> positive side-effect of UUID keys. If collision-avoidance is not
>>>>>>>> desired, and
>>>>>>>> will certainly never be desired, then a normal integer field would
>>>>>>>> likely be a
>>>>>>>> less astonishing[0] user experience, and therefore a better user
>>>>>>>> experience.
>>>>>>>>
>>>>>>>> [0]: https://en.wikipedia.org/wiki/Principle_of_least_astonishment
>>>>>>>>
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Pulp-dev mailing list
>>>>>>>> Pulp-dev at redhat.com
>>>>>>>> https://www.redhat.com/mailman/listinfo/pulp-dev
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>
>