[Pulp-dev] Integer IDs in Pulp 3

Wed Jul 11 20:39:05 UTC 2018

There is now:

https://pulp.plan.io/issues/3848

David

On Wed, Jul 11, 2018 at 4:23 PM Brian Bouterse <bbouters at redhat.com> wrote:

> A 30% improvement I think is a good case for integers over uuids.
>
> Is there a ticket tracking that change?
>
> On Wed, Jul 11, 2018 at 3:55 PM, Daniel Alley <dalley at redhat.com> wrote:
>
>> When creating 400,000 units, the non-UUID PK is about 30% faster: 42.22
>> seconds vs. 55.98 seconds.
>>
>> When searching through the same 400,000 units, the non-UUID PK is still
>> about 30% faster. A filter for file content units that have a
>> relative_path__startswith={some random letter} (I put UUIDs in all the
>> fields) takes about 0.44 seconds if the model has a UUID PK and about 0.33
>> seconds if the model has a default Django auto-incrementing PK.
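
For reference, "30% faster" can be read two ways, and the timings quoted above fall on either side of it depending on which one is meant. A minimal sketch of the arithmetic, using only the four numbers from the message (the function names are made up for illustration):

```python
def time_saved_pct(slow, fast):
    """Percentage of wall-clock time saved by the faster run."""
    return (slow - fast) / slow * 100


def throughput_gain_pct(slow, fast):
    """Percentage more work per second done by the faster run."""
    return (slow / fast - 1) * 100


# Timings in seconds as reported in the thread: (UUID PK, integer PK).
timings = {
    "create 400,000 units": (55.98, 42.22),
    "startswith filter": (0.44, 0.33),
}

for label, (uuid_s, int_s) in timings.items():
    saved = time_saved_pct(uuid_s, int_s)
    gain = throughput_gain_pct(uuid_s, int_s)
    print(f"{label}: {saved:.1f}% less time, {gain:.1f}% more throughput")
```

Either reading supports the same conclusion; the two numbers just bracket the "about 30%" quoted.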
>>
>> On Wed, Jul 11, 2018 at 11:03 AM, Daniel Alley <dalley at redhat.com> wrote:
>>
>>> So, since I've already been working on some Pulp 3 benchmarking I
>>> decided to go ahead and benchmark this to get some actual data.
>>>
>>> Disclaimer: The following data comes from a modified, flat,
>>> non-inheriting content model, not the multi-table inherited content model
>>> we're currently using. It also uses bulk_create(), which we are not
>>> currently using in Pulp 3 but will likely end up using eventually.
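
The general shape of such a benchmark can be sketched with the standard library alone. This is not the code used for the numbers in this thread: it swaps in SQLite (via sqlite3) for Postgres and executemany() for Django's bulk_create(), and the table and column names are assumptions, so absolute timings will not be comparable.

```python
import sqlite3
import time
import uuid
from typing import Tuple


def time_bulk_insert(use_uuid_pk: bool, n_rows: int) -> Tuple[float, int]:
    """Insert n_rows into a fresh in-memory table; return (seconds, row count)."""
    conn = sqlite3.connect(":memory:")
    if use_uuid_pk:
        # Hypothetical flat table keyed by a 36-character UUID string.
        conn.execute(
            "CREATE TABLE content (id TEXT PRIMARY KEY, relative_path TEXT)"
        )
        rows = [(str(uuid.uuid4()), str(uuid.uuid4())) for _ in range(n_rows)]
        sql = "INSERT INTO content (id, relative_path) VALUES (?, ?)"
    else:
        # Same table, but with an auto-incrementing integer key.
        conn.execute(
            "CREATE TABLE content ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, relative_path TEXT)"
        )
        rows = [(str(uuid.uuid4()),) for _ in range(n_rows)]
        sql = "INSERT INTO content (relative_path) VALUES (?)"

    start = time.perf_counter()
    with conn:  # one transaction for the whole batch
        conn.executemany(sql, rows)
    elapsed = time.perf_counter() - start

    count = conn.execute("SELECT COUNT(*) FROM content").fetchone()[0]
    conn.close()
    return elapsed, count


for use_uuid in (False, True):
    seconds, count = time_bulk_insert(use_uuid, 15000)
    kind = "uuid" if use_uuid else "int"
    print(f"{kind} pk: {count} rows in {seconds:.3f}s")
```

Row generation is kept outside the timed region so that only the insert itself is measured.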
>>>
>>> Using normal IDs instead of UUIDs was between 13% and 25% faster with
>>> 15,000 units. 15,000 units isn't really enough to test index
>>> performance, so I'm rerunning it with a few hundred thousand units, but
>>> that will take a substantial amount of time to run. I'll follow up later.
>>>
>>> As far as search/update performance goes, that probably has better
>>> margins than just insert performance, but I'll need to write new code to
>>> benchmark that properly.
>>>
>>> On Thu, May 24, 2018 at 11:52 AM, David Davis <daviddavis at redhat.com>
>>> wrote:
>>>
>>>> Agreed on performance. More Googling turns up mixed opinions on whether
>>>> UUID performance is worse or not. If this is a significant reason to
>>>> switch, I agree we should test out the performance.
>>>>
>>>> Regarding the disk size, I think the cost of using UUIDs is cumulative.
>>>> Larger PKs mean bigger index sizes, bigger FKs, etc. I agree that it’s
>>>> probably not a major concern but I wouldn’t say it’s trivial.
>>>>
>>>> David
>>>>
>>>> On Thu, May 24, 2018 at 11:27 AM, Sean Myers <sean.myers at redhat.com>
>>>> wrote:
>>>>
>>>>> Responses inline.
>>>>>
>>>>> On 05/23/2018 02:26 PM, David Davis wrote:
>>>>> > Before the release of Pulp 3.0 GA, I think it’s worth just checking in
>>>>> > to make sure we want to use UUIDs over integer based IDs. Changing
>>>>> > from UUIDs to ints would be a very easy change at this point (1-2
>>>>> > lines of code) but after GA ships, it would be hard if not impossible
>>>>> > to switch.
>>>>> >
>>>>> > I think there are a number of reasons why we might want to consider
>>>>> > integer IDs:
>>>>> >
>>>>> > - Better performance all around for inserts[0], searches, indexing, etc
>>>>>
>>>>> I don't really care either way, but it's worth pointing out that UUIDs
>>>>> are integers (in the sense that the entire internet can be reduced to a
>>>>> single integer since it's all just bits). To the best of my knowledge
>>>>> they are equally performant to integers and stored in similar ways in
>>>>> Postgres.
>>>>>
>>>>> You linked a MySQL experiment, done using a version of MySQL that is
>>>>> nearly 10 years old. If there are concerns about the performance of UUID
>>>>> PKs vs. int PKs in Pulp, we should compare apples to apples and profile
>>>>> Pulp using UUID PKs, profile Pulp using integer PKs, and then compare
>>>>> the two.
>>>>>
>>>>> In my small-scale testing (100,000 randomly generated content rows of a
>>>>> proto-RPM content model, 1000 repositories randomly related to each, no
>>>>> db funny business beyond enforced uniqueness constraints), there was
>>>>> either no difference, or what difference there was fell into the margin
>>>>> of error.
>>>>>
>>>>> > - Less storage required (4 bytes for int vs 16 bytes for UUIDs)
>>>>>
>>>>> Well, okay...UUIDs are *huge* integers. But it's the length of an IPv6
>>>>> address vs. the length of an IPv4 address. While it's true that 4 < 16,
>>>>> both are still pretty small. Trivially so, I think.
>>>>>
>>>>> Without taking relations into account, a table with a million rows
>>>>> should be a little less than twelve mega(mebi)bytes larger. Even at
>>>>> scale, the size difference is negligible, especially when compared to
>>>>> the size on disk of the actual content you'd need to be storing that
>>>>> those million rows represent.
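
The "a little less than twelve" estimate can be checked directly. A short sketch using only the sizes quoted in this thread (4 bytes per int, 16 bytes per UUID); it counts the key column alone and deliberately ignores index overhead, row alignment, and foreign keys, which would each add to the total:

```python
UUID_BYTES = 16  # size of a UUID key, as quoted in the thread
INT_BYTES = 4    # size of an integer key, as quoted in the thread


def extra_bytes(rows, key_columns=1):
    """Extra bytes needed when key_columns columns hold UUIDs instead of ints."""
    return (UUID_BYTES - INT_BYTES) * rows * key_columns


def to_mib(n_bytes):
    """Convert a byte count to mebibytes."""
    return n_bytes / 2**20


million = 1000000
print(extra_bytes(million))                 # key column only
print(round(to_mib(extra_bytes(million)), 2))
# With one hypothetical foreign-key column pointing at the table as well:
print(round(to_mib(extra_bytes(million, key_columns=2)), 2))
```

The key_columns parameter is an illustration of the "cumulative" point raised elsewhere in the thread: every FK column that references the table pays the same 12 bytes per row again.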
>>>>>
>>>>> > - Hrefs would be shorter (e.g. /pulp/api/v3/repositories/1/)
>>>>> > - In line with other apps like Katello
>>>>>
>>>>> I think these two are definitely worth considering, though.
>>>>>
>>>>> > There are some downsides to consider though:
>>>>> >
>>>>> > - Integer ids expose info like how many records there are
>>>>>
>>>>> This was the main intent, if I recall correctly. UUID PKs are not:
>>>>> - monotonically increasing
>>>>> - variably sized (string length, not bit length)
>>>>>
>>>>> So an object's PK doesn't give you any indication of how many other
>>>>> objects may be in the same collection, and while the Hrefs are long, for
>>>>> any given resource they will always be a predictable size.
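
Both properties are easy to see with Python's uuid module. In this sketch the href() helper is hypothetical; only the /pulp/api/v3/repositories/ prefix comes from the thread, and random (version 4) UUIDs are assumed:

```python
import uuid


def href(pk):
    # Hypothetical helper; only the URL prefix is taken from the thread.
    return f"/pulp/api/v3/repositories/{pk}/"


# Integer PKs: the href grows with the number of digits, and the newest
# PK is a rough upper bound on how many rows have ever been created.
int_href_lengths = [len(href(pk)) for pk in (1, 1000, 1000000)]

# UUID PKs: the canonical string form is always 36 characters, so every
# href for a given resource type has the same length, and the value says
# nothing about how many other rows exist.
uuid_href_lengths = [len(href(uuid.uuid4())) for _ in range(3)]

print(int_href_lengths)
print(uuid_href_lengths)
```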
>>>>>
>>>>> The major downside is really that they're a pain in the butt to type out
>>>>> when compared to int PKs, so if users are in a situation where they do
>>>>> have to type these things out, I think something has gone wrong.
>>>>>
>>>>> If users typing in PKs can't be avoided, UUIDs probably should be
>>>>> avoided. I recognize that this is effectively a restatement of "Hrefs
>>>>> would be shorter" in the context of how that impacts the user.
>>>>>
>>>>> > - Can’t support sharding or multiple dbs (are we ever going to need
>>>>> > this?)
>>>>>
>>>>> A very good question. To the best of my recollection this was never
>>>>> stated as a hard requirement; it was only ever mentioned like it is
>>>>> here, as a potential positive side-effect of UUID keys. If
>>>>> collision-avoidance is not desired, and will certainly never be desired,
>>>>> then a normal integer field would likely be a less astonishing[0] user
>>>>> experience, and therefore a better user experience.
>>>>>
>>>>> [0]: https://en.wikipedia.org/wiki/Principle_of_least_astonishment
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Pulp-dev mailing list
>>>>> Pulp-dev at redhat.com
>>>>> https://www.redhat.com/mailman/listinfo/pulp-dev
>>>>>
>>>>>
>>>>
>>>
>>
>