[Pulp-dev] Integer IDs in Pulp 3

Wed Jul 11 19:55:44 UTC 2018

When creating 400,000 units, the non-UUID PK is about 30% faster: 42.22
seconds vs. 55.98 seconds.

Searching through the same 400,000 units is also about 30% faster.  A
filter for file content units that have a
relative_path__startswith={some random letter} (I put UUIDs in all the
fields) takes about 0.44 seconds if the model has a UUID PK and about 0.33
seconds if the model has a default Django auto-incrementing PK.
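
For reference, "about 30%" can be read two ways, and the timings above fall on either side of it depending on which one is meant. The snippet below is only arithmetic on the four numbers quoted in this message; the helper name `compare` is made up for illustration.

```python
def compare(slow_seconds, fast_seconds):
    """Return (fraction of time saved, throughput gain) for two timings."""
    time_saved = (slow_seconds - fast_seconds) / slow_seconds
    throughput_gain = slow_seconds / fast_seconds - 1
    return time_saved, throughput_gain


# Timings reported above: (UUID PK seconds, integer PK seconds).
create_saved, create_gain = compare(55.98, 42.22)
search_saved, search_gain = compare(0.44, 0.33)

print(f"create: {create_saved:.1%} less time, {create_gain:.1%} more throughput")
print(f"search: {search_saved:.1%} less time, {search_gain:.1%} more throughput")
```

Both cases come out at roughly 25% less wall-clock time, or roughly 33% more units per second.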

On Wed, Jul 11, 2018 at 11:03 AM, Daniel Alley <dalley at redhat.com> wrote:

> So, since I've already been working on some Pulp 3 benchmarking I decided
> to go ahead and benchmark this to get some actual data.
>
> Disclaimer:  The following data comes from bulk_create() with a modified,
> flat, non-inheriting content model, not the multi-table inherited content
> model we currently use.  Pulp 3 does not use bulk_create() yet either, but
> it likely will end up using it eventually.
>
> Using normal IDs instead of UUIDs was between 13% and 25% faster with
> 15,000 units.  15,000 units isn't really a sufficient value to actually
> test index performance, so I'm rerunning it with a few hundred thousand
> units, but that will take a substantial amount of time to run.  I'll follow
> up later.
>
> As far as search/update performance goes, that probably has better margins
> than just insert performance, but I'll need to write new code to benchmark
> that properly.
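
A rough way to try this kind of insert comparison without a Pulp checkout is sketched below. It is not the benchmark described above: it uses Python's built-in sqlite3 module with an in-memory database instead of Django's bulk_create() on Postgres, it stores UUIDs as text rather than in a native 16-byte column, and the table layout, the `time_inserts` helper and the row count are all made up for illustration. Treat whatever it prints as a check of the method, not as evidence about Pulp.

```python
import sqlite3
import time
import uuid


def time_inserts(use_uuid_pk, rows=15_000):
    """Time a bulk insert into a flat table with either a UUID or integer PK."""
    conn = sqlite3.connect(":memory:")
    pk_type = "TEXT" if use_uuid_pk else "INTEGER"
    conn.execute(
        f"CREATE TABLE content (id {pk_type} PRIMARY KEY, relative_path TEXT)"
    )
    if use_uuid_pk:
        data = [(str(uuid.uuid4()), str(uuid.uuid4())) for _ in range(rows)]
    else:
        # None lets SQLite assign the next integer key itself.
        data = [(None, str(uuid.uuid4())) for _ in range(rows)]

    # Only the insert is timed; generating the row data happens above.
    start = time.perf_counter()
    with conn:  # a single transaction, similar in spirit to a bulk insert
        conn.executemany("INSERT INTO content VALUES (?, ?)", data)
    elapsed = time.perf_counter() - start

    count = conn.execute("SELECT COUNT(*) FROM content").fetchone()[0]
    conn.close()
    return elapsed, count


if __name__ == "__main__":
    for label, flag in (("integer PK", False), ("UUID PK", True)):
        elapsed, count = time_inserts(flag)
        print(f"{label}: {count} rows in {elapsed:.3f}s")
```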
>
> On Thu, May 24, 2018 at 11:52 AM, David Davis <daviddavis at redhat.com>
> wrote:
>
>> Agreed on performance. Some more Googling turns up mixed opinions on
>> whether UUID performance is worse or not. If this is a significant
>> reason to switch, I agree we should test out the performance.
>>
>> Regarding the disk size, I think the cost of using UUIDs is cumulative.
>> Larger PKs mean bigger index sizes, bigger FKs, etc. I agree that it’s
>> probably not a major concern but I wouldn’t say it’s trivial.
>>
>> David
>>
>> On Thu, May 24, 2018 at 11:27 AM, Sean Myers <sean.myers at redhat.com>
>> wrote:
>>
>>> Responses inline.
>>>
>>> On 05/23/2018 02:26 PM, David Davis wrote:
>>> > Before the release of Pulp 3.0 GA, I think it’s worth just checking in to
>>> > make sure we want to use UUIDs over integer based IDs. Changing from UUIDs
>>> > to ints would be a very easy change at this point (1-2 lines of code) but
>>> > after GA ships, it would be hard if not impossible to switch.
>>> >
>>> > I think there are a number of reasons why we might want to consider integer
>>> > IDs:
>>> >
>>> > - Better performance all around for inserts[0], searches, indexing, etc
>>>
>>> I don't really care either way, but it's worth pointing out that UUIDs are
>>> integers (in the sense that the entire internet can be reduced to a single
>>> integer since it's all just bits). To the best of my knowledge they are equally
>>> performant to integers and stored in similar ways in Postgres.
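
The "UUIDs are integers" point can be checked with Python's standard uuid module. This is a small sketch; the UUID value is an arbitrary example, not one taken from Pulp.

```python
import uuid

u = uuid.UUID("12345678-1234-5678-1234-567812345678")  # arbitrary example value

as_int = u.int                      # the same value as a plain Python integer
print(as_int.bit_length() <= 128)   # True: it fits in 128 bits
print(len(u.bytes))                 # 16: the raw storage size in bytes
print(uuid.UUID(int=as_int) == u)   # True: the conversion round-trips
```

The 36-character dashed string is only a display format; the value underneath is one 128-bit number.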
>>>
>>> You linked a MySQL experiment, done using a version of MySQL that is nearly 10
>>> years old. If there are concerns about the performance of UUID PKs vs. int PKs
>>> in Pulp, we should compare apples to apples and profile Pulp using UUID PKs,
>>> profile Pulp using integer PKs, and then compare the two.
>>>
>>> In my small-scale testing (100,000 randomly generated content rows of a
>>> proto-RPM content model, 1000 repositories randomly related to each, no db funny
>>> business beyond enforced uniqueness constraints), there was either no
>>> difference, or what difference there was fell into the margin of error.
>>>
>>> > - Less storage required (4 bytes for int vs 16 bytes for UUIDs)
>>>
>>> Well, okay...UUIDs are *huge* integers. But it's the length of an IPv6 address
>>> vs. the length of an IPv4 address. While it's true that 4 < 16, both are still
>>> pretty small. Trivially so, I think.
>>>
>>> Without taking relations into account, a table with a million rows should be a
>>> little less than twelve mega(mebi)bytes larger. Even at scale, the size
>>> difference is negligible, especially when compared to the size on disk of the
>>> actual content you'd need to be storing that those million rows represent.
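
The "a little less than twelve mebibytes" figure follows directly from the key sizes in the bullet above. The sketch counts the raw primary key column only; index entries and foreign-key copies of the key would add to it.

```python
INT_PK_BYTES = 4     # 32-bit integer key, as in the bullet above
UUID_PK_BYTES = 16   # 128-bit UUID key
ROWS = 1_000_000

extra_bytes = (UUID_PK_BYTES - INT_PK_BYTES) * ROWS
extra_mib = extra_bytes / 2**20

print(f"{extra_bytes:,} extra bytes = {extra_mib:.2f} MiB")
```

That is 12,000,000 bytes, or about 11.44 MiB, per million rows.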
>>>
>>> > - Hrefs would be shorter (e.g. /pulp/api/v3/repositories/1/)
>>> > - In line with other apps like Katello
>>>
>>> I think these two are definitely worth considering, though.
>>>
>>> > There are some downsides to consider though:
>>> >
>>> > - Integer ids expose info like how many records there are
>>>
>>> This was the main intent, if I recall correctly. UUID PKs are not:
>>> - monotonically increasing
>>> - variably sized (string length, not bit length)
>>>
>>> So an object's PK doesn't give you any indication of how many other objects may
>>> be in the same collection, and while the Hrefs are long, for any given resource
>>> they will always be a predictable size.
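
A short illustration of the "predictable size" point, using the path prefix from the example href earlier in the thread. The integer values are arbitrary, and the UUIDs are freshly generated random ones, so the printed hrefs will differ from run to run.

```python
import uuid

BASE = "/pulp/api/v3/repositories/"   # path prefix taken from the example href

# Integer keys: the href gets longer as more objects are created.
for pk in (1, 42, 400_000):
    href = f"{BASE}{pk}/"
    print(len(href), href)

# UUID keys: the canonical string form is always 36 characters long.
for _ in range(3):
    href = f"{BASE}{uuid.uuid4()}/"
    print(len(href), href)
```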
>>>
>>> The major downside is really that they're a pain in the butt to type out when
>>> compared to int PKs, so if users are in a situation where they do have to type
>>> these things out, I think something has gone wrong.
>>>
>>> If users typing in PKs can't be avoided, UUIDs probably should be avoided. I
>>> recognize that this is effectively a restatement of "Hrefs would be shorter" in
>>> the context of how that impacts the user.
>>>
>>> > - Can’t support sharding or multiple dbs (are we ever going to need this?)
>>>
>>> A very good question. To the best of my recollection this was never stated as a
>>> hard requirement; it was only ever mentioned like it is here, as a potential
>>> positive side-effect of UUID keys. If collision-avoidance is not desired, and
>>> will certainly never be desired, then a normal integer field would likely be a
>>> less astonishing[0] user experience, and therefore a better user experience.
>>>
>>> [0]: https://en.wikipedia.org/wiki/Principle_of_least_astonishment
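
To make the collision-avoidance point concrete, here is a toy simulation of two databases that hand out keys without coordinating. It is a sketch only: `make_int_ids` and `make_uuid_ids` are made-up helpers, and no real database or sharding setup is involved.

```python
import itertools
import uuid


def make_int_ids(count):
    """Simulate one database handing out auto-incrementing keys from 1."""
    counter = itertools.count(1)
    return {next(counter) for _ in range(count)}


def make_uuid_ids(count):
    """Simulate one database handing out random (version 4) UUID keys."""
    return {uuid.uuid4() for _ in range(count)}


# Two databases that never talk to each other.
int_overlap = make_int_ids(1000) & make_int_ids(1000)
uuid_overlap = make_uuid_ids(1000) & make_uuid_ids(1000)

print(len(int_overlap))    # every integer key exists in both databases
print(len(uuid_overlap))   # a shared random UUID is astronomically unlikely
```

Merging the two integer-keyed sets would require renumbering one of them, which is the cost the sharding bullet refers to.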
>>>
>>>
>>> _______________________________________________
>>> Pulp-dev mailing list
>>> Pulp-dev at redhat.com
>>> https://www.redhat.com/mailman/listinfo/pulp-dev
>>>
>>>