[Pulp-dev] Integer IDs in Pulp 3

Thu May 24 15:27:01 UTC 2018

Responses inline.

On 05/23/2018 02:26 PM, David Davis wrote:
> Before the release of Pulp 3.0 GA, I think it’s worth just checking in to
> make sure we want to use UUIDs over integer based IDs. Changing from UUIDs
> to ints would be a very easy change at this point (1-2 lines of code) but
> after GA ships, it would be hard if not impossible to switch.
> 
> I think there are a number of reasons why we might want to consider integer
> IDs:
> 
> - Better performance all around for inserts[0], searches, indexing, etc

I don't really care either way, but it's worth pointing out that UUIDs are
integers (in the sense that the entire internet can be reduced to a single
integer since it's all just bits). To the best of my knowledge they perform as
well as integers and are stored in similar ways in Postgres.
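
To illustrate the "UUIDs are integers" point, here is a minimal Python sketch
using only the standard library. The UUID value is arbitrary and chosen purely
for illustration; nothing here is Pulp code.

```python
import uuid

# An arbitrary UUID, used only for illustration.
u = uuid.UUID("12345678-1234-5678-1234-567812345678")

as_int = u.int   # the same value viewed as a plain (128-bit) integer
raw = u.bytes    # the 16 raw bytes behind the 36-character string form

print(as_int.bit_length() <= 128)   # the value fits in 128 bits
print(len(raw))                     # 16 bytes
print(uuid.UUID(int=as_int) == u)   # converting back gives the same UUID
```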

You linked a MySQL experiment, done using a version of MySQL that is nearly 10
years old. If there are concerns about the performance of UUID PKs vs. int PKs
in Pulp, we should compare apples to apples and profile Pulp using UUID PKs,
profile Pulp using integer PKs, and then compare the two.

In my small-scale testing (100,000 randomly generated content rows of a
proto-RPM content model, 1000 repositories randomly related to each, no db funny
business beyond enforced uniqueness constraints), there was either no
difference, or what difference there was fell into the margin of error.
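
For anyone who wants to try a like-for-like comparison, the shape of such a test
can be sketched as below. This is only a toy harness under loud assumptions: it
uses SQLite from the Python standard library rather than Postgres, the table and
column names are made up, and it is not the test described above. Timings from
it say nothing about Pulp; a real comparison would need to run against Pulp's
own models and database.

```python
import sqlite3
import time
import uuid


def time_inserts(make_pk, n=10_000):
    """Insert n rows keyed by make_pk(i); return (seconds, row_count).

    Hypothetical helper: the "content" table is illustrative only.
    """
    conn = sqlite3.connect(":memory:")
    # Same schema for both key types, so only the key values differ.
    conn.execute("CREATE TABLE content (pk BLOB PRIMARY KEY, name TEXT UNIQUE)")
    rows = [(make_pk(i), f"pkg-{i}") for i in range(n)]
    start = time.perf_counter()
    with conn:  # one transaction, committed on exit
        conn.executemany("INSERT INTO content VALUES (?, ?)", rows)
    elapsed = time.perf_counter() - start
    count = conn.execute("SELECT COUNT(*) FROM content").fetchone()[0]
    conn.close()
    return elapsed, count


int_seconds, int_rows = time_inserts(lambda i: i + 1)
uuid_seconds, uuid_rows = time_inserts(lambda i: uuid.uuid4().bytes)
print(f"int PKs:  {int_rows} rows in {int_seconds:.4f}s")
print(f"UUID PKs: {uuid_rows} rows in {uuid_seconds:.4f}s")
```

Running it several times and looking at the spread is the only way to tell a
real difference from noise.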

> - Less storage required (4 bytes for int vs 16 bytes for UUIDs)

Well, okay...UUIDs are *huge* integers. But it's the length of an IPv6 address
vs. the length of an IPv4 address. While it's true that 4 < 16, both are still
pretty small. Trivially so, I think.

Without taking relations into account, a table with a million rows should be
about 12 megabytes (a little under 12 mebibytes) larger, at 12 extra bytes per
row. Even at scale, the size difference is negligible, especially when compared
to the size on disk of the actual content those million rows represent.
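
The arithmetic behind that estimate, as a short sketch. It assumes a 4-byte
integer key and a 16-byte UUID key (the sizes quoted above) and ignores
indexes, foreign key columns, and any per-row overhead or alignment in the
database.

```python
ROWS = 1_000_000
INT_PK_BYTES = 4     # 32-bit integer key
UUID_PK_BYTES = 16   # 128-bit UUID key

extra_bytes = ROWS * (UUID_PK_BYTES - INT_PK_BYTES)
extra_mb = extra_bytes / 10**6    # megabytes (decimal)
extra_mib = extra_bytes / 2**20   # mebibytes (binary)

print(f"{extra_bytes} bytes = {extra_mb:.1f} MB = {extra_mib:.2f} MiB")
# prints: 12000000 bytes = 12.0 MB = 11.44 MiB
```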

> - Hrefs would be shorter (e.g. /pulp/api/v3/repositories/1/)
> - In line with other apps like Katello

I think these two are definitely worth considering, though.

> There are some downsides to consider though:
> 
> - Integer ids expose info like how many records there are

This was the main intent, if I recall correctly. UUID PKs are not:
- monotonically increasing
- variably sized (string length, not bit length)

So an object's PK doesn't give you any indication of how many other objects may
be in the same collection, and while the Hrefs are long, for any given resource
they will always be a predictable size.
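
A small sketch of the "predictable size" point. The base path is taken from the
example href quoted above; the href() helper is hypothetical and only
illustrates string lengths, not how Pulp actually builds its URLs.

```python
import uuid

BASE = "/pulp/api/v3/repositories/"


def href(pk):
    """Hypothetical helper: build a resource href from a primary key."""
    return f"{BASE}{pk}/"


# Integer PKs: the href grows as the key gains digits.
int_lengths = {len(href(pk)) for pk in (1, 42, 100000)}

# UUID PKs: the canonical string form is always 36 characters,
# so every href has the same length.
uuid_lengths = {len(href(uuid.uuid4())) for _ in range(100)}

print(sorted(int_lengths))
print(sorted(uuid_lengths))
```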

The major downside is really that they're a pain in the butt to type out when
compared to int PKs, so if users are in a situation where they do have to type
these things out, I think something has gone wrong.

If users typing in PKs can't be avoided, UUIDs probably should be avoided. I
recognize that this is effectively a restatement of "Hrefs would be shorter" in
the context of how that impacts the user.

> - Can’t support sharding or multiple dbs (are we ever going to need this?)

A very good question. To the best of my recollection this was never stated as a
hard requirement; it was only ever mentioned like it is here, as a potential
positive side-effect of UUID keys. If collision-avoidance is not desired, and
will certainly never be desired, then a normal integer field would likely be a
less astonishing[0] user experience, and therefore a better user experience.

[0]: https://en.wikipedia.org/wiki/Principle_of_least_astonishment
