[Pulp-dev] Integer IDs in Pulp 3

Thu Jul 19 14:20:04 UTC 2018

The PK for a task record in the db does not need to be the same as the 
job ID in rq/redis.  Consistency is good.  Let's make Task.id an int 
(like the rest of the tables) and add a job_id to correlate with rq/redis.
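
To make the idea concrete, here is a rough sketch using plain sqlite3
rather than Pulp's actual Django models. The table layout, the "state"
column, and the helper names create_task() and task_id_for_job() are
assumptions for illustration only. The point is just that the integer PK
and the UUID job ID can live side by side, with job_id used only for the
rq/redis lookup.

```python
import sqlite3
import uuid

# Hypothetical schema (not Pulp's real one): an integer primary key like
# the other tables, plus a job_id column holding the UUID given to rq.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE task (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        job_id TEXT NOT NULL UNIQUE,
        state TEXT NOT NULL DEFAULT 'waiting'
    )
    """
)


def create_task(conn):
    """Insert a task row and return (integer PK, job UUID string)."""
    job_id = str(uuid.uuid4())
    cur = conn.execute("INSERT INTO task (job_id) VALUES (?)", (job_id,))
    conn.commit()
    return cur.lastrowid, job_id


def task_id_for_job(conn, job_id):
    """Map an rq/redis job ID back to the task's integer PK, or None."""
    row = conn.execute(
        "SELECT id FROM task WHERE job_id = ?", (job_id,)
    ).fetchone()
    return row[0] if row else None
```

The job_id value would be what gets handed to rq when the job is
enqueued; everything else (hrefs, FKs) would use the integer id.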

On 07/11/2018 03:20 PM, David Davis wrote:
> I actually started working on converting IDs from UUIDs to integer 
> IDs. It was pretty easy with one exception. Jobs in rq/redis are 
> created using task id[0] and this job id needs to be a uuid. I see two 
> possible solutions:
>
> 1. We leave task id as a UUID but every other id is an integer
> 2. We add a job uuid field on task
>
> With the hard numbers that show that integer IDs are significantly 
> faster, I think we should proceed unless anyone has a major objection.
>
> Great work on this btw.
>
> [0] 
> https://github.com/pulp/pulp/blob/9bfc50d90a24c9d0ac4a93f5718187515b947058/pulpcore/pulpcore/tasking/tasks.py#L187
>
> David
>
>
> On Wed, Jul 11, 2018 at 3:56 PM Daniel Alley <dalley at redhat.com>
> wrote:
>
>     w/ creating 400,000 units, the non-uuid PK is 30% faster at 42.22
>     seconds vs. 55.98 seconds.
>
>     w/ searching through the same 400,000 units, performance is still
>     about 30% faster.  Doing a filter for file content units that have
>     a relative_path__startswith={some random letter} (I put UUIDs in
>     all the fields) takes about 0.44 seconds if the model has a UUID
>     pk and about 0.33 seconds if the model has a default Django
>     auto-incrementing PK.
>
>     On Wed, Jul 11, 2018 at 11:03 AM, Daniel Alley
>     <dalley at redhat.com> wrote:
>
>         So, since I've already been working on some Pulp 3
>         benchmarking I decided to go ahead and benchmark this to get
>         some actual data.
>
>         Disclaimer:  The following data uses bulk_create() with a
>         modified, flat, non-inheriting content model, not the
>         multi-table inherited content model we're currently using.
>         Pulp 3 also doesn't use bulk_create() yet, but likely will
>         end up using it eventually.
>
>         Using normal IDs instead of UUIDs was between 13% and 25%
>         faster with 15,000 units.  15,000 units isn't really enough
>         to test index performance, so I'm rerunning it with a few
>         hundred thousand units, but that will take a substantial
>         amount of time to run.  I'll follow up later.
>
>         As far as search/update performance goes, that probably has
>         better margins than just insert performance, but I'll need to
>         write new code to benchmark that properly.
>
>         On Thu, May 24, 2018 at 11:52 AM, David Davis
>         <daviddavis at redhat.com> wrote:
>
>             Agreed on performance. Some more Googling turns up mixed
>             opinions on whether UUID performance is worse or not. If
>             this is a significant reason to switch, I agree we should
>             test out the performance.
>
>             Regarding the disk size, I think the cost of using UUIDs
>             is cumulative. Larger PKs mean bigger index sizes, bigger
>             FKs, etc. I agree that it’s probably not a major concern
>             but I wouldn’t say it’s trivial.
>
>             David
>
>             On Thu, May 24, 2018 at 11:27 AM, Sean Myers
>             <sean.myers at redhat.com> wrote:
>
>                 Responses inline.
>
>                 On 05/23/2018 02:26 PM, David Davis wrote:
>                 > Before the release of Pulp 3.0 GA, I think it’s worth just
>                 > checking in to make sure we want to use UUIDs over integer
>                 > based IDs. Changing from UUIDs to ints would be a very easy
>                 > change at this point (1-2 lines of code) but after GA
>                 > ships, it would be hard if not impossible to switch.
>                 >
>                 > I think there are a number of reasons why we might want to
>                 > consider integer IDs:
>                 >
>                 > - Better performance all around for inserts[0], searches,
>                 >   indexing, etc
>
>                 I don't really care either way, but it's worth pointing out
>                 that UUIDs are integers (in the sense that the entire
>                 internet can be reduced to a single integer since it's all
>                 just bits). To the best of my knowledge they are equally
>                 performant to integers and stored in similar ways in
>                 Postgres.
>
>                 You linked a MySQL experiment, done using a version of MySQL
>                 that is nearly 10 years old. If there are concerns about the
>                 performance of UUID PKs vs. int PKs in Pulp, we should
>                 compare apples to apples and profile Pulp using UUID PKs,
>                 profile Pulp using integer PKs, and then compare the two.
>
>                 In my small-scale testing (100,000 randomly generated
>                 content rows of a proto-RPM content model, 1000 repositories
>                 randomly related to each, no db funny business beyond
>                 enforced uniqueness constraints), there was either no
>                 difference, or what difference there was fell into the
>                 margin of error.
>
>                 > - Less storage required (4 bytes for int vs 16 bytes for
>                 >   UUIDs)
>
>                 Well, okay...UUIDs are *huge* integers. But it's the length
>                 of an IPv6 address vs. the length of an IPv4 address. While
>                 it's true that 4 < 16, both are still pretty small.
>                 Trivially so, I think.
>
>                 Without taking relations into account, a table with a
>                 million rows should be a little less than twelve
>                 mega(mebi)bytes larger. Even at scale, the size difference
>                 is negligible, especially when compared to the size on disk
>                 of the actual content you'd need to be storing that those
>                 million rows represent.
>
>                 > - Hrefs would be shorter (e.g. /pulp/api/v3/repositories/1/)
>                 > - In line with other apps like Katello
>
>                 I think these two are definitely worth considering, though.
>
>                 > There are some downsides to consider though:
>                 >
>                 > - Integer ids expose info like how many records there are
>
>                 This was the main intent, if I recall correctly. UUID PKs
>                 are not:
>                 - monotonically increasing
>                 - variably sized (string length, not bit length)
>
>                 So an object's PK doesn't give you any indication of how
>                 many other objects may be in the same collection, and while
>                 the Hrefs are long, for any given resource they will always
>                 be a predictable size.
>
>                 The major downside is really that they're a pain in the butt
>                 to type out when compared to int PKs, so if users are in a
>                 situation where they do have to type these things out, I
>                 think something has gone wrong.
>
>                 If users typing in PKs can't be avoided, UUIDs probably
>                 should be avoided. I recognize that this is effectively a
>                 restatement of "Hrefs would be shorter" in the context of
>                 how that impacts the user.
>
>                 > - Can’t support sharding or multiple dbs (are we ever going
>                 >   to need this?)
>
>                 A very good question. To the best of my recollection this
>                 was never stated as a hard requirement; it was only ever
>                 mentioned like it is here, as a potential positive
>                 side-effect of UUID keys. If collision-avoidance is not
>                 desired, and will certainly never be desired, then a normal
>                 integer field would likely be a less astonishing[0] user
>                 experience, and therefore a better user experience.
>
>                 [0]:
>                 https://en.wikipedia.org/wiki/Principle_of_least_astonishment
>
>
>                 _______________________________________________
>                 Pulp-dev mailing list
>                 Pulp-dev at redhat.com
>                 https://www.redhat.com/mailman/listinfo/pulp-dev
>
>
>
>
>
>
>
>
> _______________________________________________
> Pulp-dev mailing list
> Pulp-dev at redhat.com
> https://www.redhat.com/mailman/listinfo/pulp-dev
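
For anyone who wants to sanity-check the figures quoted above, the
arithmetic is easy to reproduce. This is only a back-of-the-envelope
sketch: the timings and byte sizes are copied from the thread, the
helper names speedup_pct() and extra_mib() are made up here, and the
storage estimate deliberately ignores indexes and foreign keys.

```python
# Timings quoted in the thread, in seconds.
create_int, create_uuid = 42.22, 55.98
search_int, search_uuid = 0.33, 0.44


def speedup_pct(fast, slow):
    """How much faster `fast` is than `slow`, as a percentage."""
    return (slow / fast - 1) * 100


def extra_mib(rows, int_bytes=4, uuid_bytes=16):
    """Extra raw key storage for UUID vs int PKs, in MiB.

    Counts only the PK column itself; indexes and FKs are ignored.
    """
    return rows * (uuid_bytes - int_bytes) / 2**20


print(f"create: {speedup_pct(create_int, create_uuid):.1f}% faster")
print(f"search: {speedup_pct(search_int, search_uuid):.1f}% faster")
print(f"1M rows: {extra_mib(1_000_000):.2f} MiB extra")
```

Both timing comparisons come out a little above 30%, and the per-table
difference for a million rows is a bit under 12 MiB, which matches what
was said above. Index and FK overhead would add to that.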
