<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
The PK for a task record in the db does not need to be the same as
the job ID in rq/redis. Consistency is good. Let's make Task.id an
int like the rest of the tables and add a job_id field to
correlate with rq/redis.<br>
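<div>A rough sketch of that layout, using the stdlib sqlite3 module only as a stand-in for the real database. The table name, the state column, and the helpers create_task() and task_id_for_job() are hypothetical and not taken from the Pulp code base; the point is just an integer PK next to a separate, unique job_id that holds the rq/redis UUID.</div>
<pre>
```python
import sqlite3
import uuid

# Hypothetical schema: an integer primary key like the other tables,
# plus a separate job_id column holding the rq/redis job UUID.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE task ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " job_id TEXT NOT NULL UNIQUE,"
    " state TEXT NOT NULL)"
)


def create_task(state="waiting"):
    """Insert a task row and return (integer id, job uuid string)."""
    job_id = str(uuid.uuid4())
    cur = conn.execute(
        "INSERT INTO task (job_id, state) VALUES (?, ?)", (job_id, state)
    )
    conn.commit()
    return cur.lastrowid, job_id


def task_id_for_job(job_id):
    """Look up the integer task id from the rq/redis job id."""
    row = conn.execute(
        "SELECT id FROM task WHERE job_id = ?", (job_id,)
    ).fetchone()
    return row[0] if row else None


task_pk, job_id = create_task()
print(task_pk, job_id)
```
</pre>
<div>The unique constraint on job_id is an assumption here; it keeps the lookup from a job back to its task unambiguous.</div>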
<br>
<div class="moz-cite-prefix">On 07/11/2018 03:20 PM, David Davis
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CAHa=2W=1UMfkNYB0s=DKiA_AUodkrW=LGapBtM7FN=xxxy0vJw@mail.gmail.com">
<div dir="ltr">I actually started working on converting IDs from
UUIDs to integer IDs. It was pretty easy with one exception.
Jobs in rq/redis are created using task id[0] and this job id
needs to be a uuid. I see two possible solutions:
<div><br>
</div>
<div>1. We leave task id as a UUID but every other id is an
integer</div>
<div>2. We add a job uuid field on task<br>
<div><br>
<div>With the hard numbers that show that integer IDs are
significantly faster, I think we should proceed unless
anyone has a major objection.</div>
<div><br>
</div>
<div>Great work on this btw.</div>
<div><br>
</div>
<div>[0] <a
href="https://github.com/pulp/pulp/blob/9bfc50d90a24c9d0ac4a93f5718187515b947058/pulpcore/pulpcore/tasking/tasks.py#L187"
target="_blank" moz-do-not-send="true">https://github.com/pulp/pulp/blob/9bfc50d90a24c9d0ac4a93f5718187515b947058/pulpcore/pulpcore/tasking/tasks.py#L187</a></div>
</div>
</div>
<div>
<div dir="ltr" class="gmail_signature">
<div dir="ltr">
<div>
<div dir="ltr">
<div>
<div dir="ltr">
<div><br>
</div>
<div>David<br>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
<br>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr">On Wed, Jul 11, 2018 at 3:56 PM Daniel Alley <<a
href="mailto:dalley@redhat.com" moz-do-not-send="true">dalley@redhat.com</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div dir="ltr">
<div>w/ creating 400,000 units, the non-uuid PK is 30%
faster at 42.22 seconds vs. 55.98 seconds.</div>
<div><br>
</div>
<div>w/ searching through the same 400,000 units,
performance is still about 30% faster. Doing a filter for
file content units that have a
relative_path__startswith={some random letter} (I put
UUIDs in all the fields) takes about 0.44 seconds if the
model has a UUID pk and about 0.33 seconds if the model
has a default Django auto-incrementing PK.</div>
</div>
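<div>For reference, a small worked calculation of what "about 30% faster" means for the timings quoted above. It only uses the four numbers from this message; the helper names time_saved() and speedup() are made up for illustration.</div>
<pre>
```python
# Timings quoted in the message above (seconds).
create_uuid, create_int = 55.98, 42.22
search_uuid, search_int = 0.44, 0.33


def time_saved(slow, fast):
    """Fraction of the slower run's time that the faster run saves."""
    return (slow - fast) / slow


def speedup(slow, fast):
    """How many times faster the faster run is."""
    return slow / fast


for label, slow, fast in [
    ("create", create_uuid, create_int),
    ("search", search_uuid, search_int),
]:
    print(f"{label}: {time_saved(slow, fast):.1%} less time, "
          f"{speedup(slow, fast):.2f}x speedup")
```
</pre>
<div>Both cases come out at roughly a quarter less wall-clock time, or about a third more throughput, depending on which way the ratio is read.</div>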
<div class="gmail_extra"><br>
<div class="gmail_quote">On Wed, Jul 11, 2018 at 11:03 AM,
Daniel Alley <span dir="ltr"><<a
href="mailto:dalley@redhat.com" target="_blank"
moz-do-not-send="true">dalley@redhat.com</a>></span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div dir="ltr">
<div>So, since I've already been working on some Pulp
3 benchmarking I decided to go ahead and benchmark
this to get some actual data.</div>
<div><br>
</div>
                <div>Disclaimer: The following data was collected using
                  bulk_create() with a modified, flat, non-inheriting
                  content model, not the multi-table inherited
                  content model we're currently using. We are also not
                  currently using bulk_create() in Pulp 3, but likely
                  will end up using it eventually.<br>
</div>
<div><br>
</div>
<div>Using normal IDs instead of UUIDs was between 13%
and 25% faster with 15,000 units. 15,000 units
isn't really a sufficient value to actually test
index performance, so I'm rerunning it with a few
hundred thousand units, but that will take a
substantial amount of time to run. I'll follow up
later.<br>
</div>
<div><br>
</div>
<div>As far as search/update performance goes, that
probably has better margins than just insert
performance, but I'll need to write new code to
benchmark that properly.<br>
</div>
</div>
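<div>A minimal sketch of the shape of such an insert benchmark, not the code that produced the numbers in this thread. It assumes the stdlib sqlite3 module with executemany() as a stand-in for Postgres and Django's bulk_create(), so its timings are not comparable to the ones reported here; the table layout and the function time_bulk_insert() are hypothetical.</div>
<pre>
```python
import sqlite3
import time
import uuid


def time_bulk_insert(pk_kind, n_rows):
    """Create a flat table with the given PK type, insert n_rows,
    and return (elapsed seconds, row count)."""
    conn = sqlite3.connect(":memory:")
    if pk_kind == "int":
        conn.execute(
            "CREATE TABLE content (id INTEGER PRIMARY KEY, relative_path TEXT)"
        )
        # A NULL id lets sqlite assign the next integer key.
        rows = [(None, str(uuid.uuid4())) for _ in range(n_rows)]
    else:
        conn.execute(
            "CREATE TABLE content (id TEXT PRIMARY KEY, relative_path TEXT)"
        )
        rows = [(str(uuid.uuid4()), str(uuid.uuid4())) for _ in range(n_rows)]
    start = time.perf_counter()
    conn.executemany("INSERT INTO content VALUES (?, ?)", rows)
    conn.commit()
    elapsed = time.perf_counter() - start
    count = conn.execute("SELECT COUNT(*) FROM content").fetchone()[0]
    conn.close()
    return elapsed, count


for kind in ("int", "uuid"):
    seconds, count = time_bulk_insert(kind, 15000)
    print(f"{kind} PK: {count} rows in {seconds:.3f}s")
```
</pre>
<div>Row generation is kept outside the timed section so that only the insert and commit are measured.</div>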
<div class="m_-3769914636103084512HOEnZb">
<div class="m_-3769914636103084512h5">
<div class="gmail_extra"><br>
<div class="gmail_quote">On Thu, May 24, 2018 at
11:52 AM, David Davis <span dir="ltr"><<a
href="mailto:daviddavis@redhat.com"
target="_blank" moz-do-not-send="true">daviddavis@redhat.com</a>></span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0
0 0 .8ex;border-left:1px #ccc
solid;padding-left:1ex">
                          <div dir="ltr">Agreed on performance. After
                            some more Googling, opinions seem mixed
                            on whether UUID performance is
                            worse or not. If this is a significant
                            reason to switch, I agree we should test out
                            the performance.<br>
<div><br>
</div>
                            <div>Regarding the disk size, I think the
                              cost of using UUIDs is cumulative. Larger
                              PKs mean bigger index sizes, bigger FKs, etc. I
                              agree that it’s probably not a major
                              concern but I wouldn’t say it’s trivial.</div>
<div class="gmail_extra"><span
class="m_-3769914636103084512m_8458828713642419313HOEnZb"><font
color="#888888">
<div>
<div
class="m_-3769914636103084512m_8458828713642419313m_5408261673236411045m_7778541513043329500gmail_signature"
data-smartmail="gmail_signature">
<div dir="ltr">
<div>
<div dir="ltr">
<div>
<div dir="ltr">
<div><br>
</div>
<div>David<br>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
<br>
</font></span>
<div class="gmail_quote">
<div>
<div
class="m_-3769914636103084512m_8458828713642419313h5">On
Thu, May 24, 2018 at 11:27 AM, Sean
Myers <span dir="ltr"><<a
href="mailto:sean.myers@redhat.com"
target="_blank"
moz-do-not-send="true">sean.myers@redhat.com</a>></span>
wrote:<br>
</div>
</div>
<blockquote class="gmail_quote"
style="margin:0 0 0
.8ex;border-left:1px #ccc
solid;padding-left:1ex">
<div>
<div
class="m_-3769914636103084512m_8458828713642419313h5">Responses
inline.<br>
<span><br>
On 05/23/2018 02:26 PM, David
Davis wrote:<br>
> Before the release of Pulp
3.0 GA, I think it’s worth just
checking in to<br>
> make sure we want to use
UUIDs over integer based IDs.
Changing from UUIDs<br>
> to ints would be a very
easy change at this point (1-2
lines of code) but<br>
> after GA ships, it would be
hard if not impossible to
switch.<br>
> <br>
> I think there are a number
of reasons why we might want to
consider integer<br>
> IDs:<br>
> <br>
> - Better performance all
around for inserts[0], searches,
indexing, etc<br>
<br>
</span>I don't really care either
way, but it's worth pointing out
that UUIDs are<br>
integers (in the sense that the
entire internet can be reduced to
a single<br>
integer since it's all just bits).
To the best of my knowledge they
are equally<br>
performant to integers and stored
in similar ways in Postgres.<br>
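<div>To make the "UUIDs are integers" point concrete, a short illustration with Python's stdlib uuid module. The UUID value below is an arbitrary example, not one from Pulp.</div>
<pre>
```python
import uuid

u = uuid.UUID("12345678-1234-5678-1234-567812345678")
as_int = u.int                     # the same value as one large integer

print(as_int.bit_length())         # never more than 128 bits
print(len(u.bytes))                # 16 bytes when stored in binary form
print(uuid.UUID(int=as_int) == u)  # converting back loses nothing
```
</pre>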
<br>
You linked a MySQL experiment,
done using a version of MySQL that
is nearly 10<br>
years old. If there are concerns
about the performance of UUID PKs
vs. int PKs<br>
in Pulp, we should compare apples
to apples and profile Pulp using
UUID PKs,<br>
profile Pulp using integer PKs,
and then compare the two.<br>
<br>
In my small-scale testing (100,000
randomly generated content rows of
a<br>
proto-RPM content model, 1000
repositories randomly related to
each, no db funny<br>
business beyond enforced
uniqueness constraints), there was
either no<br>
difference, or what difference
there was fell into the margin of
error.<br>
<span><br>
                                      > - Less storage required (4
                                      bytes for int vs 16 bytes for
                                      UUIDs)<br>
<br>
</span>Well, okay...UUIDs are
*huge* integers. But it's the
length of an IPv6 address<br>
vs. the length of an IPv4 address.
While it's true that 4 < 16,
both are still<br>
pretty small. Trivially so, I
think.<br>
<br>
Without taking relations into
account, a table with a million
rows should be a<br>
little less than twelve
mega(mebi)bytes larger. Even at
scale, the size<br>
difference is negligible,
especially when compared to the
size on disk of the<br>
actual content you'd need to be
storing that those million rows
represent.<br>
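<div>The arithmetic behind the "a little less than twelve mebibytes" figure, as a quick check. It only counts the raw key bytes for a single column and ignores indexes, foreign keys, and per-row overhead.</div>
<pre>
```python
INT_PK_BYTES = 4     # size of an int key, as quoted above
UUID_PK_BYTES = 16   # size of a UUID key, as quoted above
ROWS = 1_000_000

extra_bytes = (UUID_PK_BYTES - INT_PK_BYTES) * ROWS
extra_mib = extra_bytes / 2**20

print(f"{extra_bytes} extra bytes = {extra_mib:.2f} MiB")
```
</pre>
<div>That is 12,000,000 bytes, or about 11.44 MiB, for the primary key column alone.</div>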
<span><br>
> - Hrefs would be shorter
(e.g.
/pulp/api/v3/repositories/1/)<br>
> - In line with other apps
like Katello<br>
<br>
</span>I think these two are
definitely worth considering,
though.<br>
<span><br>
> There are some downsides to
consider though:<br>
> <br>
> - Integer ids expose info
like how many records there are<br>
<br>
</span>This was the main intent,
if I recall correctly. UUID PKs
are not:<br>
- monotonically increasing<br>
- variably sized (string length,
not bit length)<br>
<br>
                                    So an object's PK doesn't give you
any indication of how many other
objects may<br>
be in the same collection, and
while the Hrefs are long, for any
given resource<br>
they will always be a predictable
size.<br>
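<div>A small illustration of the "predictable size" point, using the href pattern from earlier in the thread. The integer PK values are arbitrary examples.</div>
<pre>
```python
import uuid

int_pks = [1, 42, 1000000]
uuid_pks = [uuid.uuid4() for _ in int_pks]

int_hrefs = [f"/pulp/api/v3/repositories/{pk}/" for pk in int_pks]
uuid_hrefs = [f"/pulp/api/v3/repositories/{pk}/" for pk in uuid_pks]

# Integer hrefs grow with the value of the key; UUID hrefs do not.
print(sorted({len(h) for h in int_hrefs}))
print(sorted({len(h) for h in uuid_hrefs}))
```
</pre>
<div>The first line prints three different lengths, the second a single one, since the string form of a UUID is always 36 characters.</div>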
<br>
The major downside is really that
they're a pain in the butt to type
out when<br>
compared to int PKs, so if users
are in a situation where they do
have to type<br>
these things out, I think
something has gone wrong.<br>
<br>
If users typing in PKs can't be
avoided, UUIDs probably should be
avoided. I<br>
recognize that this is effectively
a restatement of "Hrefs would be
shorter" in<br>
the context of how that impacts
the user.<br>
<span><br>
> - Can’t support sharding or
multiple dbs (are we ever going
to need this?)<br>
<br>
</span>A very good question. To
the best of my recollection this
was never stated as a<br>
hard requirement; it was only ever
mentioned like it is here, as a
potential<br>
positive side-effect of UUID keys.
If collision-avoidance is not
desired, and<br>
will certainly never be desired,
then a normal integer field would
likely be a<br>
less astonishing[0] user
experience, and therefore a better
user experience.<br>
<br>
[0]: <a
href="https://en.wikipedia.org/wiki/Principle_of_least_astonishment"
rel="noreferrer" target="_blank"
moz-do-not-send="true">https://en.wikipedia.org/wiki/Principle_of_least_astonishment</a><br>
<br>
<br>
</div>
</div>
<span>_______________________________________________<br>
Pulp-dev mailing list<br>
<a href="mailto:Pulp-dev@redhat.com"
target="_blank"
moz-do-not-send="true">Pulp-dev@redhat.com</a><br>
<a
href="https://www.redhat.com/mailman/listinfo/pulp-dev"
rel="noreferrer" target="_blank"
moz-do-not-send="true">https://www.redhat.com/mailman/listinfo/pulp-dev</a><br>
<br>
</span></blockquote>
</div>
<br>
</div>
</div>
<br>
</blockquote>
</div>
<br>
</div>
</div>
</div>
</blockquote>
</div>
<br>
</div>
</blockquote>
</div>
</blockquote>
<br>
</body>
</html>