<div dir="ltr"><div></div><div>I'm less concerned with the speed difference between autoincrement and UUID PKs, and more concerned with how quickly performance was getting worse with database size on PostgreSQL in both cases (and, strangely, not on MariaDB). There's probably a *lot* that can be done to improve performance that we just haven't looked into yet, and a few weeks of effort on that front would (probably) make a much larger difference than the type of PK. Not that we should disregard it entirely. The PK decision has to be made soon, though, and the work I mentioned will have to wait a bit.<br></div><div><br></div><div>But I do think that if we <i>really</i> wanted to support MySQL/MariaDB while retaining autoincrement PKs, the best option would be a small MySQL-specific reimplementation of "bulk_create()" that would just call .save() on all the objects in a loop. It would probably be *much* slower for MySQL, but it would be fairly simple (only a couple of lines), it would work for both without compromising PostgreSQL performance, it would avoid making the docs more confusing to users, and it would be a lot less risky.<br></div><div><br></div><div>@Brian Would sharding actually be valuable? Have any Pulp users approached the sort of scale where it would be the right thing to do? From what I've heard, a single PostgreSQL installation is capable of handling 20 terabytes without tremendous issue. I can't imagine Pulp's database growing so large that it would be more economical to manage a second database server than it would be to add more storage to the server you have. I can be convinced otherwise though.<br></div><div><br></div><div>I think a more compelling point 2 would be that in the multi-tenant use case, UUIDs would make it vastly more difficult than autoincrement PKs would for one API user to gather information on another user.
Which, even though we're not going to handle multi-tenant out of the gate, is a reasonable thing to think about and possibly a good reason to go in that direction.<br></div><div><br></div><div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Mar 1, 2019 at 3:24 PM Brian Bouterse <<a href="mailto:bbouters@redhat.com">bbouters@redhat.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div>I've finally gotten to read through the numbers and this thread. It is a tradeoff but I am +1 for switching to UUIDs. I focus on the PostgreSQL UUID vs int case because that is our default database. I don't think too much about how things perform on MariaDB because they can improve their own performance to catch up to PostgreSQL, which regularly performs better afaict. I agree with the assessment of a 30%-ish slowdown in the large unit cases for PostgreSQL. Still, I believe the advantages of switching to UUIDs are worth it. Two main reasons stick out in my mind.<br></div><div><br></div><div>1. Our core code and all plugin code will always be compatible with common db backends even when using bulk_create()<br></div><div>2. We get database sharding with PostgreSQL, which you can only do with UUID pks. I was advised this years ago by jcline.<br></div><div><br></div><div>Performance and compatibility are a pretty classic trade-off. Overall I've found that initial releases launch with less performance and improve (often significantly) over time. Consider the interpreter pypy (not pypi). It went from "roughly 2000x slower [at initial launch] than CPython, to roughly 7x faster [now]" [0]. Launching a Pulp 3.0 that is 30% slower in the worst case but runs everywhere with zero "db-behavior surprises" is, I think, worth it.
Also conversely, if we don't adopt UUIDs, how will we address item 1 pre-RC?</div><div><br></div><div>@dawalker for the "can we have both" option, we probably can have some db-specific codepaths, but I don't think doing an application-wide PK type change as a setting is feasible to support. The db-specific codepaths are one way performance improves over time. For the initial release, to keep things simple, I hope we don't have conditional database codepaths (for now).</div><div><br></div><div>More discussion on this change is encouraged. Thanks @dalley so much for all the detailed investigation!</div><div><br></div><div>[0]: <a href="https://morepypy.blogspot.com/2018/09/the-first-15-years-of-pypy.html" target="_blank">https://morepypy.blogspot.com/2018/09/the-first-15-years-of-pypy.html</a><br></div><div><br></div><div>Thank you,</div><div>Brian<br></div></div></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Mar 1, 2019 at 2:51 PM Dana Walker <<a href="mailto:dawalker@redhat.com" target="_blank">dawalker@redhat.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div>As I brought up on irc, I don't know how feasible the complications to maintenance would be going forward, but I would prefer if we could use some sort of setting to choose uuid or id based on MariaDB or PostgreSQL. I want us to work everywhere, but I'm really concerned about the impact on our users of a 30-40% drop in speed and storage efficiency.</div><div><br></div><div>David wrote up a quick proof of concept after I brought this up but wasn't necessarily advocating it himself. I think Daniel and Dennis expressed some concerns.
I'd like to see more people discussing it here, with reasoning/examples on how doable something like this could be.</div><div><br></div><div>If it's not on the table, I understand, but I want to make sure we've considered all reasonable options, and the choice might not be a simple either/or.</div><div><br></div><div>Thanks,</div><div><br></div><div>--Dana<br></div><div><br></div><div><div><div dir="ltr" class="gmail-m_5986274129377014627gmail-m_5851459133451292743gmail-m_-1256802369304774127m_-3039835796394319797gmail_signature"><div dir="ltr"><div> <p style="font-weight:bold;margin:0px;padding:0px;font-size:14px;text-transform:uppercase"><span>Dana</span> <span>Walker</span></p> <p style="font-weight:normal;font-size:10px;margin:0px 0px 4px;text-transform:uppercase"><span>Associate Software Engineer</span></p> <p style="font-weight:normal;margin:0px;font-size:10px;color:rgb(153,153,153)"><a style="color:rgb(0,136,206);font-size:10px;margin:0px;text-decoration:none;font-family:'overpass',sans-serif" href="https://www.redhat.com" target="_blank">Red Hat</a></p> </div></div></div></div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Mar 1, 2019 at 9:15 AM David Davis <<a href="mailto:daviddavis@redhat.com" target="_blank">daviddavis@redhat.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">I just want to bump this thread.
If we hope to make the Pulp 3 RC date, we need feedback today.<br clear="all"><div><div dir="ltr" class="gmail-m_5986274129377014627gmail-m_5851459133451292743gmail-m_-1256802369304774127gmail-m_-3039835796394319797gmail-m_-1650898562000539570gmail_signature"><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div><br></div><div>David<br></div></div></div></div></div></div></div></div><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Feb 27, 2019 at 5:09 PM Matt Pusateri <<a href="mailto:mpusater@redhat.com" target="_blank">mpusater@redhat.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr">Not sure if Monyog (<a href="https://www.webyog.com/" target="_blank">https://www.webyog.com/</a>) will give a free license to an open source project, but that might help diagnose the MariaDB performance. Monyog is really nice; I wish it supported Postgres.</div><div dir="ltr"><br></div><div>Matt P. <br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, Feb 26, 2019 at 7:23 PM Daniel Alley <<a href="mailto:dalley@redhat.com" target="_blank">dalley@redhat.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div>Hello all,</div><div><br></div><div>We've had an ongoing discussion about whether Pulp would be able to perform acceptably if we switched back to UUID primary keys. I've finished doing the performance testing and I *think* the answer is yes.
Although to be honest, I'm not sure that I understand why, in the case of MariaDB.</div><div><br></div><div>I linked my testing methodology and results here: <a href="https://pulp.plan.io/issues/4290#note-18" target="_blank">https://pulp.plan.io/issues/4290#note-18</a></div><div><br></div><div>To summarize, I tested the following:</div><div><br></div><div>* How long it takes to perform subsequent large (lazy) syncs, with lots of content in the database (100-400k content units)<br></div><div>* How long it takes to perform various small but important database queries<br></div><div><br></div><div>The results were oddly contradictory in some cases.</div><div><br></div><div>The first four syncs (202,000 content units total) behaved mostly the same on PostgreSQL whether it used an autoincrement or UUID primary key. Subsequent syncs had a performance drop of 30-40% with UUIDs. Likewise, the code snippets performed 30+% worse. Sync time scaled linearly"ish" with the amount of content in the repository in both cases, which was a bit surprising to me. The size of the database at the end was about 40% larger with UUID primary keys (736 MB vs. 521 MB). The gap would be smaller in typical usage when you consider that most content types have more metadata than FileContent (what I was testing).<br></div><div><br></div><div>Autoincrement PostgreSQL (left) vs. UUID PostgreSQL (right) in diff form<br></div><div><a href="https://www.diffchecker.com/40AF8vvM" target="_blank">https://www.diffchecker.com/40AF8vvM</a></div><div><br></div><div>With MariaDB the first sync was almost 80% slower than the first sync w/ PostgreSQL, but every subsequent sync was as fast or faster, despite the tests of specific queries performing several times worse. Additionally, the sync performance did not decrease as rapidly as it did under PostgreSQL. With MariaDB, one of my test queries that worked fine when backed by PostgreSQL ended up hanging endlessly and I had to cut it off after 25 or so minutes.
[0] I would consider that a blocker to claiming we support MariaDB / MySQL.<br></div><div><br></div><div>But overall I'm not sure how to interpret the fact that on one hand the real-usage performance is equal or better, and on the other the performance of some of the underlying queries is noticeably worse. Maybe there's some weird caching going on in the backend, or the generated indexes are different?<br></div><div><br></div><div>UUID PostgreSQL (left) vs. UUID MariaDB (right) in diff form</div><div><a href="https://www.diffchecker.com/W1nnIQgj" target="_blank">https://www.diffchecker.com/W1nnIQgj</a></div><div><br></div><div>I'd like to invite some discussion on this, but nothing I've mentioned seems like it would be a problem for going forward with using UUID primary keys in a general sense. If we're all in agreement about that engineering decision then we can move forward with that work.<br></div><div><br></div><div>[0] for *some* but not all repository versions. No idea what's up there.<br></div><div><br></div></div></div></div></div> _______________________________________________<br> Pulp-dev mailing list<br> <a href="mailto:Pulp-dev@redhat.com" target="_blank">Pulp-dev@redhat.com</a><br> <a href="https://www.redhat.com/mailman/listinfo/pulp-dev" rel="noreferrer" target="_blank">https://www.redhat.com/mailman/listinfo/pulp-dev</a><br> </blockquote></div></blockquote></div></blockquote></div></blockquote></div></blockquote></div>