<div dir="ltr"><div>This is a great recap/proposal of some discussions @mhrivnak and I have had before. We're pretty similar in our approach to resolving this. I also want to restate the pain point motivating it. The problem: when a worker goes offline, the tasks in its queue get cancelled, which is surprising to the user. This bug is tracked as issue #489 [0], and it's one of the biggest pain points with the tasking system. Pulp3 is also affected by this bug.<br><br></div>I think a mini-version of this solution would also resolve issue #489. Specifically, we could continue to use the "Dedicated queue" feature of Celery, but add a new "recovery" workflow: when a queue with work in it is orphaned because its worker was stopped or killed, that work gets routed to another worker. Either way, I think we could easily agree on a plan that would fix this. My main question is: when?<br><div><div><br>[0]: <a href="https://pulp.plan.io/issues/489">https://pulp.plan.io/issues/489</a><br></div><div><br></div><div>-Brian<br></div></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Mon, Oct 30, 2017 at 6:26 PM, Michael Hrivnak <span dir="ltr"><<a href="mailto:mhrivnak@redhat.com" target="_blank">mhrivnak@redhat.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div>While it's on my mind, I just want to get this idea out to others for future consideration. I do not think we should necessarily make any changes to Pulp 3.0 based on this.<br></div><div><br></div><div>Setup</div><div>-------</div><div><br></div><div>What is a Pulp worker? We tend to think of each one as a process, or a pair of processes in a parent-child relationship, with a number from 0-7 (or a higher number if you configure Pulp as such). Each worker has a systemd unit file and a queue. We know how many should be running and monitor them. 
If you have multiple machines, each machine has a defined set of numbered workers.</div><div><br></div><div>Pulp tracks each worker in the database. Why? For resource reservation. For any given resource (usually a repository), all not-complete tasks are assigned to the same worker so they go into one FIFO queue, which preserves order-of-operation. Having one worker per queue guarantees that no more than one task will run at a time for a given resource.</div><div><br></div><div>Difficulty arises when we deal with workers going offline. What if a worker dies unexpectedly and leaves its queue behind, orphaned? How can we quiesce a worker (stop assigning it work) so it can be taken offline gracefully? In a clustered environment, such as Pulp running in Kubernetes or OpenShift, users will expect the ability to scale the number of workers up and down, and so we'll need to address these challenges. The containerized-Pulp use case helps clarify, I think, the role of workers vs. queues.</div><div><br></div><div>Pitch<br clear="all"><div>------</div><div><br></div><div>Workers are stateless processes. They are a commodity that should come and go just as easily as the processes that handle http requests. The only long-term state associated with a worker is its queue, and I propose that we (eventually) stop defining a queue based on which worker created it.</div><div><br></div><div>Today: a worker starts, creates a queue for itself, and informs Pulp it is ready to receive work in that queue.</div><div><br></div><div>Future: a worker starts, the worker informs Pulp it is ready, and Pulp tells the worker which queues it should work from.</div><div><br></div><div>Queues become the first-class resource in Pulp that tasks are assigned to. 
Pulp monitors workers to ensure that each queue is assigned to exactly one healthy worker, but it does not care as much which one.</div><div><br></div><div>Use Cases</div><div>--------------</div><div><br></div><div>If a worker process dies and a new one starts up, Pulp can assign the orphaned queue to the new worker.</div><div><br></div><div>If a worker dies (gracefully or not) and a new one does not show up, Pulp can assign the orphaned queue to another worker, which would do double-duty until one of the queues was emptied, at which point Pulp could choose to delete that queue.</div><div><br></div><div>If a new additional worker shows up, Pulp could potentially assign it only to the general "celery" queue. Based on some policy, a new resource-reserving queue could optionally be created in the future, only if/when it was needed, and assigned to that worker.</div><div><br></div><div>Pulp as a clustered app would own and manage a pool of queues. The number of queues would be influenced by user settings (maybe a min and max), how much work is being requested at any given time, and how many processes are available to do work. The cluster would manage the full lifecycle of each queue.</div><div><br></div><div>Pulp would monitor a pool of workers who are effectively anonymous. They would have no meaningful identity from a scheduling standpoint. They come and go through outside influence, but the application would make no effort to manage their lifecycle. Pulp would only tell each worker which queues it should work from.</div><div><br></div><div>Summary</div><div>-----------</div><div><br></div><div>Details aside, the important points are:</div><div><br></div><div>- Focus on the queue as the owner of state.</div><div>- For purposes of scheduling tasks, worker processes are anonymous.</div><div>- Pulp manages a pool of queues, monitors a pool of workers, and assigns queues to workers as workers come and go.</div><div><br></div><div>Thoughts? 
Would it help to elaborate with concrete examples? Maybe a metaphor...</div><div><br></div><div>Black Friday</div><div>---------------</div><div><br></div><div>Extending our familiar Black Friday metaphor, starting with a recap.</div><div><br></div><div>Customers at a retail store are standing in one long line to check out. A traffic cop at the head of the line tells each person which register to go to, based on some rules. (Each register represents a worker's queue.)</div><div><br></div><div>This proposal is that we should think about the line at each register separately from the cashier. (The line is a queue, and the cashier is a worker process.) One cashier coming on duty can take over another's register so that the other can go on break. If a cashier has to close their register to go on break, the cashier next door might run back and forth between two registers for a while until one of the lines is empty. An entire shift of 16 fresh cashiers might show up and relieve the previous shift. (This is similar to migrating worker processes from one machine in a cluster to another; the queues stay the same, but they get matched with new anonymous workers.)</div><span class="HOEnZb"><font color="#888888"><div><br></div>-- <br><div class="m_-1261701856255069385gmail_signature" data-smartmail="gmail_signature"><div dir="ltr"><p style="color:rgb(0,0,0);font-family:overpass-mono,monospace;font-size:10px;margin:0px!important;padding:0px!important"><span style="margin:0px!important;padding:0px!important">Michael</span> <span style="margin:0px!important;padding:0px!important">Hrivnak</span></p><p style="color:rgb(0,0,0);font-family:overpass-mono,monospace;font-size:10px;margin:0px!important;padding:0px!important"></p><span style="color:rgb(0,0,0);font-family:overpass-mono,monospace;font-size:10px;margin:0px!important;padding:0px!important"><span style="margin:0px!important;padding:0px!important">Principal Software Engineer</span><span style="margin:0px!important;padding:0px!important">, <span 
style="margin:0px!important;padding:0px!important">RHCE</span></span> </span><span style="color:rgb(0,0,0);font-family:overpass-mono,monospace;font-size:10px"></span><br style="color:rgb(0,0,0);font-family:overpass-mono,monospace;font-size:10px;margin:0px!important;padding:0px!important"><p style="color:rgb(0,0,0);font-family:overpass-mono,monospace;font-size:10px;margin:0px!important;padding:0px!important">Red Hat</p></div></div> </font></span></div></div> <br>______________________________<wbr>_________________<br> Pulp-dev mailing list<br> <a href="mailto:Pulp-dev@redhat.com">Pulp-dev@redhat.com</a><br> <a href="https://www.redhat.com/mailman/listinfo/pulp-dev" rel="noreferrer" target="_blank">https://www.redhat.com/<wbr>mailman/listinfo/pulp-dev</a><br> <br></blockquote></div><br></div>
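<div>To make the "queues own the state, workers are anonymous" idea from the quoted proposal a bit more concrete, here is a minimal sketch in plain Python. Nothing in it is Pulp or Celery code: the class <code>QueueAssigner</code>, its method names, and the "least-loaded worker" policy are all hypothetical, chosen only to illustrate how orphaned queues could be handed to whichever worker is available instead of having their tasks cancelled.</div>

```python
class QueueAssigner:
    """Hypothetical sketch: queues are the long-lived resource, workers come and go."""

    def __init__(self, queue_names):
        # Each queue maps to the worker currently serving it; None means orphaned.
        self.owner = {name: None for name in queue_names}
        self.workers = set()

    def queues_of(self, worker):
        """Return the queues currently assigned to one worker."""
        return sorted(q for q, w in self.owner.items() if w == worker)

    def orphaned(self):
        """Return the queues that no healthy worker is serving."""
        return sorted(q for q, w in self.owner.items() if w is None)

    def _least_loaded(self):
        # Assumed policy: pick the online worker serving the fewest queues.
        # Sorting first makes tie-breaking deterministic.
        if not self.workers:
            return None
        return min(sorted(self.workers), key=lambda w: len(self.queues_of(w)))

    def worker_online(self, worker):
        """A worker reports ready; hand it (or its peers) any orphaned queues."""
        self.workers.add(worker)
        for queue in self.orphaned():
            self.owner[queue] = self._least_loaded()

    def worker_offline(self, worker):
        """A worker stops or dies; its queues move elsewhere, tasks stay queued."""
        self.workers.discard(worker)
        for queue in self.queues_of(worker):
            # If no worker is left, the queue stays orphaned (None) until one appears.
            self.owner[queue] = self._least_loaded()
```

<div>With two queues and a single worker "a", that worker does double duty on both queues. If "a" goes offline while a worker "b" is online, both queues move to "b". If every worker is gone, the queues simply wait as orphaned until the next worker reports in. A real implementation would also have to make sure a queue is never handed over while one of its tasks is still running, which this sketch does not attempt.</div>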