[Pulp-dev] rethinking workers vs queues
bbouters at redhat.com
Tue Oct 31 13:31:44 UTC 2017
This is a great recap/proposal of some discussions @mhrivnak and I have had
before. We're pretty similar in our approach to resolving this. I also want to
restate the pain point motivating it. The problem: when a worker goes
offline, the tasks that are in its queue get cancelled, which is surprising
to the user. This bug is tracked as issue #489, and it's one of the
biggest pain points with the tasking system. Pulp3 is also affected by this.
I think a mini-version of this solution would also resolve issue #489.
Specifically we could continue to use the "Dedicated queue" feature of
celery, but we could add a new "recovery" workflow for the case where a queue
with work in it is orphaned because its worker was stopped or killed; the
workflow would route that work to another worker. Either way though, I think we could
easily agree on a plan to fix this that would work. My main question is:
On Mon, Oct 30, 2017 at 6:26 PM, Michael Hrivnak <mhrivnak at redhat.com> wrote:
> While it's on my mind, I just want to get this idea out to others for
> future consideration. I do not think we should necessarily make any changes
> to Pulp 3.0 based on this.
> What is a Pulp worker? We tend to think of one as a process, or pair of
> processes in a parent-child relationship, with a number from 0-7 (or a higher
> number if you configure Pulp as such). Each worker has a systemd unit file
> and a queue. We know how many should be running and monitor them. If you
> have multiple machines, each machine has a defined set of numbered workers.
> Pulp tracks each worker in the database. Why? For resource reservation.
> For any given resource (usually a repository), all not-complete tasks are
> assigned to the same worker so they go into one FIFO queue, which preserves
> order-of-operation. Having one worker per queue guarantees that no more
> than one task will run at a time for a given resource.
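A minimal sketch of the reservation rule described above, not Pulp's actual implementation. The class name `ReservationRouter`, its methods, and the "least-busy queue" choice for a new resource are all assumptions made for illustration. It only shows the invariant: while a resource has incomplete tasks, every new task for it goes to the same FIFO queue.

```python
from collections import defaultdict, deque


class ReservationRouter:
    """Hypothetical router: one FIFO queue per resource while work is pending."""

    def __init__(self, queue_names):
        self.queues = {name: deque() for name in queue_names}
        self.reservations = {}           # resource -> queue name
        self.pending = defaultdict(int)  # resource -> count of incomplete tasks

    def dispatch(self, resource, task):
        """Queue a task, reusing the resource's reserved queue if it has one."""
        queue = self.reservations.get(resource)
        if queue is None:
            # Assumed policy: a new resource goes to the queue with the least work.
            queue = min(self.queues, key=lambda name: len(self.queues[name]))
            self.reservations[resource] = queue
        self.queues[queue].append((resource, task))
        self.pending[resource] += 1
        return queue

    def run_next(self, queue):
        """Pop the oldest task from a queue and release the reservation if done."""
        resource, task = self.queues[queue].popleft()
        self.pending[resource] -= 1
        if self.pending[resource] == 0:
            # No incomplete tasks remain, so the resource is free to go anywhere.
            del self.pending[resource]
            del self.reservations[resource]
        return resource, task
```

Because each queue is drained by a single consumer in order, two tasks for the same repository can never run at the same time in this model.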
> Difficulty arises when we deal with workers going offline. What if a
> worker dies unexpectedly and leaves its queue behind, orphaned? How can we
> quiesce a worker (stop assigning it work) so it can be taken offline
> gracefully? In a clustered environment, such as Pulp running in Kubernetes
> or OpenShift, users will expect the ability to scale the number of workers
> up and down, and so we'll need to address these challenges. The
> containerized-Pulp use case helps clarify, I think, the role of workers vs. queues.
> Workers are stateless processes. They are a commodity that should come and
> go just as easily as the processes that handle http requests. The only
> long-term state associated with a worker is its queue, and I propose that
> we (eventually) stop defining a queue based on which worker created it.
> Today: a worker starts, creates a queue for itself, and informs Pulp it is
> ready to receive work in that queue.
> Future: a worker starts, the worker informs Pulp it is ready, and Pulp
> tells the worker which queues it should work from.
> Queues become the first-class resource in Pulp that tasks are assigned to.
> Pulp monitors workers to ensure that each queue is assigned to exactly one
> healthy worker, but it does not care as much which one.
> Use Cases
> If a worker process dies and a new one starts up, Pulp can assign the
> orphaned queue to the new worker.
> If a worker dies (gracefully or not) and a new one does not show up, Pulp
> can assign the orphaned queue to another worker, which would do double-duty
> until one of the queues was emptied, at which point Pulp could choose to
> delete that queue.
> If a new additional worker shows up, Pulp could potentially assign it only
> to the general "celery" queue. Based on some policy, a new
> resource-reserving queue could optionally be created in the future, only
> if/when it was needed, and assigned to that worker.
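The use cases above can be sketched as a small bookkeeping class. This is only an illustration under stated assumptions: `QueuePool` and its method names are hypothetical, the "fewest queues" tie-breaking policy is invented for the example, and nothing here talks to Celery or a broker. It tracks which worker owns which queue and moves ownership as workers come and go.

```python
class QueuePool:
    """Hypothetical model: queues hold the state, workers are interchangeable."""

    def __init__(self):
        self.owner = {}       # queue name -> worker name, or None if orphaned
        self.workers = set()  # workers currently known to be healthy

    def queues_of(self, worker):
        return sorted(q for q, w in self.owner.items() if w == worker)

    def _least_loaded(self):
        # Assumed policy: prefer the healthy worker that owns the fewest queues.
        if not self.workers:
            return None
        return min(sorted(self.workers), key=lambda w: len(self.queues_of(w)))

    def add_queue(self, queue):
        """Create a resource-reserving queue and hand it to some worker."""
        self.owner[queue] = self._least_loaded()
        return self.owner[queue]

    def worker_online(self, worker):
        """A worker reports ready; it is told which queues to work from."""
        self.workers.add(worker)
        for queue, current in self.owner.items():
            if current is None:
                self.owner[queue] = worker  # adopt any orphaned queue
        return self.queues_of(worker)

    def worker_offline(self, worker):
        """A worker is gone; its queues move to a remaining worker, if any."""
        orphaned = self.queues_of(worker)
        self.workers.discard(worker)
        for queue in orphaned:
            self.owner[queue] = self._least_loaded()
        return orphaned

    def queue_emptied(self, queue):
        """Delete an empty queue only if its owner is doing double duty."""
        worker = self.owner.get(queue)
        if worker is not None and len(self.queues_of(worker)) > 1:
            del self.owner[queue]
            return True
        return False
```

In this model a newly arrived worker gets no resource-reserving queue unless an orphan exists, which corresponds to the "general queue only" case; a queue left with no owner simply waits until the next worker reports in.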
> Pulp as a clustered app would own and manage a pool of queues. The number
> of queues would be influenced by user settings (maybe a min and max), how
> much work is being requested at any given time, and how many processes are
> available to do work. The cluster would manage the full lifecycle of each queue.
> Pulp would monitor a pool of workers who are effectively anonymous. They
> would have no meaningful identity from a scheduling standpoint. They come
> and go through outside influence, but the application would make no effort
> to manage their lifecycle. Pulp would only tell each worker which queues it
> should work from.
> Details aside, the important points are:
> - Focus on the queue as the owner of state.
> - For purposes of scheduling tasks, worker processes are anonymous.
> - Pulp manages a pool of queues, monitors a pool of workers, and assigns
> queues to workers as workers come and go.
> Thoughts? Would it help to elaborate with concrete examples? Maybe a
> Black Friday
> Extending our familiar Black Friday metaphor... starting with a recap.
> Customers at a retail store are standing in one long line to check out. A
> traffic cop at the head of the line tells each person which register to go
> to, based on some rules. (Each register represents a worker's queue.)
> This proposal is that we should think about the line at each register
> separately from the cashier. (The line is a queue, and the cashier is a
> worker process.) One cashier coming on duty can take over another's register
> so the other can go on break. If a cashier has to close their register to go on
> break, the cashier next door might run back and forth between two registers
> for a while until one of the lines is empty. An entire shift of 16 fresh
> cashiers might show up and relieve the previous shift. (This is similar to
> migrating worker processes from one machine in a cluster to another; the
> queues stay the same, but they get matched with new anonymous workers.)
> Michael Hrivnak
> Principal Software Engineer, RHCE
> Red Hat