[Pulp-dev] rethinking workers vs queues

Mon Oct 30 22:26:25 UTC 2017

While it's on my mind, I just want to get this idea out to others for
future consideration. I do not think we should necessarily make any changes
to Pulp 3.0 based on this.

Setup
-------

What is a Pulp worker? We tend to think of one as a process, or a pair of
processes in a parent-child relationship, numbered 0-7 (or higher if you
configure Pulp that way). Each worker has a systemd unit file and a queue.
We know how many should be running, and we monitor them. If you have
multiple machines, each machine has a defined set of numbered workers.

Pulp tracks each worker in the database. Why? For resource reservation. For
any given resource (usually a repository), all incomplete tasks are
assigned to the same worker so they go into one FIFO queue, which preserves
order-of-operation. Having one worker per queue guarantees that no more
than one task will run at a time for a given resource.
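
To make the reservation rule concrete, here is a toy sketch in Python. It
is not Pulp's actual code: the class name, the method names, and the
"shortest queue" choice for a resource with no reservation are all
assumptions made for illustration. The only behavior taken from the
description above is that every incomplete task for a resource lands in the
same worker's FIFO queue.

```python
from collections import defaultdict, deque


class ReservationRouter:
    """Toy model: all incomplete tasks for a resource share one FIFO queue."""

    def __init__(self, worker_names):
        self.queues = {name: deque() for name in worker_names}
        self.reserved = {}                # resource -> worker name
        self.pending = defaultdict(int)   # resource -> incomplete task count

    def dispatch(self, resource, task):
        worker = self.reserved.get(resource)
        if worker is None:
            # Assumption: an unreserved resource goes to the shortest queue.
            worker = min(self.queues, key=lambda n: len(self.queues[n]))
            self.reserved[resource] = worker
        self.queues[worker].append((resource, task))
        self.pending[resource] += 1
        return worker

    def complete(self, worker):
        """Pop the oldest task from a worker's queue and mark it done."""
        resource, task = self.queues[worker].popleft()
        self.pending[resource] -= 1
        if self.pending[resource] == 0:
            # No incomplete tasks remain, so the reservation is released.
            del self.pending[resource]
            del self.reserved[resource]
        return resource, task


router = ReservationRouter(["worker-0", "worker-1"])
first = router.dispatch("repo-a", "sync")
other = router.dispatch("repo-b", "sync")
second = router.dispatch("repo-a", "publish")
```

Here both "repo-a" tasks end up behind each other on one worker, while
"repo-b" is free to run elsewhere.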

Difficulty arises when we deal with workers going offline. What if a worker
dies unexpectedly and leaves its queue behind, orphaned? How can we quiesce
a worker (stop assigning it work) so it can be taken offline gracefully? In
a clustered environment, such as Pulp running in Kubernetes or OpenShift,
users will expect the ability to scale the number of workers up and down,
and so we'll need to address these challenges. The containerized-Pulp use
case helps clarify, I think, the role of workers vs. queues.

Pitch
------

Workers are stateless processes. They are a commodity that should come and
go just as easily as the processes that handle HTTP requests. The only
long-term state associated with a worker is its queue, and I propose that
we (eventually) stop defining a queue based on which worker created it.

Today: a worker starts, creates a queue for itself, and informs Pulp it is
ready to receive work in that queue.

Future: a worker starts, the worker informs Pulp it is ready, and Pulp
tells the worker which queues it should work from.

Queues become the first-class resource in Pulp that tasks are assigned to.
Pulp monitors workers to ensure that each queue is assigned to exactly one
healthy worker, but it does not care as much which one.
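
A minimal sketch of the "future" handshake, assuming a hypothetical
Coordinator object standing in for Pulp. The names and the one-queue-per-
new-worker policy are illustrative assumptions, not an existing API. The
point is only that the worker asks and Pulp answers with queue names.

```python
class Coordinator:
    """Toy stand-in for Pulp: queues exist first, workers are matched later."""

    def __init__(self, queue_names):
        # queue name -> worker id, or None while the queue has no worker
        self.owner = {name: None for name in queue_names}

    def worker_ready(self, worker_id):
        """A worker announces itself; return the queues it should consume."""
        for queue, owner in self.owner.items():
            if owner is None:
                self.owner[queue] = worker_id
                break  # assumption: hand out one unowned queue per new worker
        return [q for q, w in self.owner.items() if w == worker_id]

    def unassigned(self):
        return [q for q, w in self.owner.items() if w is None]


coordinator = Coordinator(["reserved-0", "reserved-1"])
queues_a = coordinator.worker_ready("a")
queues_b = coordinator.worker_ready("b")
queues_c = coordinator.worker_ready("c")
```

The worker ids here carry no meaning beyond telling two live processes
apart; nothing about a queue depends on which worker ends up serving it.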

Use Cases
--------------

If a worker process dies and a new one starts up, Pulp can assign the
orphaned queue to the new worker.

If a worker dies (gracefully or not) and a new one does not show up, Pulp
can assign the orphaned queue to another worker, which would do double-duty
until one of the queues was emptied, at which point Pulp could choose to
delete that queue.
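
These two cases can be sketched as follows. This is a toy model under
stated assumptions: QueuePool and its methods are hypothetical names, and
"give the orphan to the surviving worker with the fewest queues" is just
one possible policy.

```python
class QueuePool:
    """Toy model: queues outlive the workers that serve them."""

    def __init__(self):
        self.owner = {}   # queue name -> worker id (None while orphaned)
        self.depth = {}   # queue name -> number of waiting tasks

    def add_queue(self, queue, worker, depth=0):
        self.owner[queue] = worker
        self.depth[queue] = depth

    def queue_count(self, worker):
        return sum(1 for w in self.owner.values() if w == worker)

    def worker_lost(self, dead_worker, replacement=None):
        """Reassign every queue that dead_worker was serving."""
        orphans = [q for q, w in self.owner.items() if w == dead_worker]
        survivors = sorted(
            {w for w in self.owner.values() if w not in (None, dead_worker)}
        )
        moved = {}
        for queue in orphans:
            if replacement is not None:
                target = replacement      # a new worker takes over
            elif survivors:
                # Assumption: the least-loaded survivor does double duty.
                target = min(survivors, key=self.queue_count)
            else:
                target = None             # orphaned until a worker appears
            self.owner[queue] = target
            moved[queue] = target
        return moved

    def queue_drained(self, queue):
        """Delete an emptied queue if its worker is serving another one."""
        self.depth[queue] = 0
        worker = self.owner[queue]
        if worker is not None and self.queue_count(worker) > 1:
            del self.owner[queue]
            del self.depth[queue]
            return True
        return False


pool = QueuePool()
pool.add_queue("q0", "w0", depth=3)
pool.add_queue("q1", "w1", depth=2)
replaced = pool.worker_lost("w0", replacement="w2")  # new worker shows up
doubled = pool.worker_lost("w1")                     # nobody new shows up
```

After the second call "w2" is serving both queues. Once one of them drains,
queue_drained() removes it and "w2" is back to a single queue.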

If a new additional worker shows up, Pulp could potentially assign it only
to the general "celery" queue. Based on some policy, a new
resource-reserving queue could optionally be created in the future, only
if/when it was needed, and assigned to that worker.

Pulp as a clustered app would own and manage a pool of queues. The number
of queues would be influenced by user settings (maybe a min and max), how
much work is being requested at any given time, and how many processes are
available to do work. The cluster would manage the full lifecycle of each
queue.
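
As one possible shape for that sizing decision, here is a small sketch.
The function name, its parameters, and the policy itself (never more queues
than busy resources or live workers, clamped to the user's min and max) are
assumptions for illustration, not a worked-out design.

```python
def target_queue_count(min_queues, max_queues, busy_resources, live_workers):
    """Return how many resource-reserving queues the cluster should keep."""
    if min_queues > max_queues:
        raise ValueError("min_queues must not exceed max_queues")
    # More queues than busy resources or live workers would not add
    # any parallelism, so take the smaller of the two as the demand.
    wanted = min(busy_resources, live_workers)
    # Clamp the demand to the user-configured bounds.
    return max(min_queues, min(max_queues, wanted))


quiet = target_queue_count(2, 8, busy_resources=0, live_workers=4)
normal = target_queue_count(2, 8, busy_resources=5, live_workers=16)
busy = target_queue_count(2, 8, busy_resources=20, live_workers=16)
```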

Pulp would monitor a pool of workers who are effectively anonymous. They
would have no meaningful identity from a scheduling standpoint. They come
and go through outside influence, but the application would make no effort
to manage their lifecycle. Pulp would only tell each worker which queues it
should work from.

Summary
-----------

Details aside, the important points are:

- Focus on the queue as the owner of state.
- For purposes of scheduling tasks, worker processes are anonymous.
- Pulp manages a pool of queues, monitors a pool of workers, and assigns
  queues to workers as workers come and go.

Thoughts? Would it help to elaborate with concrete examples? Maybe a
metaphor...

Black Friday
---------------

Extending our familiar Black Friday metaphor... starting with a recap.

Customers at a retail store are standing in one long line to check out. A
traffic cop at the head of the line tells each person which register to go
to, based on some rules. Each register represents a worker's queue.

The proposal is that we should think about the line at each register
separately from the cashier: the line is a queue, and the cashier is a
worker process. A cashier coming on duty can take over another's register
so that cashier can go on break. If a cashier has to close their register
to go on break, the cashier next door might run back and forth between two
registers for a while until one of the lines is empty. An entire shift of
16 fresh cashiers might show up and relieve the previous shift. That is
similar to migrating worker processes from one machine in a cluster to
another: the queues stay the same, but they get matched with new anonymous
workers.

-- 

Michael Hrivnak

Principal Software Engineer, RHCE

Red Hat