[Ovirt-devel] Thoughts about taskomatic redesign

Mon Jun 23 09:06:08 UTC 2008

Ian, et al.,
     I've been doing some thinking about what taskomatic needs to do in its
next incarnation, and about how it might do it.

WHAT:
1)  Taskomatic needs to be able to run on multiple machines at the same time,
accessing a central database.
2)  Taskomatic needs to be able to fire off tasks relating to different VMs (or
storage pools) concurrently (whether it's just run on one machine or many).

HOW:
1)  I think we should actually have two modes for taskomatic: standalone (i.e. I
am the only taskomatic), and multi-host (there are other taskomatics).  The
reason is that in the standalone case, we probably want to fork one taskomatic
process for each VM (or storage pool) we want to perform actions on.  In the
multi-host case, we don't know how many other taskomatics might be out there
doing tasks, so we keep one process per machine.  The mode should be selectable
via a command-line option or a config file option.

2)  We need to lock rows in the database as each taskomatic wakes up and finds
work to do.  Luckily both postgres and activerecord support row locking, so the
underlying infrastructure is there.

In the standalone case, taskomatic should wake up, look at how many different
VMs (or storage pools) there are currently tasks queued for, and fork off that
many workers to do work (i.e. if you have start_vm 1, start_vm 2, stop_vm 1 in
the queue, you would fork off two workers).  Each worker would lock all of the
rows in the database corresponding to its VM (i.e. the first worker would
lock all rows having to do with VM 1), and then busy itself with executing
the actions for that VM serially.  I guess the locking isn't strictly necessary
here, since we can tell each worker which VM or storage pool ID it should work
on, but it makes it more like the multi-host case.
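
To make the worker count in the standalone case concrete, here is a rough
sketch of the grouping step.  It is Python rather than taskomatic's own code,
and the (action, target id) tuple layout and the function name are made up for
illustration; the real task rows look different.

```python
from collections import OrderedDict


def group_tasks_by_target(queue):
    """Group queued (action, target_id) pairs by target id.

    Queue order is preserved inside each group, so a worker can run
    its group serially.  The tuple layout is a hypothetical stand-in
    for the real task rows.
    """
    groups = OrderedDict()
    for action, target_id in queue:
        groups.setdefault(target_id, []).append(action)
    return groups


# The example from above: three queued tasks, two distinct VMs.
queue = [("start_vm", 1), ("start_vm", 2), ("stop_vm", 1)]
work = group_tasks_by_target(queue)

# One worker would be forked per key in `work`.
for target_id, actions in work.items():
    print("worker for target %d runs: %s" % (target_id, ", ".join(actions)))
```

The number of keys in `work` is the number of workers to fork, and each worker
gets the actions for its VM in the order they were queued.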

In the multi-host case, things are a bit simpler; the taskomatic running on each
individual machine would just wake up, find the first task that is not in
progress and not locked, and lock all task rows having to do with that VM.  Then
it would execute these tasks and go back to looking for more tasks.
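
The claim step in the multi-host case could look roughly like the sketch below.
It models the task table as an in-memory list of dicts, so the field names
(vm_id, state, locked_by) and the function name are hypothetical; in practice
the same logic would run inside a database transaction using real row locks
(e.g. SELECT ... FOR UPDATE in postgres).  Skipping VMs that already have a
locked row is an extra assumption on my part, to keep two taskomatics off the
same VM when new tasks arrive mid-run.

```python
def claim_next_vm(tasks, owner):
    """Claim all queued task rows for the first unclaimed VM.

    `tasks` is a list of dicts standing in for task rows (hypothetical
    fields: vm_id, state, locked_by).  Returns the claimed rows, or an
    empty list if there is nothing left to claim.
    """
    # VMs that some taskomatic is already working on.
    busy = {t["vm_id"] for t in tasks if t["locked_by"] is not None}
    for task in tasks:
        if task["state"] == "queued" and task["vm_id"] not in busy:
            claimed = [t for t in tasks
                       if t["vm_id"] == task["vm_id"] and t["state"] == "queued"]
            for t in claimed:
                t["locked_by"] = owner
            return claimed
    return []


def make_queue():
    """Build the three-task example queue used earlier in this mail."""
    return [
        {"action": "start_vm", "vm_id": 1, "state": "queued", "locked_by": None},
        {"action": "start_vm", "vm_id": 2, "state": "queued", "locked_by": None},
        {"action": "stop_vm", "vm_id": 1, "state": "queued", "locked_by": None},
    ]


tasks = make_queue()
first = claim_next_vm(tasks, "host-a")   # both VM 1 rows
second = claim_next_vm(tasks, "host-b")  # the VM 2 row
third = claim_next_vm(tasks, "host-c")   # nothing left
```

Each taskomatic would loop on something like claim_next_vm, execute what it
got, release the rows, and go back for more.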

Note that in both the standalone and multi-host cases, it's OK for multiple
taskomatics to be sending commands to the same managed node.  Libvirtd itself
is serial, so commands might get interleaved, but that's OK since we are
explicitly making sure our taskomatics work on different VMs or storage pools.

3)  Transaction support in taskomatic (hi slinaberry!).  I'm not sure about this
one; we are modifying state external to the database, so I'm not sure
"rolling-back" a transaction means a whole hill of beans to us.  In fact, I
might argue that rolling back is worse in this case; if you modified external
state, and then crashed, when you come back you might "roll-back" your VM state
to something that's totally invalid, and you'll need to be corrected by
host-status anyway.  Does anyone have further thoughts here?

THOUGHTS:
Interestingly, I think we can evolve the current taskomatic to do this, rather
than re-writing the thing from scratch.  Since we cleaned up error handling
and reporting, I actually feel a lot better about the state of
taskomatic.  It really just needs corner/error cases better handled, and then
introducing some of the above concepts one at a time.  Is there anything in
taskomatic right now that people are particularly unhappy about that might
warrant a re-write?

Chris Lalancette