[Avocado-devel] RFC: N(ext) Runner - A proposal to the finish line

Thu May 21 15:46:46 UTC 2020

Hi Cleber,

Thank you for opening this discussion here so we can try contributing to
this core component.

Since that this thread is bringing up a few discussions in different
parts (although related), for this reply, I will narrow down my comments
only with suggestions on how we could organize this effort, discussions
and ideas.

As you said, some items here could be an issue on Github (epic or not)
and some could be a blueprint for a proper review and discussion.

So, to split this work, have better estimations and track the progress
on this effort, one suggestion is to create goals or milestones with the
issues that have a high priority and that are blocking for any other
issue.

For each topic that you explained here, we could try to answer a few
questions:

 - Is this an epic issue? Or a single and quick fix issue?
 - If epic, what are the sub-issues?
 - Is this blocking any other issue?
 - Is this blocked by any other issue?
 - Difficult level?
 - Should we open a blueprint?

Maybe this is very clear in your mind already, but IMO would be nice to
have this also exposed to Github (not only the issues but tracking the
progress. Maybe with a 'project, milestone or kanban column).

Following this approach, I made some comments in-line:

On Wed, May 20, 2020 at 07:32:59PM -0400, Cleber Rosa wrote:
> Known issues and limitations of the current implementation
> ==========================================================
> 
> Different Test IDs
> ------------------
> 
> When running tests with the current runner, the Test IDs are different::
> 
>    $ avocado run --test-runner=runner --json=- -- /bin/true /bin/false /bin/uname | grep \"id\"
>             "id": "1-/bin/true",
>             "id": "2-/bin/false",
>             "id": "3-/bin/uname",
> 
>    $ avocado run --test-runner=nrunner --json=- -- /bin/true /bin/false /bin/uname | grep \"id\"
>             "id": "1-1-/bin/true",
>             "id": "2-2-/bin/false",
>             "id": "3-3-/bin/uname",
> 
> The goal is to make the IDs the same.

This seems to be a non-epic issue.

> Inability to run Tasks other than exec, exec-test, python-unittest (and noop)
> -----------------------------------------------------------------------------
> 
> The current implementation of the nrunner plugin is based on the fact that
> Tasks are already present at ``test_suite`` job attribute, and that running
> Tasks can be (but shouldn't always be) a matter of iterating of the result
> of its ``run()`` method.  This is part of the actual code::
> 
>     for status in task.run():
>       result_dispatcher.map_method('test_progress', False)
>       statuses.append(status)
> 
> The problem here is that only the Python classes implemented in the core
> "avocado.core.nrunner" module, and registered at:
> 
>   https://avocado-framework.readthedocs.io/en/79.0/api/core/avocado.core.html#avocado.core.nrunner.RUNNERS_REGISTRY_PYTHON_CLASS
> 
> The goal is to have all other Python classes that inherit from
> "avocado.core.nrunner.BaseRunner" available in such a registry.

IIUC, we can tackle this in multiple ways, from hard coded dicts to
decorators and "register" methods, so that third-party Runners can
register itself without the need to change the "avocado core code".

At first glance, this seems to not be a "blueprint" case, so I would
like to suggest opening a GH issue and if the discussion evolves to a
complex solution, then we could think of a blueprint.  Probably you
already have thought about this, so, maybe when opening the issue add
there your suggestion to tackle this.

> Inability to run Tasks with Spawners
> ------------------------------------
> 
> While the "avocado nrun" command makes use of the Spawners, the
> current implementation of the nrunner plugin described earlier,
> calls a Task's ``run()`` method directly, and clearly doesn't
> use spawners.
> 
> The goal here is to leverage spawners so that other isolation
> models (or execution environments, depending how you look at
> processes, containers, etc) are supported.

Same as previous comment.

> Unoptmized execution of Tasks (extra serialization/deserialization)
> -------------------------------------------------------------------
> 
> At this time, the nrunner plugin runs a Task directly through its
> ``run()`` method.  Besides the earlier point of not supporting
> other isolation models/execution environments (that means not using
> spawners), there's an extra layer of work happening when running
> a task which is most often not necessary: turning a Task instance
> into a command line, and within its execution, turning it into a
> Task instance again.
> 
> The goal is to support an optmized execution of the tasks, without
> having to turn them into command lines, and back into Task instances.
> The idea is already present in the spawning method definitions:
> 
>   https://avocado-framework.readthedocs.io/en/79.0/api/core/avocado.core.spawners.html#avocado.core.spawners.common.SpawnMethod.PYTHON_CLASS
> 
> And a PoC on top of the ``nrun`` command was implemented here:
> 
>   https://github.com/avocado-framework/avocado/pull/3766/commits/ae57ee78df7f2935e40394cdfc72a34b458cdcef

IMO, this is a core and important change, that is introducing or
changing the task 'state machine', and maybe could receive an own
blueprint just to discuss this.

> Proposal
> ========
> 
> Besides the known limitations listed previously, there are others that
> will appear along the way, and certainly some new challeges as we
> solve them.
> 
> The goal of this proposal is to attempt to identify those challenges,
> and lay out a plan that can be tackled by the Avocado team/community
> and not by a single person.
> 
> Task execution coordination goals
> ---------------------------------
> 
> As stated earlier, to run a job, tasks must be executed. Differently
> than the current runner, the N(ext) Runner architecture allows those
> to be executed in a much more decoupled way. This characteristic will
> be maintained, but it needs to be adapted into the current Job
> execution.
> 
> From a high level view, the nrunner plugin needs to:
> 
> 1. Break apart from the "one at a time" Task execution model that it
>    currently employs;
> 
> 2. Check if a Tasks can be executed, that is, if its requirements can
>    be fulfilled (the most basic requirement for a task is a matching
>    runner;
> 
> 3. Prepare for the execution of a task, such as the fulfillment of
>    extra tasks requirements. The requirements resolver is one, if not
>    the only way, component that should be given a chance to act here;
> 
> 4. Executes a task in prepared environment;
> 
> 5. Monitor the execution of a task (from an external PoV);
> 
> 6. Collect the status messages that tasks will send;
> 
>    a. Forward the status messages to the appropriate job components,
>       such as the result plugins.
> 
>    b. Depending on the content of messages, such as the ones
>       containing "status: started" or "status: finished", interfere in
>       the Task execution status, and consequently, in the Job
>       execution status.
> 
> 7. Verify, warn the user, and attempt to clean up stray tasks.  This
>    may be for instance, necessary if a Task on a container seems to
>    be stuck, and the container can not be destroyed.  The same applies
>    to process in some time of uninterruptile sleeps.

Maybe this could be an epic issue?

> Suggested terminology
> ---------------------
> 
> Task execution has been requested
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> A Task whose execution was requested by the user.  All of the tasks on
> a Job's ``test_suite`` attribute are requested tasks.
> 
> If a software component deals with this type of task, it's advisable
> that it refers to ``TASK_REQUESTED`` or ``requested_tasks`` or a
> similar name that links to this definition.
> 
> Task is being triaged
> ~~~~~~~~~~~~~~~~~~~~~
> 
> The details of the task are being analyzed, including and most
> importantly the ability of the system to *attempt* to fulfill its
> requirements. A task leaves triage and it's either considered
> "discarded" or proceeds to be prepared and then executed.
> 
> If a software component deals with this type of task, for instance if
> a "task scheduler" is looking for runners matching the Task's kind, it
> should keep it under a ``tasks_under_triage`` or mark the tasks as
> ``UNDER_TRIAGE`` or ``TRIAGING`` a similar name that links to this
> definition.
> 
> Task is being prepared
> ~~~~~~~~~~~~~~~~~~~~~~
> 
> Task has left triage, and it has not been discarded, that is, it's
> a candidate to be setup, and if it goes well, executed.
> 
> The requirements for a task are being prepared in its reespective
> isolation model/execution environment, that is, the spawner it'll
> be executed with is known, and the setup actions will be visible
> by the task.
> 
> If a software component deals with this type of task, for instance the
> implementation of resolution of specific requirements, it should
> should keep it under a ``tasks_preparing`` or mark the tasks as
> ``PREPARING`` or a similar name that links to this definition.
> 
> Task is ready to be started
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> Task has been prepared succesfully, and can now be executed.
> 
> If a software component deals with this type of task, it should
> should keep it under a ``tasks_ready`` or mark the tasks as
> ``READY`` or a similar name that links to this definition.
> 
> Task is being started
> ~~~~~~~~~~~~~~~~~~~~~
> 
> A hopefully short lived state, in which a task that is ready to be started
> (see previous point) will be given to the reespective spawner to be started.
> 
> If a software component deals with this type of task, it should should
> keep it under a ``tasks_starting`` or mark the tasks as ``STARTING``
> or a similar name that links to this definition.
> 
> The spawner should know if the starting of the task succeeded or failed,
> and the task should be categorized accordingly.
> 
> Task has been started
> ~~~~~~~~~~~~~~~~~~~~~
> 
> A task was successfully started by a spawner.
> 
> Note that it does *not* mean that the test that the task runner (say,
> an "avocado-runner-$kind task-run" command) will run has already been
> started.  This will be signalled by a "status: started" kind of
> message.
> 
> If a software component deals with this type of task, it should should
> keep it under a ``tasks_started`` or mark the tasks as ``STARTED`` or
> a similar name that links to this definition.
> 
> Task has failed to start
> ~~~~~~~~~~~~~~~~~~~~~~~~
> 
> Quite self explanatory. If the spawner failed to start a task, it
> should be kept under a ``tasks_failed_to_start`` structure or be
> marked as ``FAILED_TO_START`` or a similar name that links to this
> definition.
> 
> Task is finished
> ~~~~~~~~~~~~~~~~
> 
> This means that the task has started, and is now finished.  There's no
> associated meaning here about the pass/fail output of the test payload
> executed by the task.
> 
> It should be kept under a ``tasks_finished`` structure or be marked as
> ``FINISHED`` or a similar name that links to this definition.
> 
> Task has been interrupted
> ~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> This means that the task has started, but has not finished and it's
> past due.
> 
> It should be kept under a ``tasks_interrupted`` structure or be marked
> as ``INTERRUPTED`` or a similar name that links to this definition.
>
> Task workflow
> -------------

I might be wrong, but this section and the previous one looks very close
to a "life cycle/state machine" of a scheduler system. Something like
the HPC job scheduler life-cycle, like Slurm, Condor, and many other
systems. For sure this is a great good start.

I do have some suggestions here, and questions about each state that you
designed, maybe I would avoid one or two states, but looks like this
should be part of a proper blueprint, maybe the same blueprint where we
can discuss the "Task Scheduler" implementation along with the "Task
Life Cycle".

> Summary
> =======
> 
> This proposal contains a number of items that can become GitHub issues
> at this stage.  It also contains a general explanation of what I believe
> are the crucial missing features to make the N(ext) Runner implementation
> available to the general public.
> 
> Feedback is highly appreciated, and it's expected that this document will
> evolve into a better version, and possibly become a formal Blue Print.

+1 for the blueprint on the 'Task Scheduler/Life Cycle". ;)

As I said before, would be nice to have this big picture that you
describe here as issues/milestones/projects/columns on GH so we can
split this effort and have a better understanding of the this progress.

Again, thanks for dumping your ideas here for discussion.

--
Beraldo