[Avocado-devel] RFC: N(ext) Runner - A proposal to the finish line

Wed Jun 10 20:31:26 UTC 2020

On Thu, May 21, 2020 at 12:46:46PM -0300, Beraldo Leal wrote:
> Hi Cleber,
> 
> Thank you for opening this discussion here so we can try contributing to
> this core component.
> 
> Since that this thread is bringing up a few discussions in different
> parts (although related), for this reply, I will narrow down my comments
> only with suggestions on how we could organize this effort, discussions
> and ideas.
> 
> As you said, some items here could be an issue on Github (epic or not)
> and some could be a blueprint for a proper review and discussion.
> 
> So, to split this work, have better estimations and track the progress
> on this effort, one suggestion is to create goals or milestones with the
> issues that have a high priority and that are blocking for any other
> issue.
>

Hi Beraldo,

As we've defined that the project goals for the next release (81.0)
include both the Job API and the N(ext) Runner, it simplifies the
planning because we have *one* milestone for all related tasks.

> For each topic that you explained here, we could try to answer a few
> questions:
> 
>  - Is this an epic issue? Or a single and quick fix issue?
>  - If epic, what are the sub-issues?

For the N(ext) Runner, this epic issue has available:

   https://github.com/avocado-framework/avocado/issues/3700

>  - Is this blocking any other issue?
>  - Is this blocked by any other issue?
>  - Difficult level?
>  - Should we open a blueprint?

I think only the scheduler one is worthy of blueprint at this time.  

>

The N(ext) Runner related issues have been labeled with "nrun2run"
label:

  https://github.com/avocado-framework/avocado/issues?q=is%3Aissue+label%3Anrun2run

> Maybe this is very clear in your mind already, but IMO would be nice to
> have this also exposed to Github (not only the issues but tracking the
> progress. Maybe with a 'project, milestone or kanban column).
> 
> Following this approach, I made some comments in-line:
> 
> On Wed, May 20, 2020 at 07:32:59PM -0400, Cleber Rosa wrote:
> > Known issues and limitations of the current implementation
> > ==========================================================
> > 
> > Different Test IDs
> > ------------------
> > 
> > When running tests with the current runner, the Test IDs are different::
> > 
> >    $ avocado run --test-runner=runner --json=- -- /bin/true /bin/false /bin/uname | grep \"id\"
> >             "id": "1-/bin/true",
> >             "id": "2-/bin/false",
> >             "id": "3-/bin/uname",
> > 
> >    $ avocado run --test-runner=nrunner --json=- -- /bin/true /bin/false /bin/uname | grep \"id\"
> >             "id": "1-1-/bin/true",
> >             "id": "2-2-/bin/false",
> >             "id": "3-3-/bin/uname",
> > 
> > The goal is to make the IDs the same.
> 
> This seems to be a non-epic issue.
>

Right, and a non-issue anymore, given that it was resolved in version 80.0:

   https://github.com/avocado-framework/avocado/issues/3864

> > Inability to run Tasks other than exec, exec-test, python-unittest (and noop)
> > -----------------------------------------------------------------------------
> > 
> > The current implementation of the nrunner plugin is based on the fact that
> > Tasks are already present at ``test_suite`` job attribute, and that running
> > Tasks can be (but shouldn't always be) a matter of iterating of the result
> > of its ``run()`` method.  This is part of the actual code::
> > 
> >     for status in task.run():
> >       result_dispatcher.map_method('test_progress', False)
> >       statuses.append(status)
> > 
> > The problem here is that only the Python classes implemented in the core
> > "avocado.core.nrunner" module, and registered at:
> > 
> >   https://avocado-framework.readthedocs.io/en/79.0/api/core/avocado.core.html#avocado.core.nrunner.RUNNERS_REGISTRY_PYTHON_CLASS
> > 
> > The goal is to have all other Python classes that inherit from
> > "avocado.core.nrunner.BaseRunner" available in such a registry.
> 
> IIUC, we can tackle this in multiple ways, from hard coded dicts to
> decorators and "register" methods, so that third-party Runners can
> register itself without the need to change the "avocado core code".
>

Right.

> At first glance, this seems to not be a "blueprint" case, so I would
> like to suggest opening a GH issue and if the discussion evolves to a
> complex solution, then we could think of a blueprint.  Probably you
> already have thought about this, so, maybe when opening the issue add
> there your suggestion to tackle this.
>

Agreed, not blueprint worthy, and the issue is here:

   https://github.com/avocado-framework/avocado/issues/3865

> > Inability to run Tasks with Spawners
> > ------------------------------------
> > 
> > While the "avocado nrun" command makes use of the Spawners, the
> > current implementation of the nrunner plugin described earlier,
> > calls a Task's ``run()`` method directly, and clearly doesn't
> > use spawners.
> > 
> > The goal here is to leverage spawners so that other isolation
> > models (or execution environments, depending how you look at
> > processes, containers, etc) are supported.
> 
> Same as previous comment.
>

Agreed, issue is:

   https://github.com/avocado-framework/avocado/issues/3866

> > Unoptmized execution of Tasks (extra serialization/deserialization)
> > -------------------------------------------------------------------
> > 
> > At this time, the nrunner plugin runs a Task directly through its
> > ``run()`` method.  Besides the earlier point of not supporting
> > other isolation models/execution environments (that means not using
> > spawners), there's an extra layer of work happening when running
> > a task which is most often not necessary: turning a Task instance
> > into a command line, and within its execution, turning it into a
> > Task instance again.
> > 
> > The goal is to support an optmized execution of the tasks, without
> > having to turn them into command lines, and back into Task instances.
> > The idea is already present in the spawning method definitions:
> > 
> >   https://avocado-framework.readthedocs.io/en/79.0/api/core/avocado.core.spawners.html#avocado.core.spawners.common.SpawnMethod.PYTHON_CLASS
> > 
> > And a PoC on top of the ``nrun`` command was implemented here:
> > 
> >   https://github.com/avocado-framework/avocado/pull/3766/commits/ae57ee78df7f2935e40394cdfc72a34b458cdcef
> 
> IMO, this is a core and important change, that is introducing or
> changing the task 'state machine', and maybe could receive an own
> blueprint just to discuss this.
>

I think we will not be able to put it aside, as we're implementing the
scheduler.  But, time will tell.  Anyway, there's an issue to track it
here:

   https://github.com/avocado-framework/avocado/issues/3867

> > Proposal
> > ========
> > 
> > Besides the known limitations listed previously, there are others that
> > will appear along the way, and certainly some new challeges as we
> > solve them.
> > 
> > The goal of this proposal is to attempt to identify those challenges,
> > and lay out a plan that can be tackled by the Avocado team/community
> > and not by a single person.
> > 
> > Task execution coordination goals
> > ---------------------------------
> > 
> > As stated earlier, to run a job, tasks must be executed. Differently
> > than the current runner, the N(ext) Runner architecture allows those
> > to be executed in a much more decoupled way. This characteristic will
> > be maintained, but it needs to be adapted into the current Job
> > execution.
> > 
> > From a high level view, the nrunner plugin needs to:
> > 
> > 1. Break apart from the "one at a time" Task execution model that it
> >    currently employs;
> > 
> > 2. Check if a Tasks can be executed, that is, if its requirements can
> >    be fulfilled (the most basic requirement for a task is a matching
> >    runner;
> > 
> > 3. Prepare for the execution of a task, such as the fulfillment of
> >    extra tasks requirements. The requirements resolver is one, if not
> >    the only way, component that should be given a chance to act here;
> > 
> > 4. Executes a task in prepared environment;
> > 
> > 5. Monitor the execution of a task (from an external PoV);
> > 
> > 6. Collect the status messages that tasks will send;
> > 
> >    a. Forward the status messages to the appropriate job components,
> >       such as the result plugins.
> > 
> >    b. Depending on the content of messages, such as the ones
> >       containing "status: started" or "status: finished", interfere in
> >       the Task execution status, and consequently, in the Job
> >       execution status.
> > 
> > 7. Verify, warn the user, and attempt to clean up stray tasks.  This
> >    may be for instance, necessary if a Task on a container seems to
> >    be stuck, and the container can not be destroyed.  The same applies
> >    to process in some time of uninterruptile sleeps.
> 
> Maybe this could be an epic issue?
>

I think so, yes. Along with also producing a blueprint.

> > Suggested terminology
> > ---------------------
> > 
> > Task execution has been requested
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > 
> > A Task whose execution was requested by the user.  All of the tasks on
> > a Job's ``test_suite`` attribute are requested tasks.
> > 
> > If a software component deals with this type of task, it's advisable
> > that it refers to ``TASK_REQUESTED`` or ``requested_tasks`` or a
> > similar name that links to this definition.
> > 
> > Task is being triaged
> > ~~~~~~~~~~~~~~~~~~~~~
> > 
> > The details of the task are being analyzed, including and most
> > importantly the ability of the system to *attempt* to fulfill its
> > requirements. A task leaves triage and it's either considered
> > "discarded" or proceeds to be prepared and then executed.
> > 
> > If a software component deals with this type of task, for instance if
> > a "task scheduler" is looking for runners matching the Task's kind, it
> > should keep it under a ``tasks_under_triage`` or mark the tasks as
> > ``UNDER_TRIAGE`` or ``TRIAGING`` a similar name that links to this
> > definition.
> > 
> > Task is being prepared
> > ~~~~~~~~~~~~~~~~~~~~~~
> > 
> > Task has left triage, and it has not been discarded, that is, it's
> > a candidate to be setup, and if it goes well, executed.
> > 
> > The requirements for a task are being prepared in its reespective
> > isolation model/execution environment, that is, the spawner it'll
> > be executed with is known, and the setup actions will be visible
> > by the task.
> > 
> > If a software component deals with this type of task, for instance the
> > implementation of resolution of specific requirements, it should
> > should keep it under a ``tasks_preparing`` or mark the tasks as
> > ``PREPARING`` or a similar name that links to this definition.
> > 
> > Task is ready to be started
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > 
> > Task has been prepared succesfully, and can now be executed.
> > 
> > If a software component deals with this type of task, it should
> > should keep it under a ``tasks_ready`` or mark the tasks as
> > ``READY`` or a similar name that links to this definition.
> > 
> > Task is being started
> > ~~~~~~~~~~~~~~~~~~~~~
> > 
> > A hopefully short lived state, in which a task that is ready to be started
> > (see previous point) will be given to the reespective spawner to be started.
> > 
> > If a software component deals with this type of task, it should should
> > keep it under a ``tasks_starting`` or mark the tasks as ``STARTING``
> > or a similar name that links to this definition.
> > 
> > The spawner should know if the starting of the task succeeded or failed,
> > and the task should be categorized accordingly.
> > 
> > Task has been started
> > ~~~~~~~~~~~~~~~~~~~~~
> > 
> > A task was successfully started by a spawner.
> > 
> > Note that it does *not* mean that the test that the task runner (say,
> > an "avocado-runner-$kind task-run" command) will run has already been
> > started.  This will be signalled by a "status: started" kind of
> > message.
> > 
> > If a software component deals with this type of task, it should should
> > keep it under a ``tasks_started`` or mark the tasks as ``STARTED`` or
> > a similar name that links to this definition.
> > 
> > Task has failed to start
> > ~~~~~~~~~~~~~~~~~~~~~~~~
> > 
> > Quite self explanatory. If the spawner failed to start a task, it
> > should be kept under a ``tasks_failed_to_start`` structure or be
> > marked as ``FAILED_TO_START`` or a similar name that links to this
> > definition.
> > 
> > Task is finished
> > ~~~~~~~~~~~~~~~~
> > 
> > This means that the task has started, and is now finished.  There's no
> > associated meaning here about the pass/fail output of the test payload
> > executed by the task.
> > 
> > It should be kept under a ``tasks_finished`` structure or be marked as
> > ``FINISHED`` or a similar name that links to this definition.
> > 
> > Task has been interrupted
> > ~~~~~~~~~~~~~~~~~~~~~~~~~
> > 
> > This means that the task has started, but has not finished and it's
> > past due.
> > 
> > It should be kept under a ``tasks_interrupted`` structure or be marked
> > as ``INTERRUPTED`` or a similar name that links to this definition.
> >
> > Task workflow
> > -------------
> 
> I might be wrong, but this section and the previous one looks very close
> to a "life cycle/state machine" of a scheduler system. Something like
> the HPC job scheduler life-cycle, like Slurm, Condor, and many other
> systems. For sure this is a great good start.
>

It definitely has a lot of similarities, but it *must* be simpler than
those, as we want a trivial "bootstrap" (really, meaning *no* boostrap).
That is, users should continue to be able to running tests with Avocado
as they've done before (avocado list ..., avocado run ...).

> I do have some suggestions here, and questions about each state that you
> designed, maybe I would avoid one or two states, but looks like this
> should be part of a proper blueprint, maybe the same blueprint where we
> can discuss the "Task Scheduler" implementation along with the "Task
> Life Cycle".
>

Got it.  If you look at my initial implementation of this idea, you'll
find that it was a lot simpler than this, with less states.  Then, to
avoid not addressing all questions, I decided to be more verbose here.
Maybe we'll meet in the middle.

> > Summary
> > =======
> > 
> > This proposal contains a number of items that can become GitHub issues
> > at this stage.  It also contains a general explanation of what I believe
> > are the crucial missing features to make the N(ext) Runner implementation
> > available to the general public.
> > 
> > Feedback is highly appreciated, and it's expected that this document will
> > evolve into a better version, and possibly become a formal Blue Print.
> 
> +1 for the blueprint on the 'Task Scheduler/Life Cycle". ;)
> 
> As I said before, would be nice to have this big picture that you
> describe here as issues/milestones/projects/columns on GH so we can
> split this effort and have a better understanding of the this progress.
> 

Hopefully the issues and labels we have now are enough.

> Again, thanks for dumping your ideas here for discussion.
> 
> --
> Beraldo
> 

Thanks for the feedback, and let's follow up with the blueprint
discussion.

- Cleber.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 833 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/avocado-devel/attachments/20200610/3b968e6f/attachment.sig>