[Avocado-devel] Avocado's Task Resource Management

Mon Jul 31 21:18:08 UTC 2023

This write up is related to Avocado's issue #4994, which can be found
at https://github.com/avocado-framework/avocado/issues/4994 .

Intro
=====

Avocado's nrunner architecture separates the component that prepares and
starts the environment (a spawner) from the task (usually a test, but
also a test's requirements) that will be executed in it.  A problem
arises when either the spawner or the task itself allocates resources
that later need to be cleaned up.

Spawners may have limited visibility on the resources created by a
task (either directly or by the underlying runner).  For instance, the
``exec-test`` runner will create a new process that will actually run
the executable test.  If such a process misbehaves and hangs, it may
be left consuming CPU resources "forever".  That is not ideal, and
"someone" should clean up after the misbehaving executable.

The question becomes: whose responsibility is it to keep track of and
clean up such resources?  The obvious choices are:

1. The spawner that started the task
2. The runner (created by the task) that actually created the resource

The biggest problem with the first option (the spawner) is that it may
have limited (or too coarse) visibility on the resources that were
actually created by the task (or runner).  The biggest problem with
the second option is that every runner will need to implement similar
resource tracking and clean up.

Using the existing spawners, and also the spawners under development,
we can give some more concrete examples:

A. The "process" spawner is able to know with good enough confidence
   that one or more processes were created, either directly (the task
   process) or by the runner within the task.  It can infer that
   because it can look for all child processes of the task it
   created.
B. The "podman" spawner is able to leverage the container technology
   itself to clean up all the resources within the container (such as
   the many processes that may have been created by the task and
   runner).
C. The "remote" spawner may have a much harder time identifying
   resources that were created by a task or runner it started.  It may
   require multiple sessions or multiple remote command executions to
   query the child processes of the task.  If the system is
   unresponsive because of a runaway task or test, that will become
   harder or even impossible, and the "remote" spawner may only be
   able to clean up the session it currently holds.
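
As an illustration of item A, the child lookup can be sketched in Python
as below.  This is only a sketch, not Avocado's implementation: the
descendants() helper and the PID snapshot are hypothetical, and how the
PID-to-parent mapping is collected from the system is deliberately left
out.

```python
from collections import defaultdict, deque


def descendants(root_pid, parent_of):
    """Return every PID that descends from root_pid, breadth-first.

    parent_of maps each known PID to its parent PID.  It is assumed to
    be a snapshot of the process table taken by the caller.
    """
    # Invert the pid -> parent mapping into parent -> [children]
    children = defaultdict(list)
    for pid, ppid in parent_of.items():
        children[ppid].append(pid)

    found = []
    seen = {root_pid}
    queue = deque([root_pid])
    while queue:
        current = queue.popleft()
        for child in sorted(children[current]):
            if child not in seen:
                seen.add(child)
                found.append(child)
                queue.append(child)
    return found


# Hypothetical snapshot: task 100 started runner 101, which started the
# processes 102 and 103.  PID 200 is unrelated to the task.
snapshot = {100: 1, 101: 100, 102: 101, 103: 101, 200: 1}
print(descendants(100, snapshot))
```

A walk like this misses any process that was re-parented after its
original parent exited, which is one reason the result can only be
trusted with "good enough" confidence.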

So far, the scope of this resource management discussion is limited to
processes.  The situation changes completely if other resources, such
as persistent storage, are considered.  The problem is not so much
cleaning up the resources themselves (such as removing a directory
created by a task or runner), but how to clean them up in different
environments (say locally with the process spawner, on a different
machine with the remote spawner, etc).  This is left to be discussed
at a later time.

Proposal
========

For the first milestone of this work, my proposal is to:

1. Properly document the capabilities of every spawner, including
   its potential ability and strategy to destroy a task.
2. Provide the currently missing, but expected, capabilities of
   spawners with regard to resource cleanup.
3. Present to the user the list of resources that the spawner could
   not guarantee were destroyed.

Documentation
-------------

Letting users know about the capabilities and caveats of each spawner
is the first step towards predictability and a more complete (future)
set of features regarding resource management.

The goal here is to let users know what they can rely on, and what
they can't.  For instance, it'd be fair to document the "remote"
spawner more or less along the following lines:

"The remote spawner can properly destroy the SSH session it starts and
maintains with the remote machine, and does a best effort attempt to
destroy the task it started, but does **not** attempt to find and
clean up all processes that the runner started".

Missing Features
----------------

The spawner interface "destroy_task()" currently does not check
whether its attempt to clean up the task was successful.  Also, child
processes of a task are not accounted for.

To improve this situation, the following can and should be done:

1. The process spawner should verify that the task process was
   terminated.
2. The process spawner should try harder to terminate (and then kill)
   the task's child processes.
3. The podman spawner should make sure that the container was fully
   terminated.
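
Items 1 and 2 can be sketched roughly as below.  This is a minimal
POSIX-only sketch under stated assumptions, not the actual
"destroy_task()" code: the destroy_process_group() helper is
hypothetical, and it assumes the task was started in its own session so
that the task and its children share one process group.  Only the exit
of the task process itself is verified; the children receive the same
signals but are not checked individually.

```python
import os
import signal
import subprocess
import sys


def destroy_process_group(proc, timeout=2.0):
    """Terminate, then kill, the process group led by proc.

    Returns True only if the leading process is confirmed to have
    exited.  Assumes proc was started with start_new_session=True, so
    that its PID is also its process group ID.
    """
    for sig in (signal.SIGTERM, signal.SIGKILL):
        try:
            os.killpg(proc.pid, sig)
        except ProcessLookupError:
            # No such process group anymore, nothing left to signal
            break
        try:
            proc.wait(timeout=timeout)
            return True
        except subprocess.TimeoutExpired:
            # Still alive after this signal, escalate to the next one
            pass
    return proc.poll() is not None


# Stand-in for a hanging task: a process that sleeps for a long time
task = subprocess.Popen(
    [sys.executable, "-c", "import time; time.sleep(60)"],
    start_new_session=True,
)
print("destroyed:", destroy_process_group(task))
```

The boolean return value is the point of the sketch: it gives the
caller something to act on when the clean up did not succeed.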

Accountability
--------------

By keeping track of the success (or failure) of each attempt to clean
up resources, it's possible to let users know about missed resources
so that they can take manual action to clean them up.

For instance, if the podman spawner fails to terminate a container, it
should, at the very least, log a message such as:

"PodmanSpawner failed to destroy container with ID deadbeefdeadbeef.
Podman reported error: xxxxxxxxx".
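
One possible shape for this bookkeeping is sketched below.  The
CleanupResult class, the report_leftovers() function and the logger
name are all hypothetical and are not part of Avocado; the container ID
and error text are the placeholders from the message above.

```python
import logging
from dataclasses import dataclass

LOG = logging.getLogger("spawner.cleanup")  # hypothetical logger name


@dataclass
class CleanupResult:
    """Outcome of one attempt to destroy one resource."""

    spawner: str
    resource: str
    destroyed: bool
    error: str = ""


def report_leftovers(results):
    """Log every resource that could not be destroyed and return them."""
    leftovers = [result for result in results if not result.destroyed]
    for result in leftovers:
        LOG.error(
            "%s failed to destroy %s. Reported error: %s",
            result.spawner,
            result.resource,
            result.error or "unknown",
        )
    return leftovers


results = [
    CleanupResult("ProcessSpawner", "process with PID 1234", True),
    CleanupResult(
        "PodmanSpawner",
        "container with ID deadbeefdeadbeef",
        False,
        "xxxxxxxxx",
    ),
]
leftovers = report_leftovers(results)
```

Returning the leftovers, in addition to logging them, would let a job
present the complete list to the user at the end of the run.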

Future
======

In the future, the resource management could be expanded such that:

a. Different resources are accounted for (for example, storage)
b. Spawners and resources can be "mixed and matched", such as
   being able to manage a "storage" resource on both the "process"
   and "remote" spawners.

There's no attempt at this point to determine the effort and
feasibility of any of these.