[Avocado-devel] Avocado's Task Resource Management

Plamen Dimitrov plamen.dimitrov at pevogam.com
Fri Aug 4 19:50:40 UTC 2023


Hi Cleber,

Some comments on my side regarding your thoughts and proposal:

On 8/1/23 00:18, Cleber Rosa wrote:
> This write up is related to Avocado's issue #4994, which can be found
> at https://github.com/avocado-framework/avocado/issues/4994 .
> 
> Intro
> =====
> 
> Avocado's nrunner architecture separates the components that prepare and
> start the environment (a spawner) where a task (usually a test, but
> also test's requirements) will be executed.  A problem arises when
> either the spawner or the task itself allocates resources that later
> need to be cleaned up.
> 
> Spawners may have limited visibility on the resources created by a
> task (either directly or by the underlying runner).  For instance, the
> ``exec-test`` runner will create a new process that will actually run
> the executable test.  If such a process misbehaves and hangs, it may
> be left consuming CPU resources "forever".  That is not ideal, and
> "someone" should clean up after the misbehaving executable.
> 
> The question becomes: whose responsibility is it to keep track of and
> clean up such resources?  The obvious choices are:
> 
> 1. The spawner that started the task
> 2. The runner (created by the task) that actually created the resource

So far I tend to think that resource management falls under the umbrella
of the spawner, since the resources used and the configuration of their limits
depend entirely on the type of spawner used (e.g. max-number-of-processes,
open files, RAM allocation, cgroup management, etc. for LXC containers).

> The biggest problem with the first option (the spawner) is that it may
> have limited (or too coarse) visibility on the resources that were
> actually created by the task (or runner).

I think this can be remedied if only the information strictly pertaining to
the task is established in advance, and any information regarding the task's
environment (and thus its resources) may only be set by the spawner.
Or, if I misunderstand, do you mind giving some good examples of
what you mean by "resources" here?

> The biggest problem with
> the second option is that every runner will need to implement similar
> resource tracking and clean up.

I think functionality relating to the environment must be part of the
"environment manager" we call spawner.

> Using the existing spawners, and also the spawners under development,
> as examples, we can give some more concrete examples:
> 
> A. The "process" spawner is able to know with good enough confidence
>     that one or more processes were created, either directly (the task
>     process) or by the runner within the task.  It can guess that
>     because it can look for all children processes of the task it
>     created.

I think in all situations we have means of looking up the child processes
spawned by a top task process, whether local, remote, or within some
container namespace.
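
A rough sketch of such a lookup, not taken from Avocado's code: it assumes
the spawner can obtain the output of `ps -e -o pid= -o ppid=` from the
task's environment (run locally, over SSH, or inside the container). The
function names `parse_ps` and `descendants` are made up for illustration.

```python
from collections import defaultdict


def parse_ps(output):
    """Parse the output of `ps -e -o pid= -o ppid=` into {pid: ppid}."""
    table = {}
    for line in output.splitlines():
        fields = line.split()
        if len(fields) == 2:
            table[int(fields[0])] = int(fields[1])
    return table


def descendants(ppid_of, root):
    """Return the sorted PIDs of every process below `root`."""
    children = defaultdict(list)
    for pid, ppid in ppid_of.items():
        children[ppid].append(pid)
    found, seen, stack = [], {root}, [root]
    while stack:
        for child in children[stack.pop()]:
            if child not in seen:  # guard against malformed tables
                seen.add(child)
                found.append(child)
                stack.append(child)
    return sorted(found)
```

Only the way the `ps` output is fetched would differ between spawners; the
tree walk itself stays the same.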

> B. The "podman" spawner is able to leverage the container technology
>     itself to clean up all the resources within the container (such as
>     many processes that may have been created by the task and runner).

And so can the LXC spawner so far.

> C. The "remote" spawner may have a much harder time to identify
>     resources that were created by a task or runner it started.  It may
>     require multiple sessions or multiple remote command executions to
>     query the children processes of the task.  If the system is
>     unresponsive because of a runaway task or test, that will become
>     harder or even impossible, and the "remote" spawner may only be
>     able to clean up the session it currently holds.

Since the processes are *not* started with a nohup option, they are
usually not allowed to persist once we detach from the session. This should
then also serve as an easy way to clean up any processes that
were started within the original session.

> So far, the scope of this resource management discussion is limited to
> processes.  The situation changes completely if resources, such as
> persistent storage, are considered.  The problem is not so much related
> to cleaning up resources themselves (such as removing a directory
> created by a task or runner), but how to clean up those resources in
> different environments (say locally with the process spawner, in a
> different machine with the remote spawner, etc).  Again, this is left
> to be discussed at a later time.

Can you provide an example of persistent storage? What would we like to
store persistently regarding a task?

> Proposal
> ========
> 
> For the first milestone of this work, my proposal is to:
> 
> 1. Properly document the capabilities of every spawner, including
>     their potential ability and strategy to destroy a task.
> 2. Provide the currently missing, but expected, capabilities of
>     spawners with regards to resources clean up.
> 3. Present to the user the list of resources that the spawner
>     could not guarantee were destroyed.

So far, at least to the best of my understanding, cleanup is quite easy
to do in all the cases above, but nothing stops us from documenting any
limitations (or the lack of them) that we are aware of.

> Documentation
> -------------
> 
> Letting users know about the capabilities and caveats of each spawner
> is the first step towards predictability and a more complete (future)
> set of features regarding resource management.
> 
> The goal here is to let users know what they can rely on, and what
> they can't.  For instance, it'd be fair to document the "remote"
> spawner more or less along the following lines:
> 
> "The remote spawner can properly destroy the SSH session it starts and
> maintains with the remote machine, and does a best effort attempt to
> destroy the task it started, but does **not** attempt to find and
> clean up all processes that the runner started".

AFAIK, neither does any other spawner. Should we rather move this into a
general set of notes covering the scope that all spawners share?

> Missing Features
> ----------------
> 
> The spawner interface "destroy_task()" currently won't bother checking
> if its attempt to clean up the task was successful.  Also, children
> processes of a task are not accounted for.
> 
> To improve this situation the following can and should be done:
> 
> 1. The process spawner should verify if the task process was
>     terminated.
> 2. The process spawner should try harder to terminate (and then kill)
>     the task's children processes.
> 3. The podman spawner should make sure that the container was fully
>     terminated.

All spawners can make a better effort to do such cleanup; I am just not
sure how necessary this is if the environment itself allows for easy cleanup.
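
For the process spawner, points 1 and 2 could look roughly like the sketch
below. It is not Avocado's `destroy_task()`; `start_task` and `destroy` are
hypothetical helpers, and it assumes a POSIX system where the task is started
in its own session so that its children share its process group.

```python
import os
import signal
import subprocess


def start_task(argv):
    # start_new_session=True makes the task a session and group leader;
    # children stay in its group unless they create their own.
    return subprocess.Popen(argv, start_new_session=True)


def destroy(proc, grace=2.0):
    """Send SIGTERM, then SIGKILL, to the task's process group.

    Returns True only if the task process itself was confirmed gone.
    """
    gone = False
    for sig in (signal.SIGTERM, signal.SIGKILL):
        try:
            os.killpg(proc.pid, sig)
        except ProcessLookupError:
            pass  # nothing left in the group
        try:
            proc.wait(timeout=grace)
            gone = True
        except subprocess.TimeoutExpired:
            pass  # still alive, escalate or give up
    return gone
```

Note that the return value only verifies the task process; leftover children
in the group receive SIGKILL but are not individually checked, and children
that moved to another session are not reached at all.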

> Accountability
> --------------
> 
> By keeping track of the success (or failure) to clean up resources,
> it's possible to let users know and manually take action to clean up
> missed resources.
> 
> For instance, if the podman spawner fails to terminate a container, it
> should, at the very least, log a message such as:
> 
> "PodmanSpawner failed to destroy container with ID deadbeefdeadbeef.
> Podman reported error: xxxxxxxxx".

Good error reporting is vital to all operations of the spawners, not
just the cleanup, so I guess we should rather make a better effort there
in general.
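
As an illustration only, a generic helper along these lines could serve any
spawner operation. The class name `CleanupLog`, the logger name, and the
message wording for the success case are assumptions; the failure message
follows the example quoted above.

```python
import logging

LOG = logging.getLogger("spawner.cleanup")  # hypothetical logger name


class CleanupLog:
    """Record cleanup outcomes and remember what could not be destroyed."""

    def __init__(self):
        self.leftovers = []  # (resource, identifier) pairs left behind

    def record(self, spawner, resource, identifier, error=None):
        """Log one cleanup attempt; return True if it succeeded."""
        if error is None:
            LOG.debug("%s destroyed %s with ID %s.",
                      spawner, resource, identifier)
            return True
        LOG.error("%s failed to destroy %s with ID %s. Reported error: %s",
                  spawner, resource, identifier, error)
        self.leftovers.append((resource, identifier))
        return False
```

The `leftovers` list is what could later be presented to the user as the
resources whose destruction could not be guaranteed.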

> Future
> ======
> 
> In the future, the resource management could be expanded such that:
> 
> a. Different resources are accounted for (for example, storage)
> b. Support for "mix-and-match" of spawners and resources, such as
>     being able to manage a "storage" resource on both the "process"
>     and "remote" spawners.
> 
> There's no attempt at this point to determine the effort and
> feasibility of any of these.
> 

I hope this helps,
Plamen
