[Avocado-devel] RFC: Multi tests (previously multi-host test) [v2]

Lukáš Doktor ldoktor at redhat.com
Thu Mar 31 15:55:26 UTC 2016


Hello guys,

This is a v2 of the multi tests RFC, previously known as multi-host RFC.

Changes:

     v2: Rewritten from scratch
     v2: Added examples for the demonstration to avoid confusion
     v2: Removed the mht format (which was there to demonstrate manual 
execution)
     v2: Added 2 solutions for multi-tests
     v2: Described ways to support synchronization

The problem
===========

A user wants to run netperf on 2 machines, which requires the following 
manual steps:

     machine1: netserver -D
     machine1: # Wait till netserver is initialized
     machine2: netperf -H $machine1 -l 60
     machine2: # Wait till it finishes and store/report the results
     machine1: # stop the netserver and report possible failures

Other use-cases might be:

1. triggering several un-related tests in parallel
2. triggering several tests in parallel with synchronization
3. spreading several tests into multiple machines
4. triggering various different tests on multiple machines

The problem is not only about running tests on multiple machines, but 
generally about ways to trigger tests or sets of tests in whatever way 
the user needs.


Running the tests
=================

In v1 we rejected the idea of running custom code in the background from 
inside the tests, as it requires implementing the remote-test handling 
again and again, and we decided that executing full tests or sets of 
tests with support for remote synchronization/data exchange is the way 
to go. There are two or three bigger categories, so let's describe each 
of them so we can pick the most suitable one (at this moment).

For demonstration purposes I'll be writing a very simple multi-host test 
which triggers "/usr/bin/wget example.org" on 3 machines to simulate a 
very basic stress test.

Synchronization and parametrization will not be covered in this section: 
synchronization is described in the next chapter and is the same for all 
the solutions, and parametrization is a standard avocado feature.


Internal API
------------

One of the ways to allow people to trigger tests and sets of tests 
(jobs) from inside a test is to pick the minimal required set of 
internal API which handles remote job execution, make it public (and 
supported) and refactor it so it can realistically be called from inside 
a test.

Example (pseudocode)

     class WgetExample(avocado.Test):
         def test(self):
             machines = ["127.0.0.1", "192.168.122.2", "192.168.122.3"]
             jobs = []
             for i, machine in enumerate(machines):
                 jobs.append(avocado.Job(urls=["/usr/bin/wget example.org"],
                                         remote_machine=machine,
                                         logdir=os.path.join(self.logdir,
                                                             str(i))))
             for job in jobs:
                 job.run_background()
             errors = []
             for i, job in enumerate(jobs):
                 result = job.wait()     # returns json results
                 if result["pass"] != result["total"]:
                     errors.append("Tests on worker %s (%s) failed"
                                   % (i, machines[i]))
             if errors:
                 self.fail("Some workers failed:\n%s" % "\n".join(errors))

Alternatively we could even require the user to define the whole workflow:

1. discover test (loader)
2. add params/variants
3. setup remote execution (RemoteTestRunner)
4. setup results (RemoteResults)

which would require even more internal API to be turned public.
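
Purely for illustration, such a user-defined workflow might look roughly 
like this (every name and signature below is an assumption about what a 
public version of today's internal classes could look like, not their 
real interface):

     # 1. discover the test (loader)
     suite = loader.discover("/usr/bin/wget example.org")
     # 2. add params/variants
     suite = multiplexer.multiplex(suite, params)
     # 3. set up remote execution (RemoteTestRunner)
     runner = RemoteTestRunner(hostname="192.168.122.2")
     # 4. set up results (RemoteResults)
     results = RemoteResults(os.path.join(self.logdir, "worker1"))
     runner.run_suite(suite, results)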

+ easy to develop, we simply identify the set of classes and make them 
public
- hard to maintain as the API would have to stay stable, therefore 
realistically it requires a big cleanup before taking this step

Multi-tests API
---------------

To avoid the need to make the API which drives the testing public, we 
can instead introduce an API to trigger jobs/sets of jobs. It would be a 
sort of proxy between the internal API, which can and does change more 
often, and the public multi-test API, which would be supported and kept 
stable.

I see two basic backends supporting this API, but they both share the 
same public API.

Example (pseudocode)

     class WgetExample(avocado.MultiTest):
         def test(self):
             for machine in ["127.0.0.1", "192.168.122.2", "192.168.122.3"]:
                 self.add_worker(machine)
             for worker in self.workers:
                 worker.add_test("/usr/bin/wget example.org")
             #self.start()
             #results = self.wait()
             #if results["failures"]:
             #    self.fail(results["failures"])
             self.run()  # does the above three steps

The basic set of API should contain:

* MultiTest.workers - list of defined workers
* MultiTest.add_worker(machine="localhost") - add a new sub-job (worker)
* MultiTest.run(timeout=None) - start all workers, wait for the results 
and fail the current test if any of the workers reported a failure
* MultiTest.start() - start testing in the background (allowing this 
test to monitor or interact with the workers)
* MultiTest.wait(timeout=None) - wait till all workers finish
* Worker.add_test(url) - add a test to be executed
* Worker.add_tests(urls) - add a list of tests to be executed
* Worker.abort() - abort the execution
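
A minimal sketch of how this surface might be stubbed out (only the 
names from the list above come from the proposal; every other detail, 
such as storing tests as (url, params) pairs, is an assumption):

     class Worker(object):
         def __init__(self, machine="localhost"):
             self.machine = machine
             self.extra_params = []     # extra backend args (assumption)
             self.tests = []            # list of (url, params) pairs
         def add_test(self, url, params=None):
             self.tests.append((url, params))
         def add_tests(self, urls):
             for url in urls:
                 self.add_test(url)
         def abort(self):
             raise NotImplementedError  # backend-specific

     class MultiTest(avocado.Test):
         def __init__(self, *args, **kwargs):
             super(MultiTest, self).__init__(*args, **kwargs)
             self.workers = []
         def add_worker(self, machine="localhost"):
             worker = Worker(machine)
             self.workers.append(worker)
             return worker
         def start(self):               # implemented by the chosen backend
             raise NotImplementedError
         def wait(self, timeout=None):  # implemented by the chosen backend
             raise NotImplementedError
         def run(self, timeout=None):
             self.start()
             results = self.wait(timeout)
             if results["failures"]:
                 self.fail(results["failures"])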

I didn't want to talk about params, but they are essential for 
multi-tests. I think we should allow passing default params for all tests:

* Worker.params(params) - where params should be in any format supported 
by the Test class (currently AvocadoParams or dict)

or per test during "add_test":

* Worker.add_test(url, params=None) - again, params should be in any 
supported format (currently only possible via the internal API, but even 
without multi-tests I'm fighting for such support on the command line)

Another option could be to allow supplying all "test" arguments using 
**kwargs inside "add_test":

* Worker.add_test(url, **kwargs) -> discover the url and override the 
test arguments if provided (currently only possible via the internal 
API, probably never possible on the command line; the arguments are 
methodName, name, params, base_logdir, tag, job and runner_queue, and I 
don't see a value in overriding any of them but the params)
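
Put together, parametrizing a worker could look like this (pseudocode of 
the proposed API; the exact keyword names are assumptions):

     worker = self.add_worker("192.168.122.2")
     worker.params({"timeout": 60})            # defaults for all tests of this worker
     worker.add_test("/usr/bin/wget example.org",
                     params={"timeout": 120})  # per-test params via add_test
     worker.add_test("/usr/bin/wget example.org",
                     tag="retry")              # **kwargs variant overriding a test argument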


API backed by internal API
~~~~~~~~~~~~~~~~~~~~~~~~~~

This would implement the multi-test API using the internal API (from 
avocado.core).

+ runs native python
+ easy interaction and development
+ easily extensible by either using the internal API (and risking 
changes) or by inheriting and extending the features
- lots of internal API would be involved, thus with almost every change 
of the internal API we'd have to adjust this code to keep MultiTest working
- fabric/paramiko is not thread/parallel-process safe and fails badly, 
so first we'd have to rewrite our remote execution code (use autotest's 
worker, or aexpect+ssh)


API backed by cmdline
~~~~~~~~~~~~~~~~~~~~~

This would implement the multi-test API by translating it into "avocado 
run" commands during "self.start()".

+ easy to debug as users are used to the "avocado run" syntax and its 
issues
+ allows a manual mode where users trigger the "avocado run" commands 
themselves
+ cmdline args are part of the public API so they should stay stable
+ no issues with fabric/paramiko as each process is separate
+ even more easily extensible as one just needs to implement the feature 
for "avocado run" and can then use it as extra_params in the worker, or 
send a PR to support it in the stable environment
- only features available on the cmdline can be supported (currently not 
limiting)
- relies on stdout parsing (but avocado supports machine readable output)
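
A rough sketch of what "self.start()" and "self.wait()" might do in this 
backend, building on the stub sketched earlier (the "--remote-hostname" 
and "--json" options are assumed to match the existing "avocado run" 
options; everything else is illustrative):

     import json
     import subprocess

     class CmdlineMultiTest(MultiTest):    # hypothetical cmdline-backed variant
         def start(self):
             self._procs = []
             for worker in self.workers:
                 cmd = ["avocado", "run", "--json", "-",
                        "--remote-hostname", worker.machine]
                 cmd.extend(worker.extra_params)   # pass-through worker args
                 for url, _params in worker.tests:
                     cmd.append(url)               # e.g. "/usr/bin/wget example.org"
                 self._procs.append(subprocess.Popen(cmd,
                                                     stdout=subprocess.PIPE))

         def wait(self, timeout=None):             # timeout handling omitted
             results = {"failures": []}
             for i, proc in enumerate(self._procs):
                 out, _ = proc.communicate()       # machine readable (json) output
                 result = json.loads(out)
                 if result["pass"] != result["total"]:
                     results["failures"].append("worker %s failed" % i)
             return results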


Synchronization
===============

Some tests do not need any synchronization, users just need to run them. 
But some multi-tests need to be synchronized or need to exchange data. 
For synchronization, "barriers" are usually used, where a barrier 
requires a "name" and a "number of clients". A worker requests entry 
into the barrier-guarded section and is blocked until "number of 
clients" workers are waiting for it (or the timeout is reached).

To do so the test needs an IP address+port where the synchronization 
server is listening. We can start this server from the multi-test and 
only support it this way:

     self.sync_server.start(addr=None, port=None)  # start listening
     self.sync_server.stop()    # stop listening
     self.sync_server.details   # contact information to be used by workers
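
Inside a multi-test this could be used roughly as follows (pseudocode; 
handing the contact information to the workers via a "sync" param is an 
assumption):

     self.sync_server.start()        # pick a free addr:port by default
     for worker in self.workers:
         worker.set_params({"sync": self.sync_server.details})
     self.run()                      # workers connect back for barriers/data
     self.sync_server.stop()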

Alternatively we might even support this on the command line to allow 
manual execution:

     --sync-server [addr[:port]] - listen on addr:port (pick one by default)
     --sync addr:port - when barrier/data exchange is used, use 
addr:port to contact sync server.

The cmdline arguments would allow manual executions, for example for 
testing purposes or execution inside custom build systems (jenkins, 
beaker, ...) without the multi-test support.

The result is the same: avocado listens on some port and the spawned 
workers connect to this port, identify themselves and ask for 
barriers/data exchange, with support for re-connection. To implement 
this we have several possibilities:

Standard multiprocess API
-------------------------

Python's standard multiprocessing library contains synchronization over 
TCP. The only problem is that "barriers" were introduced in python3, so 
we'd have to backport them, and it does not fit all our needs so we'd 
have to tweak it a bit.
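
For illustration, a sketch of how the standard library could already 
share a barrier over TCP (the address, port and authkey are arbitrary 
examples; threading.Barrier is python3-only, which is exactly the 
backport problem mentioned above):

     from multiprocessing.managers import BaseManager
     from threading import Barrier     # python3; would need a py2 backport

     # --- server side (e.g. started by the multi-test) ---
     barriers = {"setup": Barrier(2)}  # barrier name -> 2 expected clients

     class SyncManager(BaseManager):
         pass

     SyncManager.register("get_barrier", callable=barriers.get)
     server = SyncManager(address=("", 6547), authkey=b"avocado").get_server()
     server.serve_forever()

     # --- worker side ---
     class WorkerManager(BaseManager):
         pass

     WorkerManager.register("get_barrier")
     manager = WorkerManager(address=("192.168.122.1", 6547),
                             authkey=b"avocado")
     manager.connect()
     manager.get_barrier("setup").wait(timeout=60)   # blocks until 2 arrive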


Autotest's syncdata
-------------------

Python 2.4 friendly, supports barriers and data synchronization. On the 
other hand it's quite hackish and full of shortcuts.


Custom code
-----------

We can take inspiration from the above and create a simple 
human-readable protocol (easy to debug or interact with manually) to 
support barriers and data exchange via pickling. IMO that would be 
easier to maintain than backporting and adjusting the multiprocessing 
library or fixing the autotest syncdata. A proof-of-concept can be found 
here:

     https://github.com/avocado-framework/avocado/pull/1019

It modifies the "passtest" so it only passes when it is executed by 2 
workers at the same time. It does not support multi-tests yet, so one 
has to run "avocado run passtest" twice against the same sync server 
(once with --sync-server and once with --sync).


Conclusion
==========

Given the reasons above I like the idea of "API backed by cmdline", as 
all cmdline options are stable, the output is machine readable and known 
to users, and therefore easy to debug manually.

Synchronization requires the "--sync" and "--sync-server" arguments to 
be present, though the user doesn't necessarily have to use them when 
using the multi-test (the multi-test can start the server if not already 
started and add "--sync" for each worker if not provided).

The netperf example from introduction would look like this:

The client tests are ordinary "avocado.Test" tests that can even be 
executed manually without any synchronization (by providing no_clients=1):

     class NetServer(avocado.Test):
         def setUp(self):
             process.run("netserver")
             self.barrier("setup", self.params.get("no_clients"))
         def test(self):
             pass
         def tearDown(self):
             self.barrier("finished", self.params.get("no_clients"))
             process.run("killall netserver")

     class NetPerf(avocado.Test):
         def setUp(self):
             self.barrier("setup", self.params.get("no_clients"))
         def test(self):
             process.run("netperf -H %s -l 60"
                         % self.params.get("server_ip"))
             self.barrier("finished", self.params.get("no_clients"))

One would be able to run this manually (or from build systems) using:

     avocado run NetServer --sync-server $IP:12345 &
     avocado run NetPerf --sync $IP:12345 &

(one would have to hardcode or provide the "no_clients" and "server_ip" 
params on the cmdline)

and the NetPerf would wait till the NetServer is initialized, then it 
would run the test while the NetServer waits till it finishes. For some 
users this is sufficient, but let's add the multi-test to get a single 
set of results (pseudo code):

     class MultiNetperf(avocado.MultiTest):
         def test(self):
             machines = self.params.get("machines")
             assert len(machines) > 1
             for machine in machines:
                 self.add_worker(machine, sync=True)     # enable sync server
             self.workers[0].add_test("NetServer")
             self.workers[0].set_params({"no_clients": len(self.workers)})
             for worker in self.workers[1:]:
                 worker.add_test("NetPerf")
                 worker.set_params({"no_clients": len(self.workers),
                                    "server_ip": machines[0]})
             self.run()

Running:

     avocado run MultiNetperf

would run a single test which, based on the params given to it, would 
run on several machines, using the first machine as the server and the 
rest as clients, and all of them would start at the same time.

It'd produce a single set of results with one test id and the following 
structure (example):


     $ tree $RESULTDIR
       └── test-results
           └── simple.mht
               ├── job.log
                   ...
               ├── 1
               │   └── job.log
                       ...
               └── 2
                   └── job.log
                       ...

where 1 and 2 are the results of worker 1 and worker 2. For all of the 
proposed solutions these would give the user the standard results as 
they know them from normal avocado executions, each with a unique id, 
which should help with analyzing and debugging the results.



