[Avocado-devel] RFC: Multi-host tests

Lukáš Doktor ldoktor at redhat.com
Sat Mar 26 19:01:15 UTC 2016


Hello guys,

Let's open a discussion regarding the multi-host tests for avocado.

The problem
===========

A user wants to run netperf on 2 machines. To do it manually he does:

     machine1: netserver -D
     machine1: # Wait till netserver is initialized
     machine2: netperf -H $machine1 -l 60
     machine2: # Wait till it finishes and store the results
     machine1: # stop the netserver and report possible failures

Now, how do we support this in avocado, ideally as custom tests, ideally 
even surviving broken connections/reboots?


Super tests
===========

We could do nothing and leave everything up to the user. He is free to 
write code like:

     ...
     machine1 = aexpect.ShellSession("ssh $machine1")
     machine2 = aexpect.ShellSession("ssh $machine2")
     machine1.sendline("netserver -D")
     # wait till the netserver starts
     machine1.read_until_any_line_matches(["Starting netserver"], 60)
     output = machine2.cmd_output("netperf -H $machine1 -l $duration")
     # interrupt the netserver
     machine1.sendline("\03")
     # verify netserver finished
     machine1.cmd("true")
     ...

The problem is that it requires an active connection and the user needs 
to handle the results manually.


Triggered simple tests
======================

Alternatively we can say each machine/worker is nothing but yet another 
test, which occasionally needs synchronization or data exchange. The 
same example would look like this:

machine1.py:

    process.run("netserver")
    barrier("server-started", 2)
    barrier("test-finished", 2)
    process.run("killall netserver")

machine2.py:

     barrier("server-started", 2)
     self.log.debug(process.run("netperf -H %s -l 60"
                                % params.get("server_ip")))
     barrier("test-finished", 2)

where "barrier(name, no_clients)" is a framework function which makes 
the process wait till the specified number of processes are waiting for 
the same barrier.
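
For illustration, here is a minimal sketch of how such a barrier could 
be implemented on top of a plain TCP "sync server". The wire format, 
port and function names are made up for this example and do not reflect 
the actual avocado implementation:

     import socket

     def barrier(name, no_clients, server=("127.0.0.1", 6547),
                 timeout=120):
         """Block until `no_clients` processes entered barrier `name`."""
         sock = socket.create_connection(server, timeout)
         try:
             sock.sendall("%s %d\n" % (name, no_clients))
             # the server answers "go" once the expected number of
             # clients arrived at this barrier
             if not sock.recv(16).startswith("go"):
                 raise RuntimeError("barrier '%s' broken" % name)
         finally:
             sock.close()

     def run_sync_server(bind=("0.0.0.0", 6547)):
         """Toy sync server: collect clients per barrier name and
         release them all at once when the expected count arrives."""
         srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
         srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
         srv.bind(bind)
         srv.listen(16)
         waiting = {}    # barrier name -> waiting client sockets
         while True:
             client, _ = srv.accept()
             name, count = client.recv(128).split()
             waiting.setdefault(name, []).append(client)
             if len(waiting[name]) >= int(count):
                 for peer in waiting.pop(name):
                     peer.sendall("go\n")
                     peer.close()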

The barrier needs to know which server to use for communication so we 
can either create a new service, or simply use one of the executions as 
"server" and make both processes use it for data exchange. So to run the 
above tests the user would have to execute 2 avocado commands:

     avocado run machine1.py --sync-server machine1:6547
     avocado run machine2.py --remote-hostname machine2 \
         --mux-inject server_ip:machine1 --sync machine1:6547

where:
     --sync-server tells avocado to listen on the machine1 address, port 6547
     --remote-hostname tells avocado to run the test remotely on machine2
     --mux-inject adds the "server_ip" into the params
     --sync tells the second avocado to connect to machine1:6547 for 
synchronization

Running those two tests has one immediate benefit compared to the 
previous solution: it gathers the results independently and allows one 
to re-use simple tests. For example you can create a 3rd test, which 
uses different params for netperf, run it on "machine2" and keep the 
same script for "machine1". Or run 2 netperf senders at the same time. 
This would require libraries and more custom code with the "Super test" 
approach.

There are additional benefits to this solution. When we introduce the 
locking API, tests running on a remote machine will actually be executed 
directly in avocado, therefore the locking API will work for them, 
avoiding problems with multiple tests using the same shared resource.

Another future benefit would be surviving system reboots/lost 
connections once we introduce this support for individual tests. The 
way it'd work is that the user triggers the jobs, the master remembers 
the test ids and polls for results until they finish/timeout.

All of this we get for free thanks to re-using the existing 
infrastructure (or the future infrastructure), so I believe this is the 
right way to go, and in this RFC I'm describing the details of this 
approach.


Triggering the jobs
-------------------

The previous example required the user to run avocado 2 times (once per 
machine), sharing the same sync server. Additionally it resulted in 2 
separate sets of results. Let's try to eliminate this problem.


Basic tests
~~~~~~~~~~~

For basic setups, we can come up with a very simple format to describe 
which tests should be triggered, and avocado should take care of 
executing them. The way I have in mind is to simply accept a list of 
"avocado run" commands:

simple_multi_host.mht:

     machine1.py
     machine2.py --remote-hostname machine2 --mux-inject server_ip:machine1

Running this test:

     avocado run simple_multi_host.mht --sync-server 0.0.0.0

Avocado would pick a free port and start the sync server on it. Then it 
would prepend "avocado run" and append "--sync $sync-server 
--job-results-dir $this-job-results" to each line in 
"simple_multi_host.mht" and run them in parallel. Afterwards it'd wait 
till both processes finish and report pass/fail depending on their 
status.
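
To make the expansion concrete, here is a rough sketch of what such a 
runner could do (the function name is hypothetical; "--sync" and 
"--job-results-dir" are the options mentioned above):

     import subprocess

     def run_mht(mht_path, sync_server, job_results_dir):
         """Expand each line of the .mht file into a full "avocado run"
         command, run them all in parallel and report overall status."""
         procs = []
         for line in open(mht_path):
             line = line.strip()
             if not line or line.startswith("#"):
                 continue
             cmd = ("avocado run %s --sync %s --job-results-dir %s"
                    % (line, sync_server, job_results_dir))
             procs.append((cmd, subprocess.Popen(cmd, shell=True)))
         # wait for all workers and compute the overall status
         failed = [cmd for cmd, proc in procs if proc.wait() != 0]
         return "FAIL" if failed else "PASS"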

This way users get overall results as well as individual ones, plus a 
simple way to define static setups.


Contrib scripts
~~~~~~~~~~~~~~~

The beauty of executing simple lines is that users might create contrib 
scripts to generate the "mht" files to get even better flexibility.
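
For example, a hypothetical contrib script could generate the "mht" 
file for one server and an arbitrary number of clients (the script and 
test names below are made up):

     #!/usr/bin/env python
     # usage: generate_mht.py <server> <client1> [<client2> ...]
     import sys

     def main(server, clients):
         lines = ["server.py"]
         for client in clients:
             lines.append("client.py --remote-hostname %s "
                          "--mux-inject server_ip:%s" % (client, server))
         sys.stdout.write("\n".join(lines) + "\n")

     if __name__ == "__main__":
         main(sys.argv[1], sys.argv[2:])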


Advanced tests
~~~~~~~~~~~~~~

The above might still not be flexible enough. But the system underneath 
is very simple and flexible. So how about creating instrumented tests, 
which generate the setup? The same simple example as before:

multi_host.py

     runners = ["machine1.py"]
     runners.append("machine2.py --remote-hostname machine2 --mux-inject 
server_ip:machine1")
     self.execute(runners)

where the "self.execute(tests)" would take the list and does the same as 
for basic tests. Optionally it could return the json results per each 
tests so the test itself can react and modify the results.

The above was just a direct translation of the previous example, but to 
demonstrate the real power of this let's try a PingPong multi-host test:

     class PingPong(MultiHostTest):
         def test(self):
             hosts = self.params.get("hosts", default="").split(";")
             assert len(hosts) >= 2
             runners = ["ping_pong --remote-hostname %s" % _
                             for _ in hosts]
             # Start creating multiplex tree interactively
             mux = MuxVariants("variants")
             # add /run/variants/ping with {} values
             mux.add("ping", {"url": hosts[1], "direction": "ping",
                              "barrier": "ping1"})
             # add /run/variants/pong with {} values
             mux.add("pong", {"url": hosts[-1], "direction": "pong",
                              "barrier": "ping%s" % len(hosts) + 1})
             # Append "--mux-inject mux-tree..." to the first command
             runners[0] += "--mux-inject %s" % mux.dump()
             for i in xrange(1, len(hosts)):
                 mux = MuxVariants("variants")
                 next_host = hosts[i+1 % len(hosts)]
                 prev_host = hosts[i-1]
                 mux.add("pong", {"url": prev_host, "direction": "pong",
                                  "barrier": "ping%s" % i})
                 mux.add("ping", {"url": next_host, "direction": "ping",
                                  "barrier": "ping%s" % i+1})
                 runners[i] += "--mux-inject %s" % mux.dump()
             # Now do the same magic as in basic multihost test on
             # the dynamically created scenario
             self.execute(runners)

The `self.execute` generates the "simple test"-like list of "avocado 
run" commands to be executed. But the test writer can define some 
additional behavior. In this example it generates a 
machine1->machine2->...->machine1 chain of ping-pong tests.
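
The `MuxVariants` helper used above does not exist yet; a minimal 
sketch of what it could look like follows (the serialization in 
`dump()` is only a guess and would have to match whatever 
"--mux-inject" really accepts):

     class MuxVariants(object):

         """Collect variant definitions and serialize them for the
         command line."""

         def __init__(self, name):
             self.name = name
             self.variants = []    # list of (variant name, params dict)

         def add(self, variant, params):
             self.variants.append((variant, params))

         def dump(self):
             # one "path:key:value" item per parameter
             items = []
             for variant, params in self.variants:
                 for key, value in params.items():
                     items.append("/run/%s/%s:%s:%s"
                                  % (self.name, variant, key, value))
             return " ".join(items)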

When running "avocado run pingpong --mux-inject hosts:machine1;machine2" 
this generates 2 jobs, both running just a single "ping_pong" test with 
2 multiplex variants:

machine1:

     variants: !mux
         ping:
             url: machine2
             direction: ping
             barrier: ping1
         pong:
             url: machine2
             direction: pong
             barrier: ping2
machine2:

     variants: !mux
         pong:
             url: machine1
             direction: pong
             barrier: ping1
         ping:
             url: machine1
             direction: ping
             barrier: ping2

The first multiplex tree for three machines looks like this:

     variants: !mux
         ping:
             url: machine2
             direction: ping
             barrier: ping1
         pong:
             url: machine3
             direction: pong
             barrier: ping3

Btw I simplified the format for the sake of this RFC. I think instead 
of generating the strings we should support an API to specify the test, 
multiplexer, options... and then turn them into jobs executed in 
parallel (usually remotely). But these are just details to be solved if 
we decide to work on it.


Results and the UI
==================

The idea is that the user is free to run the jobs separately, or to 
define the setup in a "wrapper" job. The benefits of using the "wrapper" 
job are the results in one place and the `--sync` handling.

The difference is that running them individually looks like this:

     1 | avocado run ping_pong --mux-inject url:192.168.1.58:6001 
--sync-server
     1 | JOB ID     : 6057f4ea2c99c43670fd7d362eaab6801fa06a77
     1 | JOB LOG    : 
/home/medic/avocado/job-results/job-2016-01-22T05.33-6057f4e/job.log
     1 | SYNC       : 0.0.0.0:6001
     1 | TESTS      : 1
     1 |  (1/1) ping_pong: \
     2 | avocado run ping_pong --mux-inject :url::6001 direction:pong 
--sync 192.168.1.1:6001 --remote-host 192.168.1.1
     2 | JOB ID     : 6057f4ea2c99c43670fd7d362eaab6801fa06a77
     2 | JOB LOG    : 
/home/medic/avocado/job-results/job-2016-01-22T05.33-6057f4e/job.log
     2 | TESTS      : 1
     2 |  (1/1) ping_pong: PASS
     1 |  (1/1) ping_pong: PASS

and you have 2 results directories and 2 statuses. By running them 
wrapped inside the simple.mht test you get:

     avocado run simple.mht --sync-server 192.168.122.1
     JOB ID     : 6057f4ea2c99c43670fd7d362eaab6801fa06a77
     JOB LOG    : 
/home/medic/avocado/job-results/job-2016-01-22T05.33-6057f4e/job.log
     TESTS      : 1
      (1/1) simple.mht: PASS
     RESULTS    : PASS 1 | ERROR 0 | FAIL 0 | SKIP 0 | WARN 0 | INTERRUPT 0
     TIME       : 0.00 s

And single results:

     $ tree $RESULTDIR

     └── test-results
         └── simple.mht
             ├── job.log
                 ...
             ├── 1
             │   └── job.log
                     ...
             └── 2
                 └── job.log
                     ...

     tail -f job.log:
     running avocado run ping pong ping pong
     running avocado run pong ping pong ping --remote-hostname 
192.168.122.53
     waiting for processes to finish...
     PASS avocado run ping pong ping pong
     FAIL avocado run pong ping pong ping --remote-hostname 192.168.122.53
     this job FAILED


Demonstration
=============

While considering the design I developed a WIP example. You can find it 
here:

     https://github.com/avocado-framework/avocado/pull/1019

It demonstrates the `Triggered simple tests` chapter without the 
wrapping tests. Hopefully it helps you understand what I had in mind. It 
contains a modified "examples/tests/passtest.py" which requires 2 
concurrent executions (for example if you want to test your server and 
run multiple concurrent "wget" connections). Feel free to play with it, 
change the number of connections, set different barriers, combine 
multiple different tests...


Autotest
========

Avocado was developed by people familiar with Autotest, so let's just 
mention here that this method is not all that different from the 
Autotest one. The way Autotest supports parallel execution is that it 
lets users create the "control" files inside the multi-host control 
file and then run those in parallel. For synchronization it contains a 
master->slave barrier mechanism, extended with SyncData to send pickled 
data to all registered runners.

I considered whether we should re-use the code, but:

1. we do not support control files, so I only took inspiration from the 
way the params are passed to the remote instances
2. the barriers and SyncData are quite hackish, master->slave 
communication. I think the described (and demonstrated) approach does 
the same in a less hackish way and is easier to extend

Using this RFC we'd be able to run autotest multi-host tests, but it'd 
require rewriting the control files to "mht" (or contrib) files. It'd 
probably even be possible to write a contrib script to run the control 
file and generate the "mht" file which would run the autotest test. 
Anyway the good thing for us is that this does not affect "avocado-vt", 
because all of the "avocado-vt" multi-host tests are using a single 
"control" file, which only prepares the params for simple avocado-vt 
executions. The only necessary thing is a custom "tests.cfg" as by 
default it disallows multi-host tests (or we can modify the "tests.cfg" 
and include the filter inside the "avocado-vt" loader), but these are 
just details to be sorted out when we start running avocado-vt 
multi-host tests.

Conclusion
==========

Multi-host testing has been solved many times in history. Some 
frameworks hardcode the communication into the tests, but most 
frameworks I have seen support triggering "normal/ordinary" tests and 
add some kind of barrier mechanism (either inside the code or between 
the tests) to synchronize the execution. I'm for flexibility and easy 
test sharing, and that is how I described it here.

Kind regards,
Lukáš



