[Avocado-devel] Multi Stream Test Support

Tue Apr 18 17:17:37 UTC 2017

Hello Cleber,

few clarifications (hopefully) before you come up with v2

...
>>>
>>>> Another option would be to allow `path` of the Avocado Test `params` and
>>>> the class would get the details from the provided params+path. The cons
>>>> would be it'd be hard to use anything but `params` for that:
>>>>
>>>>     streams:
>>>>         first:
>>>>             type: localhost
>>>>         second:
>>>>             hostname: 192.168.122.10
>>>>             user: test
>>>>             password: 123456
>>>>         third:
>>>>             type: docker
>>>>             create: yes
>>>>         ...
>>>>
>>>
>>> I like this structure better.
>>>
>>>>     get_implementation(self.params, "/streams/first/*") => would use
>>>> `params.get(..., "/streams/first/*")` to get all necessary parameters
>>>>
>>>
>>> The idea of `get_implementation` is pretty simple: get an Execution
>>> Stream implementation by its name.  What you're describing here (with
>>> regards to getting the "right" parameters and instantiating/activating
>>> the execution stream easily/automatically) is exactly what I propose
>>> later.
>>>
>> This is not really about the automatic creation. As for today it's
>> impossible to map tests-to-variants and I saw some people are actually
>> writing tests which depends on certain path so they can supply different
>> params to different tests:
>>
>
> First of all, the design shouldn't be limited by current limitations
> that can be addressed (although it's a very good thing to mention and
> keep track of them).
>
> Furthermore, I'd say the problem here is not mapping tests to variants
> at all.  We're using YAML (yaml_to_mux) simply as a didactic way of
> setting variables at specific paths, and that is the essence here.
>
> Maybe I'm missing something, but I don't see a reason why the same
> "*/avocado/stream/*" parameters couldn't be available to all tests as
> "Default params":
>
> http://avocado-framework.readthedocs.io/en/48.0/TestParameters.html#default-params
>
> And still be overridden by specific tests.
>
> If you agree that's achievable but currently limited or non functional
> because of addressable limitations, then I believe we're fine, and just
> have another work item to deliver multi stream test support.
>
Basically what I ask here is whether you considered the possibility to 
define the `streams` not using the default params path, but from a 
different path:

so instead of just:

    avocado:
       streams:
        - 1:
            type: remote
            host: foo.example.com

I'd like to be able to pass:

    First:
       streams:
        - 1:
            type: remote
            host: foo.example.com
    Second:
       streams:
        - 1:
            type: remote
            host: foo.example.com
    avocado:
       streams:
        - 1:
            type: remote
            host: foo.example.com

And in test to say this time please use `/Second/*` instead of 
`/avocado/*` to define the default streams, because I'm running multiple 
different tests in the single execution and I want to define different 
streams for each of them.

>>     boot:
>>         timeout: 10
>>     check:
>>         timeout: 1
>>     shutdown:
>>         timeout: 5
>>
>> and in boot.py:
>>
>>     timeout = self.params.get("timeout", "*/boot/*")
>>
>> in check.py:
>>
>>     timeout = self.params.get("timeout", "*/check/*")
>>
>> ...
>>
>>
>> So the comment above tried to describe how one would create descriptions
>> of streams on different places and be still able to get them in various
>> tests from different locations. Imagine in test1.py:
>>
>>     get_implementation(self.params, "*/test1/streams/first/*")
>>
>> in test2.py:
>>
>>     get_implementation(self.params, "*/test2/streams/server/*")
>>
>> and so on. The automatic way would be shared for all tests, while this
>> would allow the same but using different locations. Anyway this is
>> really painful and I'd prefer using the single-line definition, or the
>> list of tuples as described later...
>>
>>>>>           if klass is not None:
>>>>>               stream = klass(host=self.params.get("remote_hostname"),
>>>>>
>>>>> username=self.params.get("remote_username")
>>>>>
>>>>> password=self.params.get("remote_password"))
>>>>>               cmd = "ping -c 1 %s" %
>>>>> self.params.get("test_host_hostname")
>>>>>               stream.run(shell(cmd))
>>>> I do like the rest of the example (only the klass would be already
>>>> initialized by the `get_implementation`.
>>>>
>>>
>>> This conflicts with the idea of "utility library to be used by test
>>> developer that wants full control of the individual creation and use of
>>> the execution streams".  I mean, if some implementations are made
>>> available at the `avocado.utils` namespace, then they could even be
>>> referred to directly by module/name.
>> Yes, by
>> `get_implementation("docker://create=yes,image=fedora,remove=always") or
>> `get_implementation(host=host, username=username, ...)`, which returns
>> already initialized object of the detected type. Alternatively the user
>> would import directly the specific class and use
>> `DockerStream(image="fedora")`.
>>
>> Again, the better name for `get_implementation` in this sense would be
>> `get_stream`.
>>
>
> I think you missed the point about `get_implementation()`.  Again,
> `get_stream()` could be implemented on top of it.  And remember
> that this is the  "verbose, lengthy and boilerplate-full" method that we
> hope most users won't need or use.
>
> I'd really focus most of our energy on giving the majority of users (the
> majority of use cases) a very easy to use set of tools.
>
Sure, my aim is to provide very simple way for most cases while still 
allowing the hard-core users to use our libraries directly to be able to 
do everything. Anyway I agree `get_stream` could be implemented on top 
of `get_implementation` but right now I don't see much points in having 
`get_implementation` at all. Looking forward to the v2 to hopefully 
learn about it's benefits.

>>>
>>> Again, since we're talking about making whatever makes sense in the
>>> utility namespace, there could a function such as
>>> "get_stream_parameters(name)" that would return the parameters to an
>>> Execution Stream, that is:
>>>
>>>   stream_name = "server"
>>>   path = "/avocado/streams/%s/*" % stream_name
>>>   impl = self.params.get("type", path=path)
>>>   stream = get_implementation(impl)(**get_stream_parameters(path))
>>>
>>
>> Actually it should probably be `self.streams.get_stream()` as we do want
>> to register it in the `Streams` to get it cleaned automatically. This
>> also works for libraries as you'd first initialize `Streams` object and
>> then do the `get_stream()`, `close()` and so on directly on the
>> `Streams` object.
>>
>
> Can you explain a use case where you'd write code to manually create a
> stream using the object that is supposed to contain the ready to use
> streams?  If you need to manually create it, then it's probably a good
> idea for you to manually destroy it.  Of course we could add mechanisms
> to register streams created outside of `self.streams`, but I don't feel
> the urge to do so unless proven wrong.
>
> Anyway, this seems a corner case to me, and the most difficult and
> important task here is to try to get the "recommended" way right, so
> that most multi stream tests are really straightforward to read and write.
>
Sure, this was more-like a brainstorm of the possible scalability of 
this design...

>>>>>
>>>>> Please note that this is not the intended end result of this proposal,
>>>>> but
>>>>> a side effect of implementing it using different software layers.  Most
>>>>> users should favor the simplified (higher level) interface.
>>>>>
>>>>> Writing a Multi-Stream test
>>>>> ===========================
>>>>>
>>>>> As mentioned before, users have not yet been given tools **and
>>>>> guidelines** for writing multi-host (multi-stream in Avocado lingo)
>>>>> tests.  By setting a standard and supported way to use the available
>>>>> tools, we can certainly expect advanced multi-stream tests to become
>>>>> easier to write and then much more common, robust and better supported
>>>>> by Avocado itself.
>>>>>
>>>>> Mapping from parameters
>>>>> -----------------------
>>>>>
>>>>> The separation of stream definitions and test is a very important goal
>>>>> of this proposal.  Avocado already has a advanced parameter system, in
>>>> of this proposal.  Avocado already has __an__ advanced parameter
>>>> system, in
>>>>
>>>>> which a test received parameters from various sources.The most common
>>>>> way of passing parameters at this point is by means of YAML files, so
>>>>> these will be used as the example format.
>>>> Well this might be quite hard to understand, how about just saying:
>>>> "Avocado supports test parametrisation via Test Parameters system and
>>>> the most common way is to use a YAML file by using `yaml_to_mux`
>>>> plugin."
>>>>
>>>>>
>>>>> Parameters that match a predefined schema (based on paths and node
>>>>> names) will be by evaluated by a tests' ``streams`` instance
>>>>> (available as ``self.streams`` within a test).
>>>>>
>>>>> For instance, the following snippet of test code::
>>>>>
>>>>>   from avocado import Test
>>>>>
>>>>>   class MyTest(Test):
>>>>>       def test(self):
>>>>>           self.streams[1].run(python("import mylib; mylib.action()"))
>>>>>
>>>>> Together with the following YAML file fed as input to the parameter
>>>>> system::
>>>>>
>>>>>   avocado:
>>>>>      streams:
>>>>>       - 1:
>>>>>           type: remote
>>>>>           host: foo.example.com
>>>> This is currently not supported by our yaml parser as any dictionary is
>>>> mapped to multiplex structure and I'm not sure it'd be possible (in a
>>>> sane manner) to treat dict inside lists differently. Anyway as I
>>>> mentioned earlier we could use:
>>>>
>>>
>>> Oops, I may have used an incorrect syntax or idea (or both).
>>>
>>>>     avocado:
>>>>         streams:
>>>>             1: ssh://foo.example.com
>>>>
>>>> or:
>>>>
>>>>     avocado:
>>>>         streams:
>>>>             1:
>>>>                 type: remote
>>>>                 host: foo.example.com
>>>>
>>>
>>> This one looks good IMO.  I'm all in for more explicitly naming **when**
>>> you go to the lengths of defining them.
>>>
>> Yes, plus we do use the OrderedDicts so we can still let people use:
>>
>>     self.streams[0:4]
>>
>> together with:
>>
>>     self.streams["server"]
>>
>> just beware the:
>>
>>     self.streams["1"]
>>
>> is different then:
>>
>>     self.streams[1]
>>
>> and the node-names (therefor even the 1: in streams) becomes string in
>> Avocado, so it actually works well and in the example above the:
>>
>>     self.streams["1"] == self.streams[0]
>>
>
> Nice, looks like OrderedDicts, or an even "smarter" version, could be
> the basis or our slicing support.
>
>>>> or:
>>>>
>>>>     avocado:
>>>>         streams:
>>>>             - type: remote
>>>>               host: foo.example.com
>>>>
>>>> Another thing is I'd probably prefer names to ints so "1" or "server" or
>>>> "worker1" etc, which goes nicely with the first 2 examples. The last
>>>> example goes well with indexes, but it starts with 0 (which would be my
>>>> recommendation anyway if we decided to go with indexes).
>>>>
>>>
>>> I think I mentioned somewhere "if only integers"...  I actually share
>>> your fondness of names.  I did not say it explicitly, but I think we can
>>> support both, as the slicing examples make a lot of sense to me.
>>>
>>> But, the slicing examples can be expanded to its own dialect, such as
>>> supporting regexes.  For instance, `self.streams["client-\d+"]` makes a
>>> lot of sense IMO.
>>>
>> Yes, this makes sense and is quite simple to do...
>>
>>>>>
>>>>> Would result in the execution of ``import mylib; mylib.action()``
>>>>> in a Python interpreter on host ``foo.example.com``.
>>>>>
>>>>> If test environments are refered to on a test, but have not been
>>>>> defined
>>>> If test environments are __referred__ to on a test, but have not been
>>>> defined
>>>>
>>>
>>> OK, thanks.
>>>
>>>>> in the outlined schema, Avocado's ``streams`` attribute implementation
>>>>> can use a default Execution Stream implementation, such as a local
>>>>> process
>>>>> based one.  This default implementation can, of course, also be
>>>>> configured
>>>>> at the system and user level by means of configuration files, command
>>>>> line
>>>>> arguments and so on.
>>>>>
>>>>> Another possibility is an "execution stream strict mode", in which no
>>>>> default implementation would be used, but an error condition would be
>>>>> generated.  This may be useful on environments or tests that are
>>>>> really tied to their execution stream types.
>>>> I'd solve this by supporting `__len__` where `len(self.streams)` should
>>>> report number of defined streams.
>>>>
>>>> Note the number of defined streams changes based on how many streams are
>>>> defined __OR__ used by the test. So:
>>>>
>>>>     avocado:
>>>>         streams:
>>>>             - 0:
>>>>             - 1:
>>>>
>>>>
>>>>     len(self.streams)  => 2
>>>>     self.streams[5].run(cmd)
>>>>     len(self.streams)  => 6
>>>>
>>>> where the streams 2-5 are the default streams.
>>>>
>>>
>>> I have mixed feelings here.  In my understanding, all of the streams are
>>> created on demand, that is, when they're used.  A configuration that
>>> defines thousands of them will not cause an empty test (think of
>>> `passtest.py`) to initialize them.
>> Yes, that is the point. `__len__` reports the "defined", not necessarily
>> "initialized" streams.
>>
>> Adding streams dynamically is definitely must-have-feature and one note
>> the code above would actually only have the `streams[5]` initialized,
>> the rest would wait for the first usage.
>>
>
> OK, I'm fine with that.
>
>>>
>>> But, the meaning of `len(self.streams)`, that is, defined or
>>> initialized, is something we can further discuss later.
>>>
>> Sure, another possibility would be to support `streams.defined_streams`
>> but I think the `__len__` would make more sense. It's similar to a list:
>>
>>     streams = list(params.get("streams", "/avocado/*")
>>     len(streams) == 2
>>     streams.extend([2,3,4,5])
>>     len(streams) == 6
>>
>
> Sure, as long as we document that `len()` returns the defined streams,
> it makes perfect sense to me.
>
Currently defined streams (as I'd like to be able to register new ones 
for automatic postprocess. It should be simple while powerful)

>>>>>
>>>>> Intercommunication Test Example
>>>>> -------------------------------
>>>>>
>>>>> This is a simple example that exercises the most important aspects
>>>>> proposed here.  The use case is to check that different hosts can
>>>>> communicate among themselves.  To do that, we define two streams as
>>>>> parameters (using YAML here), backed by a "remote" implementation::
>>>>>
>>>>>   avocado:
>>>>>      streams:
>>>>>       - 1:
>>>>>           type: remote
>>>>>           host: foo.example.com
>>>>>       - 2:
>>>>>           type: remote
>>>>>           host: bar.example.com
>>>>>
>>>>> Then, the following Avocado Test code makes use of them::
>>>>>
>>>>>   from avocado import Test
>>>>>   from avocado.streams.code import shell
>>>>>
>>>>>   class InterCommunication(Test):
>>>>>       def test(self):
>>>>>           self.streams[1].run(shell("ping -c 1 %s" %
>>>>> self.streams[2].host))
>>>>>           self.streams[2].run(shell("ping -c 1 %s" %
>>>>> self.streams[1].host))
>>>>>           self.streams.wait()
>>>>>           self.assertTrue(self.streams.success)
>>>> Brainstorming here, how about letting `wait` raise exception when it
>>>> fails unless we use `wait(ignore_failure)`. The exception would contain
>>>> all the information so it'd be THE exception which failed the test?
>>>>
>>>
>>> Yep, this is a valid question.  I think the answer will depend on how
>>> much we want the **test** result to be bound to what happens on the
>>> streams.  Right now it's obvious that I decided to keep them pretty much
>>> separate.
>>>
>> Sure, the more I think about it I also share your vision.
>>
>
> Good!
>
>>>> As for the `streams.success`, I guess it'd be a property, which would go
>>>> through all streams results, and report `any(_.failure for _ in
>>>> self.streams)`, right?
>>>>
>>>
>>> Exactly.
>>>
>>>>>
>>>>> The ``streams`` attribute provide a aggregated interface for all the
>>>>> streams.
>>>>> Calling ``self.streams.wait()`` waits for all execution streams (and
>>>>> their
>>>>> block of code) to finish execution.
>>>>>
>>>>> Support for slicing, if execution streams names based on integers only
>>>>> could
>>>>> be added, allowing for writing tests such as::
>>>>>
>>>>>   avocado:
>>>>>      streams:
>>>>>       - 1:
>>>>>           type: remote
>>>>>           host: foo.example.com
>>>>>       - 2:
>>>>>           type: remote
>>>>>           host: bar.example.com
>>>>>       - 3:
>>>>>           type: remote
>>>>>           host: blackhat.example.com
>>>>>       - 4:
>>>>>           type: remote
>>>>>           host: pentest.example.com
>>>>>
>>>>>   from avocado import Test
>>>>>   from avocado.streams.code import shell
>>>>>
>>>>>   class InterCommunication(Test):
>>>>>       def test(self):
>>>>>           self.streams[1].run(shell("ping -c 1 %s" %
>>>>> self.streams[2].host))
>>>>>           self.streams[2].run(shell("ping -c 1 %s" %
>>>>> self.streams[1].host))
>>>>>           self.streams[3].run(shell("ping -c 1 %s" %
>>>>> self.streams[1].host))
>>>>>           self.streams[4].run(shell("ping -c 1 %s" %
>>>>> self.streams[1].host))
>>>>>           self.streams.wait()
>>>>>           self.assertTrue(self.streams[1:2].success)
>>>>>           self.assertFalse(self.streams[3:4].success)
>>>> As mentioned earlier I'd prefer names to indexes, anyway I see the
>>>> indexes useful as well. How about supporting a name or index?
>>>>
>>>
>>> Yep, also thought of that.
>>>
>>>> As for the slices, I'd prefer list-like slice to stream-like slices as
>>>> it'd be more natural to me to interact with a list of individual streams
>>>> rather than a Stream object with a limited subset of streams. Anyway
>>>> that's a matter of taste and I can definitely live with this as well.
>>>>
>>>
>>> See my previous comments.
>>>
>>>> Now about this example, it's really limited. Again you are hard-coding
>>>> the scenario and changing it is really complicated. I'd prefer something
>>>> like:
>>>>
>>>>     self.streams[0].run(server_cmd)
>>>>     self.streams[1:].run(contact_server_cmd)
>>>>     self.assertTrue(self.streams.success)
>>>>
>>>
>>> Real tests will probably (hopefully) use better (symbolic) names.  The
>>> goal here is to focus on the mechanisms, which on yours and on my
>>> version are identical.
>>>
>>>>>
>>>>> Support for synchronized execution also maps really well to the
>>>>> slicing example.  For instance, consider this::
>>>>>
>>>>>   from avocado import Test
>>>>>   from avocado.streams.code import shell
>>>>>
>>>>>   class InterCommunication(Test):
>>>>>       def test(self):
>>>>>           self.streams[1].run(shell("ping -c 60 %s" %
>>>>> self.streams[2].host)
>>>>>           self.streams[2].run(shell("ping -c 60 %s" %
>>>>> self.streams[1].host))
>>>>>           ddos = shell("ddos --target %s" self.streams[1].host)
>>>>>           self.streams[3:4].run(ddos, synchronized=True)
>>>>>           self.streams[1:2].wait()
>>>>>           self.assertTrue(self.streams.success)
>>>>>
>>>>> This instructs streams 1 and 2 to start connectivity checks as soon as
>>>>> they **individually** can, while, for a full DDOS effect, streams 3
>>>>> and 4 would start only when they are both ready to do so.
>>>> OK so this is about before-start-synchronisation. Well again, I'm not
>>>> much fond of boundling the streams so I'd prefer allowing to define the
>>>> workload (which returns when the workload is ready), trigger it (which
>>>> triggers it and reports immediately) and then wait for it. The
>>>> difference in usage is:
>>>>
>>>>
>>>>     self.streams[3].run(ddos, stopped=True)
>>>>     self.streams[4].run(ddos, stopped=True)
>>>>     self.streams[3].start()
>>>>     self.streams[4].start()
>>>>
>>>> The result is the same (unless you create processes per each stream and
>>>> in `self.streams[3:4]` you use signals to synchronize the execution) but
>>>> it allows greater flexibility like synchronizing other tasks then just
>>>> streams...
>>>>
>>>> Or we can create methods `establish(cmd)`, `start()` and `wait()` which
>>>> might better describe the actions.
>>>
>>> I think you missed my point here.  Streams #3 and #4, in my example,
>>> wait for *each other*.  Using a mechanism such as barriers.
>>>
>> Oh you mean that the `synchronized=True` means it'd make a `barrier`
>> mechanism available between those two streams and you'd be able to use
>> this barrier from the streams? That would be rather limiting as you
>> might want to create multiple barriers between several streams. I think
>> you should re-describe your synchronization here as well as the
>> structure changed a bit...
>>
>> In my RFC I basically passed the information about barriers by test
>> params, which is obviously not available here as the Code could be
>> anything. Anyway a barrier requires:
>>
>> 1. name
>
> Name on my example is implicit, as it can be created from the unique
> names of the stream themselves.
>
If you say the name is implicit than it means you are only talking about 
all-to-all single barrier. This is (usually) true for 2 streams, but 
with 3+ streams you usually need different named barriers as not all of 
the streams have to sync all of the time. Please take a deeper look at 
the example I posted and try to incorporate it in your v2.

>> 2. server
>
> Also implicit, as most use cases should be able to use the default
> barrier server provided by `self.streams`.  To be more precise, an also
> on-demand barrier server running alongside the avocado test process
> (what you refer to as as "main", later).
>
>> 3. number of clients
>>
>
> Sure, but I don't see how **code** run on streams have to care about
> this.  The idea is that they will only have their `run()` methods called
> by the respective execution streams when the barrier is lifted (say, all
> clients have checked in).
>
> The point here is that individual stream execution in a "slice" will be
> **synchronized** from the outside of the **block of code**.  How the
> synchronization is done, is a different matter.
>
>> When creating synchronized streams by a slice, you have all of these
>> available, but it might not be always like that and you might want to
>> define different groups of barriers, for example:
>>
>> ssh:
>>
>>     start_ssh()
>>     barrier("ssh_started", worker, 2)
>>     barrier("worker_finished", main, 3)
>>
>
> These 3 lies, IIUC, are "block of code" content, right?
Yes, this is a stream1 called "ssh"

>
>> http:
>>
>>     start_http()
>>     barrier("http_started", worker, 2)
>>     barrier("worker_finished", main, 3)
>
> Same here.
>
Yep, stream2 called "http"

>>
>> worker:
>>
>>     barrier("ssh_started", worker, 2)
>>     barrier("http_started", worker, 2)
>>     connect_to_both()
>>     barrier("worker_finished", main, 3)
>>
>
> Same here for these 4 lines, right?
>
Yes, stream3 called "worker".

Basically what is expected here to happen is that the worker waits for 
"ssh" to be started, then it waits for "http" to be started, then it 
uses them and after finishing it notifies all of the clients to let them 
finish.

The main difference between the first 2 barriers and the last one is 
that the first 2 barriers are independent on the other process while the 
last one is all-to-all.

To demonstrate the need for other than all-to-all barriers let me modify 
the example:

worker:

      barrier("ssh_started", worker, 2)
      connect_to_ssh_and_do_some_business()
      barrier("http_started", worker, 2)
      connect_to_both()
      barrier("worker_finished", main, 3)

Now imagine the "http" start takes a long time as it can consist of 
configuring a database, bootstraping the server etc. In this example you 
have 3 streams where concurrently "ssh" and "http" are starting. When 
"ssh" starts you immediately start doing some business there (as it also 
can be time-expensive task) and only when this finishes you check 
whether the "http" already started.

This particular case is still possible to write with all-to-all 
barriers, it just becomes sub-optimal:

worker:

      barrier()
      connect_to_ssh_and_do_some_business()
      connect_to_both()
      barrier()

As now the `connect_to_ssh_and_do_some_business` waits for "http" to be 
started even though it does not need it.

Anyway I'm looking forward to the v2, I'd be fine with implicit 
all-to-all barriers as a standard feature initialized automatically (if 
asked for) if we put the generic implementation in `avocado.utils` so 
users could use it freely in advanced use-cases like the one I explained 
here.

...

Thank you for the follow up,
Lukáš

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 502 bytes
Desc: OpenPGP digital signature
URL: <http://listman.redhat.com/archives/avocado-devel/attachments/20170418/19e5e34e/attachment.sig>