[Avocado-devel] Multiplexer and params retrieval

Mon Mar 2 21:30:38 UTC 2015

On Wed, Feb 25, 2015 at 03:54:34PM +0100, Lukáš Doktor wrote:
> Hello World,
> 
> I'm working on the https://trello.com/c/vKxylMFN/115-multiplexer-mechanism-for-tests-to-retrieve-variables
> card and having some design decision concerns. I'd like to ask you for your
> opinions, examples, concerns or demands :D
> 
> 
> Multiplexer
> ===========
> 
> Ultimate goal
> -------------
> 
> Possibility to specify test parameters, allowing simple upstream-downstream
> modification, supplying different files on command line or even injecting
> nodes/variables from cmdline to use different params for different runs.
> 
> Additionally we must avoid params name clashes.
> 
> Ideally this should be simple enough to be understood without any prior
> knowledge.
> 
> 
> Tree
> ----
> 
> Let's say tree is the way we want to go:
> 
> -*- by_something -*- A
>  |                \- B
>  \- by_else -*- A
>              \- B
> 
> Produces:
> 1. /by_something/A + /by_else/A
> 2. /by_something/A + /by_else/B
> 3. /by_something/B + /by_else/A
> 4. /by_something/B + /by_else/B
> 
> and we can ask for:
> params.get(key) => can generate errors when multiple matches
> params.get(key, path=/by_something)
> params.get(key, path=/by_else)

OK, more about this at the end.

> 
> Upstream-downstream
> -------------------
> 
> One way I come up with is to use default multiplex files location:
> 
> /etc/avocado/config.yaml    => avocado config (could be parsed from INI)
> $AVOCADO/global.yaml        => default variables (use /etc/avocado/
> instead?)
> $PLUGIN                     => plugins should be able to specify default mux
> too
> $TEST_DATA/multiplex.yaml   => per-test additional combinations
> 
> Not all of the files needs to be proceed. This should depend on the cmdline
> option, which either disables multiplexation, uses only tests options,
> (maybe) uses global options but multiplex only parts specified in test,
> multiplex everything.
> 
> We might also consider using `10_` prefix and automatically load all files
> matching the name (eg.: when files "10_multiplex.yaml" and
> "20_multiplex.yaml" exists, we should include booth of them. That way
> extending up-down is just adding an additional file, which shouldn't clash
> with our default files. Thus only fast-forward merges are required in
> downstream projects). The same applies for "-m" cmdline option, when `-m
> $PATH/10_myfile.yaml` is set, we should search for other `\d\d_myfile
> files`.

I guess the idea to add number prefixes is because the order
these files are included matter, is that right?

Otherwise, I would say why not just handle $TEST_DATA/*.yaml as
multiplexing files (or $TEST_DATA/mux/*.yaml).

> 
> 
> Structure
> ---------
> 
> This one is tricky and it heavily affects the usage.
> 
> 1. we might not care about the structure (allow everything for advanced
> users)
> 2. we can precisely define the structure (restrict everyone, but simplify
> usage)
> 3. we can automatically create locations (heavily affect tree modifications)
> 4. combine the methods
> 
> 
> Allow everything
> ~~~~~~~~~~~~~~~~
> 
> This one is the most flexible but can surprise users. They are responsible
> for not touching elsewhere used variants, but if they succeed, they can do
> absolutely everything.
> 
> On the other hand it might be harder to write the tests as you need to know
> the namespace you want to get the variable from. Also with multiple
> multiplex files this can get nasty.
> 
> 
> Precise structure
> ~~~~~~~~~~~~~~~~~
> 
> Let me demonstrate this on example (+ => !join, * => !mux):
> 
> -+- config -+- datadir --- paths
>  |          +- sysinfo
>  |          \- runner
>  +- plugins -*- virt -*- hw -*- disk -*- virtio
>  |                    |      |        \- ide
>  |                    |      \- nic -*- virtio
>  |                    |              \- rtl8139
>  |                    \- os ...
>  +- tests -*- HASH1
>  |         \- HASH2

What do you mean by HASH1 and HASH2?

>  \- self
> 
> This is a basic global structure. When we multiplex the file, we get
> (instead of full path I used *+leaf)
> 1. */paths + */sysinfo + */runner + */virtio + */virtio + */os* + /self
> 2. */paths + */sysinfo + */runner + */ide + */virtio + */os* + /self
> ...

So if I understand it correctly, we would provide a pre-populated
default tree with the following branches:

 /
 /config/
 /plugins/
 /tests/
 /self/

> 
> /tests are special. It's parsed out of the cmdline's urls and/or specified
> by mapping file. For each run current test just extends the existing
> "/self". That way we can specify per-test variants.
> 
> All multiplex files goes automatically into "/self" by default. When we
> specify !using, we can override other parts too (?except /config).
> 
> 
> Automatic locations
> ~~~~~~~~~~~~~~~~~~~
> 
> Similar to previous only each file (which is not using !using) goes into
> `/self/$UID` location.
> 
> This brings one benefit and drawback. It supports simple files:
> 
> # simple.yaml
> short:
> medium:
> long:
> 
> Which works fine with multiple independent files:
> 
> # simple2.yaml
> builtin:
> bash:
> 
> But makes it impossible to extend this file from another file:
> 
> # extend.yaml
> longest:
> 
> which results in (without the simple2.yaml):
> 1. */short + */longest
> 2. */medium + */longest
> 3. */long + */longest
> 
> People could use !using to avoid this problem, but that somehow workarounds
> the automatic location and creates the same structure as 1 or 2.
> 
> Also it brings problems to params.get(key, "/self/by_something") as in
> reality it would become "/self/$UID/by_something". Should we search for
> "*/by_something" instead? Or differentiate between "/by_something" vs.
> "by_something" and in case of relative path prepend "/self/*"?.

I'm sorry, I couldn't follow you here.

> 
> 
> Combined
> ~~~~~~~~
> 
> Possibilities are unlimited.
> 
> 
> Command line
> ~~~~~~~~~~~~
> 
> We also need to consider where -m files go (when not using !using). I'd vote
> for /self by default, /self/$UID in case of automatic location.
> 
> Also we might allow -m [$location:]$file to dynamically extend the location.
> (I tend to prefer this to hard-coded locations inside the file). Take
> another look at the "Structure.Automatic locations" simple.yaml and
> simle2.yaml. Instead of automatic handling we can just let users say:
> 
> -m /self/by_duration:simple.yaml -m /self/by_tool:simple2.yaml -m
> /self/by_duration:extend.yaml
> 
> Which does booth; extends the simple.yaml by extend.yaml and adds second set
> of variants from simple2.yaml.
> 
> This itself can help the basic users to create variants ad-hoc while not
> tying our hands to write complex yaml files.

OK, sounds good.

> 
> 
> Getting the params
> ------------------
> 
> OK when we have the tree structure parsed into variants, we execute the
> tests using the params extracted from variants:
> 
> Let's say our variant1 is:
> /config* {...}
> /plugins/virt/hw/disk/virtio {}
> /plugins/virt/hw/nic/virtio {}
> /plugins/virt/os/linux/fedora/21/64 {file:img.qcow2, kickstart:ks.ks, ...}
> /self {file:boot.py, tag:, exec:sleep 1}
> 
> # the plugin uses
> params.get_leaves(/plugins/virt/hw) => [*/disk/virtio, */nic/virtio]
> params.get_leaf(/plugins/virt/hw/disk) => virtio
> params.get_leaf(/plugins/virt/hw/nic) => virtio
> params.get(file, /plugins/virt/os) => img.qcow2

I understand the value in such APIs for debugability purposes.
What else? Why a plugin would use such calls?

And by plugin, I guess you mean 'run' plugin, or are you thinking
of something else?

> 
> # in test we use
> params.get(exec) => sleep 1
> 
> 
> Now imagine that second test defines multiple variants:
> 
> ...
> /self {file:boot.py, tag:}
> /self/by_command/sleep {command:sleep 1}
> 
> now we use params.get(command, /self/by_command) to avoid clashes with
> variables from /self.
> 
> 
> Clashes
> ~~~~~~~
> 
> Simple example:
> 
> /self/by_variable/oneliner {variable: "Single line example"}
> /self/by_other/something {variable: 123}

Just to make sure I'm following: the above is a variant, as
available to the test. 'variable:' is the name of the leaf, which
is duplicated in both branches.

> 
> params.get('variable') => Exception("Multiple occurances found")
> params.get('variable', '/self/by_variable') => "Single line example"
> 
> Let's try this:
> 
> /self {timeout: 10}     # Test's defaults
> /self/by_duration/long {timeout: 60, sleep: 30}
> /self/by_tool/builtin {timeout: 10}@/self {}

What's '@/self {}'?

> 
> params.get(timeout) => raises exception as it's defined multiple times, but
> is it what we want?

I think so... if there are conflicts, we better not make guesses,
unless we provide an explicit option to change the behavior of
the get() method.

(but see my proposal for scope/resolution methods below)

> 
> + namespaces are really separated
> - we can't override defaults inside variants
> 
> We can use params.get(timeout, /self/by_duration), but that would fail if we
> remove the multiplex file with /by_duration (as there would be no
> */by_duration in params).
> 
> We can make /self special and say it contains defaults. So in case path or
> key are not found, we look there:
> 
> /self/by_duration/long {timeout: 60, sleep: 30}
> /self/by_tool/builtin
> ---
> /defaults {timeout: 10}
> 
> params.get(timeout) => 60   # if in other run timeout is not found, uses 10
> 
> 
> This solution also has some holes. Users must make sure not to use keys from
> /defaults multiple times inside variants:
> 
> /self/by_duration/long {timeout: 60, sleep: 30}
> /self/count/many {timeout: 120, count: 3}
> ---
> /defaults {timeout: 10, count: 1}
> 
> params.get(timeout) => Exception("Multiple occurances")
> 
> ^^ I know this example makes little sense, but it demonstrates possible
> issues.

I think we're trying to deal with hypothetical examples which
just add complexity... I suggest to fail with error in such
corner cases for now, until we better understand the problem and
the consequences.

> 
> Additionally we should consider another problem. We have similar example as
> before:
> 
> params.get(count, /self/count) => 3
> 
> in next run we remove the multiplex file specifying /self/count* leafs
> 
> params.get(count, /self/count) => Exception("Path not found")
> 
> should we consider supplying /defaults if path not found? (=> 1)

params.get(count, /self/count, fallback_to_default=True)

(but see my idea about scope and resolution order down below)

> 
> Logging
> -------
> 
> While thinking about optimization and usage, do we need to log all the
> config variables? I remember how hard is it to find something when looking
> at Autotest's output. So why not to log first occurrence of params.get()
> instead?
> 
> ...
> DEBUG: PARAM $name ($path) => $value
> ...
> 
> 
> We can log them:
> 
> 1. in main log when it's inquired (we might need to use grep to see them
> all)
> 2. in main log after the test (in case of panic we might not see them)
> 3. in separate file (sorted alphabetically)
> 4. in separate file (sorted chronologically)
> 5. combination, eg: log all of them into special file and after the test
> remove
>    the file and put only used one there (or in main log).
> 
> Benefits:
> 
> 1. Speed&mem (some leafs might not been processed)
> 2. See what variables were used and when (first inquire)
> 3. Shorter logs
> 
> Drawbacks:
> 
> 1. Necessity to look into another file or at the end instead of beginning.

Whatever you decide, I suggest you don't worry about speed/mem
yet. What matters at this point is usability, optimization should
be at the very very bottom of your priorities for now. Most of us
don't even understand all this stuff yet. We need the best and
most convenient logging and tools you can think of before any
kind of optimization.

> 
> Results
> -------
> 
> Ok, this was long as multiplexer is something new and very flexible right
> now. Let me summarize it by providing example of what I'd choose, please
> feel free to disagree, comment and give advises. (also I'm aware the
> previous brain-dump has big holes in it, but I don't want to avoid rewriting
> this using what I realize while writing it).
> 
> Structure (optional)
> ~~~~~~~~~~~~~~~~~~~~

What do you mean by optional?

> 
> /config     # /etc/avocad/... (using yaml to allow multiplexation)
> /plugins    # When plugin asks for it (eg. `test.VirtTest` would process
>             #                          the virt's yaml file before
>             #                          test execution.

I don't understand this plugin use case. Can you please
elaborate?

> /self       # Default location when -m used

Can you precisely define what -m is for and how it's used?

> /test       # Parsed from cmdline urls (or by mapping file),
>             # each variant extended of `Test.get_variants()`
>             # not available inside the test, only during test parsing...
> 
> Demonstration
> ~~~~~~~~~~~~~
> 
> Instead of drawing trees I'll write yaml representing the tree
> 
> # /etc/avocado/avocado.yaml
> !using : /config
> datadir:
>     log: "/tmp/avocado"
> 
> # myfile.yaml
> key: value
> by_something:
>     variant_a:
>     variant_b:
> 
> avocado run passtest.py boot.py -m myfile.yaml
> 
> # After params parsing looks like this:
> config:
>     datadir:
>         log: "/tmp/avocado"
> self:
>     key: value
>     by_something:
>         variant_a:
>         variant_b:
> tests:
>     passtest.py:
>         id: passtest.py
>     boot.py:
>         id: boot.py

ok, following.

> 
> # Then it starts executing tests
> # passtest.py:
> # 1. executes Test.get_variants()
> self:
>     id: passtest.py
> # 2. merges this tree and the original one (without tests)
> config: # unchanged
> self:
>     id: passtest.py
>     key: value
>     by_something:
>         variant_a:
>         variant_b:
> 
> # 3. executes all variants (2)

ok, following

> 
> # boot.py:
> # 1. executes Test.get_variants()
> plugins:
>     virt:
>         hw:
>             disk:
>                 ide:
>                 scsi:

Where did this come from?

> self:
>     id: boot.py
> # 2. merges this tree and the original one (without tests)
> config: # unchanged
> plugins: # unchanged
> self:
>     id: boot.py
>     key: value
>     by_something:
>         variant_a:
>         variant_b:
> 3) executes all variants (4)

I'm lost, probably missing the context... why passtest.py and
boot.py behave differently?

> 
> This requires not to reuse keys from /self as all children would share them
> and params.get(key) would raise Exception. In my opinion this is sensible
> and it simplifies the usage.
> 
> 
> Params
> ~~~~~~
> 
> params.get(key, default=None, strict=False, path=None)
>   * path:
>     * None  => "/self[/.*]?"
>     * PATH  => "/self/[/.*]?PATH[/.*]?"

I would not use PATH as an intermediary part. If a relative path
is provided, it means the resolution starts from the right to the
left of the variable.

So /a/b/c/d/e/foobar can be retrieved either via

params.get(foobar, path="e")
params.get(foobar, path="b/c/d/e")

But not via

params.get(foobar, path="c")

Still regarding the usage of local variables (the ones that don't
start with /, or have no path=), I discussed resolution orders
and scopes with you the other day, maybe you remember, or maybe
my proposal was not clear, so I'll explain it here:

Just like we have scope when we're programming, we would have
scope when looking for multiplex variables. Anything that doesn't
start with / would be local. Then we would define the resolution
order of the local scope and return the first occurrence inside
that scope (and error out with duplication only if there are two
occurrences that match the same relative path inside that the
same level).

For example, the scope for local variables could be:

   1. /self
   2. default value provided by the test
   3. ERROR (not found)

OR

   1. /self
   2. /tests
   3. /config
   4. default value provided by the test
   5. ERROR (NOT FOUND)

OR:

   1. /self
   2. /tests
   3. /config
   4. /
   5. default value provided by the test
   6. ERROR (NOT FOUND)

OR:
   1. /self/
   2. /whatever-file-you-provided-in-the-command-line
   3. /whatever-file-you-inherited-from-upstream
   4. default value provided by the test
   5. ERROR (NOT FOUND)

In other words, we simplify the problem and make a clear
separation between the API that retrieves the variables and the
source of the variable. The test writer doesn't even need to
understand all of the details.

And better of all: we can explain the whole concept in just a
couple of paragraphs, without need to resort to a lot of
exceptions that leak implementation details (remember one of the
requirements for this API is for it to be independent of the
multiplex mechanism).

Now, I of course may be oversimplifying the whole thing, so do
you see any problems with the layout above? AFAICS, we can still
apply the use-cases you've presented to this layout.

>     * /PATH => "/PATH[/.*]?

Correct.

>     * '*'   => .*    # eg: */PATH => .*/PATH[/.*]?
>                      # eg: PA*TH => "/self/[/.*]?PA.*TH[/.*]?"

I would use '*' as blobs, similar to file-system resolution. So
you almost got it right:

     * '*'   => .*    # eg: */PATH => .+/PATH[/.*]?
                      #     (won't match /PATH)
                      # eg: PA*TH => "/self/[/.*]?PA.*TH[/.*]?"

                      (but see note about resolving local
                      variables from the right to the left above)

>   * strict: exception on missing value?
>   * default: default value when not found
>   * Exception when multiple matches of !different! value
>     In case values are the same, use them (inherited values)
> 
> params.get_leaves(path) => return all leaves matching path
> 
> params.get_leaf(path) => return single leaf or exception when 0 or 2+

> 
> 
> Locations
> ~~~~~~~~~
> 
> 1. /etc/avocado/avocado.yaml + conf.d/* files (usually !using: /config)
> 2. API for plugins to inject tree (during init or Test.get_variants())
> 3. cmdline $test1 $test2 $test3 => /tests/HASH {url: $test1}, ...
>    extended by Test.get_variants() during execution.
> 4. "-m $path" extends /self
> 5. "-m $path:$file" extends $path (when $path starts with '/', extends
>                                    global, otherwise injects into /self)
> 6. when -m 10_file.yaml specified, all files matching \d\d_file.yaml
>    files are loaded.

Hmm, this would hackish, no? The reason many systems add a number
prefix to files is to make them easy to sort. Why not go with
'-m *file.yaml'?

> 
> 
> Output
> ~~~~~~
> 
> 1. Dump first occurrence of the params.get() in the main log with common
> prefix to be able to grep for it:
> 
> ...
> INFO | Starting machine
> DEBUG| PARAM: /plugins/virt/os/fedora/21/64 image = image.qcow2
> DEBUG| Logging into machine...
> DEBUG| PARAM: /plugins/virt/os/fedora/21/64 password = 123456
> DEBUG| PARAM: /self exec = sleep 1
> DEBUG| Executing binary "sleep 1"
> ...
> 
> 
> Execution
> ~~~~~~~~~
> 
> --mux NOMUX|DEFAULT|ALL
> 
> where plugins and tests should look at /config/multiplex/mux value to
> distinguish between variants.
> 
> NOMUX => only execute single variant of the test with defaults
> DEFAULT => basic multiplexation of the file
> ALL => multiplex everything
> 
> NOTE: It's possible to say --nomux -m /:$FILE to use only custom multiplex
> file (/config and /test would be merged).
> NOTE2: --mux DEFAULT with virt test would mean use default
> Test.get_variants() but don't parse the $virttest.yaml.
> 

I think these parameters would match perfectly well to the scope
and resolution order idea I proposed above... what do you think?

> 
> Ugh, bzzz, grrr, kukikuki, agh... please feel free to comment or just
> provide examples of where it fails and why. Examples are important.
> 

Thanks.
   - Ademar

-- 
Ademar Reis
Red Hat

^[:wq!