[Avocado-devel] Ways to solve asset fetcher clashes

Tue Jun 12 18:27:50 UTC 2018

Hello guys,

As Cleber pointed out at the release meeting, we are struggling with our asset fetcher. It was designed with one goal, to cache arbitrary files by name, and it fails when different assets use the same name, as it simply assumes they are the same thing.

Let's have a look at this example:

    fetch("https://www.example.org/foo.zip")
    fetch("https://www.example.org/bar.zip")
    fetch("http://www.example.org/foo.zip")

Currently the third fetch simply uses the "foo.zip" downloaded from "https", even though it could be a completely different file (or downloaded from a completely different URL). This is both good and bad. It's good when you're downloading **any** "ltp.tar.bz2" or **any** "netperf.zip", but if you are downloading "vmlinuz", which is always called "vmlinuz" but comes from a different subdirectory, it might lead to big problems.

From this I can see two modes of assets, anonymous and specific. Instead of trying to detect the mode based on combinations of hashes and methods, I'd suggest being explicit and either adding it as an extra argument, or even creating new classes `AnonymousAsset` and `SpecificAsset`, where `AnonymousAsset` would be the current implementation and we still need to decide on the `SpecificAsset` implementation. Let's discuss some approaches, using the three fetches above as the example.

Current implementation
----------------------

The current implementation is anonymous: the last fetch simply returns the "foo.zip" fetched by the first fetch.

Result:

    foo.zip    # This one is fetched from "https"
    bar.zip

+ simplest
- leads to clashes
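
The clash can be reproduced with a minimal name-keyed cache. This is only a sketch of the behaviour described above, not the actual Avocado code: `anonymous_cache_path`, `fetch` and `fake_download` are hypothetical names, and the "download" just returns the URL as content so no network is needed.

```python
import os
import tempfile


def anonymous_cache_path(cache_dir, url):
    """Return the cache location for url, keyed only by its file name."""
    return os.path.join(cache_dir, os.path.basename(url))


def fetch(cache_dir, url, download):
    """Return the cached file for url, calling download(url) only on a miss."""
    path = anonymous_cache_path(cache_dir, url)
    if not os.path.exists(path):
        with open(path, "wb") as cached:
            cached.write(download(url))
    return path


def fake_download(url):
    # Stand-in for a real download: the content is simply the URL itself.
    return url.encode("utf-8")


cache_dir = tempfile.mkdtemp()
first = fetch(cache_dir, "https://www.example.org/foo.zip", fake_download)
second = fetch(cache_dir, "https://www.example.org/bar.zip", fake_download)
# Cache hit on the name "foo.zip": the http URL is never downloaded.
third = fetch(cache_dir, "http://www.example.org/foo.zip", fake_download)
```

Here `third` is the same path as `first`, and the file still holds what was fetched from "https".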

Hashed url dir
--------------

I can see multiple options. Cleber proposed in https://github.com/avocado-framework/avocado/pull/2652 to create, in such a case, a dir named after "hash(url)" and store all assets of the given URL there. It seems fairly simple to develop and maintain, but the cache might become hard to upkeep and there is a non-zero (although very close to zero) possibility of clashes.

Another problem would be concurrent access, as we might start downloading a file with the same name as a URL dir, along with all kinds of other clashes, and we'll only find out about all the issues when people start using this extensively.

Result:

    2e3d2775159c4fbffa318aad8f9d33947a584a43/foo.zip    # Fetched from https
    2e3d2775159c4fbffa318aad8f9d33947a584a43/bar.zip
    6386f4b6490baddddf8540a3dbe65d0f301d0e50/foo.zip    # Fetched from http

+ simple to develop
+ simple to maintain
- possible clashes
- hard to browse manually
- API changes might lead to unusable files (users would have to remove files manually)
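
A minimal sketch of how such a path could be derived. It assumes SHA-1 over the URL with the file name stripped, which is a guess that merely matches the layout in the result above (two files from the same location share a dir); the scheme used in the pull request may differ. `hashed_cache_path` is a hypothetical helper, not an Avocado API.

```python
import hashlib
import os
import posixpath
from urllib.parse import urlsplit


def hashed_cache_path(cache_dir, url):
    """Map url to <cache_dir>/<sha1 of the URL's parent location>/<file name>."""
    parts = urlsplit(url)
    # Assumption: hash scheme, host and parent directory, but not the file name,
    # so that assets from the same location end up in the same directory.
    url_dir = "%s://%s%s" % (parts.scheme, parts.netloc,
                             posixpath.dirname(parts.path))
    digest = hashlib.sha1(url_dir.encode("utf-8")).hexdigest()
    return os.path.join(cache_dir, digest, posixpath.basename(parts.path))


https_foo = hashed_cache_path("/tmp/cache", "https://www.example.org/foo.zip")
https_bar = hashed_cache_path("/tmp/cache", "https://www.example.org/bar.zip")
http_foo = hashed_cache_path("/tmp/cache", "http://www.example.org/foo.zip")
```

With this layout the two "https" assets share one directory while the "http" one lands in another, so the two "foo.zip" files no longer collide.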

sqlite
------

Another approach would be to create an sqlite database in every cache-dir. For anonymous assets nothing would change, but for specific assets we'd create a new tmpdir per asset and store the mapping in the database.

Result:

    .avocado_cache.sqlite
    foo-1X3s/foo.zip
    bar-3s2a/bar.zip
    foo-t3d2/foo.zip

where ".avocado_cache.sqlite" would contain:

    https://www.example.org/foo.zip  foo-1X3s/foo.zip
    https://www.example.org/bar.zip  bar-3s2a/bar.zip
    http://www.example.org/foo.zip   foo-t3d2/foo.zip

Obviously, by creating a db we could improve many things. The first example would be to store an expiration date and, based on the last access to the db, run cache-dir upkeep, removing outdated assets.

Another improvement would be to store the downloaded asset's hash and to re-download the file and update the hash when the file was modified, even when the user didn't provide a hash.

+ easy to browse manually
+ should be simple to expand the features (upkeep, security, ...)
+ should simplify locks, as we can atomically move the downloaded file and update the db. Even crashes should lead to predictable behavior
- slightly more complicated to develop
- "db" file would have to be protected

Other solutions
---------------

There are many other solutions like using `$basename-$url_hash` as the name or using `astring.string_to_safe_path` instead of url_hash and so on. We are open to suggestions.
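
For illustration, the `$basename-$url_hash` variant might look like the sketch below. The choice of SHA-1 over the full URL and the helper name `flat_cache_name` are assumptions; `astring.string_to_safe_path` is deliberately not used here.

```python
import hashlib
import posixpath
from urllib.parse import urlsplit


def flat_cache_name(url):
    """Build a single-level cache name of the form <basename>-<sha1(url)>."""
    name = posixpath.basename(urlsplit(url).path)
    url_hash = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return "%s-%s" % (name, url_hash)


https_name = flat_cache_name("https://www.example.org/foo.zip")
http_name = flat_cache_name("http://www.example.org/foo.zip")
```

This keeps the cache flat and the original name visible at the start of each entry, at the price of the file no longer having its original name or extension.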

Questions
=========

There are basically two questions:

1. Do we want to set the mode (anonymous/specific) explicitly and, if so, in which way, and what should we call the modes?
2. Which implementation do we want to use (are there existing solutions we can simply use)?
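
To make question 1 more concrete, the class-based variant could be shaped roughly like this. It is only a sketch: the constructor arguments and the `cache_path` method are hypothetical, and the hashed directory in `SpecificAsset` is a placeholder for whichever implementation question 2 settles on.

```python
import hashlib
import os


class BaseAsset:
    """Common part: remember where the asset comes from and where to cache it."""

    def __init__(self, url, cache_dir):
        self.url = url
        self.cache_dir = cache_dir
        self.name = os.path.basename(url)


class AnonymousAsset(BaseAsset):
    """Any file with this name will do, wherever it was fetched from."""

    def cache_path(self):
        return os.path.join(self.cache_dir, self.name)


class SpecificAsset(BaseAsset):
    """Only the file from this exact URL will do."""

    def cache_path(self):
        # Placeholder strategy: a per-URL directory named after a SHA-1 digest.
        url_hash = hashlib.sha1(self.url.encode("utf-8")).hexdigest()
        return os.path.join(self.cache_dir, url_hash, self.name)


anon_a = AnonymousAsset("https://www.example.org/foo.zip", "/tmp/cache")
anon_b = AnonymousAsset("http://www.example.org/foo.zip", "/tmp/cache")
spec_a = SpecificAsset("https://www.example.org/foo.zip", "/tmp/cache")
spec_b = SpecificAsset("http://www.example.org/foo.zip", "/tmp/cache")
```

The two anonymous assets resolve to the same path while the two specific ones do not; the extra-argument variant would express the same choice as a flag on a single class.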
