[Avocado-devel] Ways to solve asset fetcher clashes

Philippe Mathieu-Daudé f4bug at amsat.org
Thu Jun 21 18:46:11 UTC 2018


Hi Lukáš,

On 06/18/2018 11:50 PM, Cleber Rosa wrote:
> On 06/12/2018 02:27 PM, Lukáš Doktor wrote:
[...]
>> Hashed url dir
>> --------------
>>
>> I can see multiple options. Cleber proposed in https://github.com/avocado-framework/avocado/pull/2652 to create, in such a case, a directory named "hash(url)" and store all assets of the given url there. It seems fairly simple to develop and maintain, but the cache might become hard to keep up, and there is a non-zero (though nearly zero) possibility of clashes.
>>
> 
> Definitely my goals were to keep it simple, and to fix the one issue at
> hand.  My motivations are based on the idea that Avocado should be able
> to help test writers with common tasks in tests, but it's hardly going
> to contain in itself full-blown and complete solutions to generic
> problems such as downloading and caching files.
> 
> Now, it'd be nice to describe the possible clashes (besides, of course,
> possible hash collisions).  I can think of one:
> 
>  * Anonymous with url "http://foo/bar.zip"
>  * "Specific" with url "http://baz/4b3d163d2565966cf554ccb6e81f31f8695ceb54"
> 
> The clash exists because sha1("http://foo/bar.zip") is
> 4b3d163d2565966cf554ccb6e81f31f8695ceb54.
> 
>> Another problem would be concurrent access, as we might start downloading a file with the same name as a url dir, plus all kinds of other clashes, and we'll only find out all the issues once people start using this extensively.
> 
> I'm not sure I follow.  Right now the *downloaded* files are given
> temporary names.  Then, a lock for the destination name is acquired,
> verification is done, and if successful, the file is copied into the
> destination name.
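
For context, the flow described above would look roughly like this (a
simplified sketch only, not the actual avocado.utils.asset code; the
lock is a naive lock file, and urllib stands in for the real download
code):

import hashlib
import os
import shutil
import tempfile
import urllib.request

def fetch_to_cache(url, cache_dir, expected_sha1=None):
    destination = os.path.join(cache_dir, os.path.basename(url))
    # 1) download into a temporary name, never directly into the final name
    fd, tmp_path = tempfile.mkstemp(dir=cache_dir)
    os.close(fd)
    urllib.request.urlretrieve(url, tmp_path)
    # 2) take a (naive) lock on the destination name
    lock_fd = os.open(destination + '.lock',
                      os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    try:
        # 3) verify, and only then move the file into the destination name
        if expected_sha1 is not None:
            with open(tmp_path, 'rb') as tmp:
                if hashlib.sha1(tmp.read()).hexdigest() != expected_sha1:
                    os.remove(tmp_path)
                    raise ValueError('hash mismatch for %s' % url)
        shutil.move(tmp_path, destination)
    finally:
        os.close(lock_fd)
        os.remove(destination + '.lock')
    return destination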
> 
>>
>> Result:
>>
>>     2e3d2775159c4fbffa318aad8f9d33947a584a43/foo.zip    # Fetched from https
>>     2e3d2775159c4fbffa318aad8f9d33947a584a43/bar.zip
>>     6386f4b6490baddddf8540a3dbe65d0f301d0e50/foo.zip    # Fetched from http
>>
>> + simple to develop
>> + simple to maintain
> 
> Agreed.
> 
>> - possible clashes
> 
> Agreed, ideally there should be none.
> 
>> - hard to browse manually
> 
> True.  As a note, I'd trade the ease of browsing the cache for a simpler
> overall implementation, because the "browsing" could then be done with
> simple scripts reusing the same implementation (like a contrib script
> importing from "avocado.utils.asset").

As an end user, I agree with Cleber that there is no interest in browsing
the cache.
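
For reference, the hash(url) directory layout above can be computed with
something as small as this (a sketch only; "get_asset_path" is a
hypothetical helper, not the actual avocado.utils.asset code, and it
assumes sha1 over the full url string with the url basename as the file
name):

import hashlib
import os
from urllib.parse import urlparse

def get_asset_path(cache_dir, url):
    # <cache_dir>/<sha1(url)>/<basename>
    url_hash = hashlib.sha1(url.encode('utf-8')).hexdigest()
    basename = os.path.basename(urlparse(url).path)
    return os.path.join(cache_dir, url_hash, basename)

# the same file fetched over https and http lands in two different dirs:
#   <cache_dir>/<sha1("https://www.example.org/foo.zip")>/foo.zip
#   <cache_dir>/<sha1("http://www.example.org/foo.zip")>/foo.zip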

> 
>> - API changes might lead to unusable files (users would have to remove files manually)
> 
> This too, but I really think this is minor.  We may want to treat
> cache stability along the same lines as API stability (on LTS
> releases), but I don't think we've made any promises about it yet.
> 
>>
>>
>> sqlite
>> ------
>>
>> Another approach would be to create an sqlite database in every cache-dir. For anonymous assets nothing would change, but for specific assets we'd create a new tmpdir per asset and store the mapping in the database.
>>
>> Result:
>>
>>     .avocado_cache.sqlite
>>     foo-1X3s/foo.zip
>>     bar-3s2a/bar.zip
>>     foo-t3d2/foo.zip
>>
>> where ".avocado_cache.sqlite" would contain:
>>
>>     https://www.example.org/foo.zip  foo-1X3s/foo.zip
>>     https://www.example.org/bar.zip  bar-3s2a/bar.zip
>>     http://www.example.org/foo.zip   foo-t3d2/foo.zip
>>
>> Obviously, by creating a db we could improve many things. A first example would be to store an expiration date and, based on the last access recorded in the db, run cache-dir upkeep to remove outdated assets.
>>
>> Another improvement would be to store the hash of the downloaded asset and re-download it (updating the stored hash) when the file is modified, even when the user didn't provide a hash.
>>
>> + easy to browse manually
>> + should be simple to expand the features (upkeep, security, ...)
>> + should simplify locking, as we can atomically move the downloaded file and update the db. Even crashes should lead to predictable behavior
>> - slightly more complicated to develop

This looks a bit overkill.

>> - "db" file would have to be protected
>>
>>
> 
> I'm a bit skeptical about the "improvements" here, not because they are
> not valid features, but because we don't need them at this point.  I
> don't think future possibilities should shape the immediate architecture
> decisions too much.
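
For completeness, the url -> path mapping described in the sqlite
proposal could be as small as this (a sketch with made-up names, not a
proposed implementation):

import os
import sqlite3

def open_cache_db(cache_dir):
    db = sqlite3.connect(os.path.join(cache_dir, '.avocado_cache.sqlite'))
    db.execute('CREATE TABLE IF NOT EXISTS assets '
               '(url TEXT PRIMARY KEY, path TEXT)')
    return db

def record_asset(db, url, relative_path):
    # e.g. record_asset(db, "https://www.example.org/foo.zip",
    #                   "foo-1X3s/foo.zip")
    db.execute('INSERT OR REPLACE INTO assets VALUES (?, ?)',
               (url, relative_path))
    db.commit()

def lookup_asset(db, url):
    row = db.execute('SELECT path FROM assets WHERE url = ?',
                     (url,)).fetchone()
    return row[0] if row else None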
> 
>> Other solutions
>> ---------------
>>
>> There are many other solutions, like using `$basename-$url_hash` as the name, using `astring.string_to_safe_path` instead of url_hash, and so on. We are open to suggestions.
[...]
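
Of those, the `$basename-$url_hash` naming would amount to roughly this
(again just a sketch, not tied to the actual utilities):

import hashlib
import os
from urllib.parse import urlparse

def flat_cache_name(url):
    # yields names like "foo.zip-<short sha1(url)>" in a single flat dir
    url_hash = hashlib.sha1(url.encode('utf-8')).hexdigest()[:8]
    return '%s-%s' % (os.path.basename(urlparse(url).path), url_hash)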

My suggestion:

unified_url = unify_scheme(url)          # normalize http/https, etc.
url_hash = SHA1(unified_url)
content = fetch(url)
if success:
    os.chdir(cache_dir)
    content_hash = SHA1(content)
    # store the content under its own hash, only once
    if not os.path.exists(content_hash):
        open(content_hash, 'wb').write(content)
    # map the (unified) url to the content via a symlink
    if not os.path.lexists(url_hash):
        os.symlink(content_hash, url_hash)

So various urls can point to the same cached file.
