[Softwarefactory-dev] Zuul may be randomly missing ref-updated event.

Tristan Cacqueray tdecacqu at redhat.com
Fri Dec 2 07:34:03 UTC 2016


Hello folks,

apologize for this long post, but this issue is randomly failing CI job
and I would like to share the result of a long and subtle gray hair
inducing investigation :)

The most obvious symptom in the CI tests is the health-check failure of
this task: "Check for config-update post job". It can't find the commit
sha in config-update/builds/lastStableBuild logs. This is abnormal since
a previous task made sure the change has been Verified +2 and that it
was successfully merged, so there should have been a config-update run.

I believe this failure may also be affecting other tests that rely on
the config-update job to succeed (such as the test_project_test and
test_resource_workflow).


Here is an analysis of this failure:
https://softwarefactory-project.io/jenkins/job/software-factory-functional-trackerless/46/console

The relevant lines are:
[+] Starting the health base check
Fri Dec  2 00:33:53 UTC 2016 Running health-check/zuul.yaml
Zuul integration test SUCCESS
Fri Dec  2 00:35:43 UTC 2016 Running health-check/gerritbot.yaml
[...]
TASK [Check for config-update post job]
****************************************
fatal: [managesf.sftests.com]: FAILED! => {"changed": false, "elapsed":
1200, "failed": true, "msg": "Timeout when waiting for search string
67ddb89966cc4999df1869d99d140dd969963416 in
/var/lib/jenkins/jobs/config-update/builds/lastStableBuild/build.xml"}
	to retry, use: --limit @/etc/ansible/health-check/gerritbot.retry


Those two health-check (zuul and gerritbot) are very similar as they
both trigger a config-update job.  Looking in zuul.log, we can observe
how the first change (1,1) is being merged and how the config-update is
being successfully executed:
https://softwarefactory-project.io/paste/show/661/

As you can see, gerrit.GerritWatcher received two events: change-merged
and ref-updated, resulting in two trigger event eventually running the
config-update.


Then, gerritbot test merge another config change (4,1),
gerrit.GerritWatcher didn't received the ref-update event and the
config-update didn't get triggered:
https://softwarefactory-project.io/paste/show/662/


The question is, does gerrit failed to emit that event, or does zuul
failed to process it? Using a fresh sf deployment, I tried to reproduce
the failure while dumping the gerrit events-stream in parallel:

    ssh sftests.com gerrit stream-events > stream.log &
    while true; do
	ansible-playbook health-check/zuul.yaml && \
	ansible-playbook health-check/gerritbot.yaml || break;
    done


After about 3 hours of execution, it finally reproduces the infamous error:
"Timeout when waiting for search string
65e5ef9b9c4de1d1e3f037f79efe9b55a7da9ccd in
/var/lib/jenkins/jobs/config-update/builds/lastStableBuild/build.xml"

The stream.log do have both events logged:
https://softwarefactory-project.io/paste/show/664/

The zuul.log doesn't show any trace of the last updated-ref:
https://softwarefactory-project.io/paste/show/665/


That's it for now, next is to debug the GerritWatcher, but since it's
happening so rarely, it's not an easy task.

Alternatively, it may be worthy to run post pipeline on change-merged
instead of ref-updated.

If anybody have an idea what's going with that failure, please chime in!

Yours,
-Tristan

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 473 bytes
Desc: OpenPGP digital signature
URL: <http://listman.redhat.com/archives/softwarefactory-dev/attachments/20161202/a6690e9e/attachment.sig>


More information about the Softwarefactory-dev mailing list