[Avocado-devel] no sigterm_handler stack trace on test timeout?

Lukáš Doktor ldoktor at redhat.com
Wed Nov 14 12:03:14 UTC 2018


On 13. 11. 2018 at 16:22, Brian J. Murrell wrote:
> When I run a sample test:
> 
> ```
> import time
> from avocado import Test
> 
> class TimeoutTest(Test):
> 
>     timeout = 3
> 
>     def setUp(self):
>         print "setUp"
> 
>     def tearDown(self):
>         print "tearDown"
> 
>     def test(self):
>         try:
>             sleep_time = self.params.get('sleep_time', default=5)
>             self.log.info('Sleeping for %.2f seconds (%.2f more than the timeout)',
>                           sleep_time, sleep_time - self.timeout)
>             time.sleep(sleep_time)
>         except ValueError as e:
>             print e
> ```
> 
> to demonstrate what a timeout looks like I get:
> 
> ```
> 2018-11-13 10:02:55,588 varianter        L0112 DEBUG| PARAMS (key=timeout, path=*, default=3) => 3
> 2018-11-13 10:02:55,588 test             L0240 INFO | START 1-timeouttest.py:TimeoutTest.test
> 2018-11-13 10:02:55,725 output           L0655 DEBUG| setUp
> 2018-11-13 10:02:55,726 varianter        L0112 DEBUG| PARAMS (key=sleep_time, path=*, default=5) => 5
> 2018-11-13 10:02:55,726 timeouttest      L0032 INFO | Sleeping for 5.00 seconds (2.00 more than the timeout)
> 2018-11-13 10:02:58,623 stacktrace       L0039 ERROR| 
> 2018-11-13 10:02:58,624 stacktrace       L0042 ERROR| Reproduced traceback from: /usr/lib/python2.7/site-packages/avocado/core/test.py:596
> 2018-11-13 10:02:58,624 stacktrace       L0045 ERROR| Traceback (most recent call last):
> 2018-11-13 10:02:58,625 stacktrace       L0045 ERROR|   File "/home/brian/daos/daos/src/tests/ftest/timeouttest.py", line 33, in test
> 2018-11-13 10:02:58,625 stacktrace       L0045 ERROR|     time.sleep(sleep_time)
> 2018-11-13 10:02:58,625 stacktrace       L0045 ERROR|   File "/usr/lib/python2.7/site-packages/avocado/core/runner.py", line 312, in sigterm_handler
> 2018-11-13 10:02:58,625 stacktrace       L0045 ERROR|     raise RuntimeError("Test interrupted by SIGTERM")
> 2018-11-13 10:02:58,626 stacktrace       L0045 ERROR| RuntimeError: Test interrupted by SIGTERM
> 2018-11-13 10:02:58,626 stacktrace       L0046 ERROR| 
> 2018-11-13 10:02:58,626 test             L0601 DEBUG| Local variables:
> 2018-11-13 10:02:58,663 test             L0604 DEBUG|  -> self <class 'timeouttest.TimeoutTest'>: 1-timeouttest.py:TimeoutTest.test
> 2018-11-13 10:02:58,663 test             L0604 DEBUG|  -> sleep_time <type 'int'>: 5
> 2018-11-13 10:02:58,663 output           L0655 DEBUG| tearDown
> 2018-11-13 10:02:58,664 test             L0729 ERROR| Traceback (most recent call last):
> 
> 2018-11-13 10:02:58,664 test             L0729 ERROR|   File "/usr/lib/python2.7/site-packages/avocado/core/test.py", line 667, in _run_avocado
>     raise test_exception
> 
> 2018-11-13 10:02:58,664 test             L0729 ERROR| RuntimeError: Test interrupted by SIGTERM
> 
> 2018-11-13 10:02:58,664 test             L0744 ERROR| ERROR 1-timeouttest.py:TimeoutTest.test -> RuntimeError: Test interrupted by SIGTERM
> 2018-11-13 10:02:58,664 test             L0733 INFO | 
> 2018-11-13 10:02:58,674 runner           L0239 DEBUG| Original status: {'traceback': 'Traceback (most recent call last):\n  File "/usr/lib/python2.7/site-packages/avocado/core/test.py", line 667, in _run_avocado\n    raise test_exception\nRuntimeError: Test interrupted by SIGTERM\n', 'class_name': 'TimeoutTest', 'time_end': 1542121378.664672, 'paused': False, 'time_start': 1542121375.724386, 'basedir': '/home/brian/daos/daos/src/tests/ftest', 'time_elapsed': 2.9402859210968018, 'params': <AvocadoParams *: * ([]),0: / (['//']),1: r ([]),2: u ([]),3: n ([])>, 'job_logdir': '/home/brian/daos/daos/src/tests/ftest/avocado/job-results/job-2018-11-13T10.02-2629ff0', 'job_unique_id': '2629ff06bbecdbf6e3c6f49019d72fad50fdd190', 'srcdir': '/var/tmp/avocado_ncd6mf/1-timeouttest.py_TimeoutTest.test/src', 'logdir': '/home/brian/daos/daos/src/tests/ftest/avocado/job-results/job-2018-11-13T10.02-2629ff0/test-results/1-timeouttest.py:TimeoutTest.test', 'status': 'ERROR', 'workdir': '/var/tmp/avocado
>  _ncd6mf/1-timeouttest.py_TimeoutTest.test', 'running': False, 'logfile': '/home/brian/daos/daos/src/tests/ftest/avocado/job-results/job-2018-11-13T10.02-2629ff0/test-results/1-timeouttest.py:TimeoutTest.test/debug.log', 'fail_reason': 'Test interrupted by SIGTERM', 'name': '1-timeouttest.py:TimeoutTest.test', 'whiteboard': '', 'fail_class': 'RuntimeError', 'timeout': 3, 'paused_msg': ''}
> 2018-11-13 10:02:58,674 runner           L0253 ERROR| ERROR Test reported status but did not finish -> TestAbortedError: 1-timeouttest.py:TimeoutTest.test.
> 2018-11-13 10:02:58,674 runner           L0061 ERROR| Runner error occurred: Timeout reached
> Original status: ERROR
> {'whiteboard': '', 'class_name': 'TimeoutTest', 'time_end': -1, 'paused': False, 'time_start': -1, 'early_status': True, 'basedir': '/home/brian/daos/daos/src/tests/ftest', 'time_elapsed': 3.099815845489502, 'params': <AvocadoParams *: * ([]),0: / (['//']),1: r ([]),2: u ([]),3: n ([])>, 'job_logdir': '/home/brian/daos/daos/src/tests/ftest/avocado/job-results/job-2018-11-13T10.02-2629ff0', 'job_unique_id': '2629ff06bbecdbf6e3c6f49019d72fad50fdd190', 'srcdir': '/var/tmp/avocado_ncd6mf/1-timeouttest.py_TimeoutTest.test/src', 'logdir': '/home/brian/daos/daos/src/tests/ftest/avocado/job-results/job-2018-11-13T10.02-2629ff0/test-results/1-timeouttest.py:TimeoutTest.test', 'status': 'ERROR', 'workdir': '/var/tmp/avocado_ncd6mf/1-timeouttest.py_TimeoutTest.test', 'text_output': '10:02:55 DEBUG| setUp\n10:02:55 DEBUG| PARAMS (key=sleep_time, path=*, default=5) => 5\n10:02:55 INFO | Sleeping for 5.00 seconds (2.00 more than the timeout)\n10:02:58 ERROR| \n10:02:58 ERROR| Reproduced traceback 
>  from: /usr/lib/python2.7/site-packages/avocado/core/test.py:596\n10:02:58 ERROR| Traceback (most recent call last):\n10:02:58 ERROR|   File "/home/brian/daos/daos/src/tests/ftest/timeouttest.py", line 33, in test\n10:02:58 ERROR|     time.sleep(sleep_time)\n10:02:58 ERROR|   File "/usr/lib/python2.7/site-packages/avocado/core/runner.py", line 312, in sigterm_handler\n10:02:58 ERROR|     raise RuntimeError("Test interrupted by SIGTERM")\n10:02:58 ERROR| RuntimeError: Test interrupted by SIGTERM\n10:02:58 ERROR| \n10:02:58 DEBUG| Local variables:\n10:02:58 DEBUG|  -> self <class \'timeouttest.TimeoutTest\'>: 1-timeouttest.py:TimeoutTest.test\n10:02:58 DEBUG|  -> sleep_time <type \'int\'>: 5\n10:02:58 DEBUG| tearDown\n10:02:58 ERROR| Traceback (most recent call last):\n\n10:02:58 ERROR|   File "/usr/lib/python2.7/site-packages/avocado/core/test.py", line 667, in _run_avocado\n    raise test_exception\n\n10:02:58 ERROR| RuntimeError: Test interrupted by SIGTERM\n\n10:02:58 ERROR| ERROR 
>  1-timeouttest.py:TimeoutTest.test -> RuntimeError: Test interrupted by SIGTERM\n10:02:58 INFO | \n', 'running': False, 'logfile': '/home/brian/daos/daos/src/tests/ftest/avocado/job-results/job-2018-11-13T10.02-2629ff0/test-results/1-timeouttest.py:TimeoutTest.test/debug.log', 'fail_reason': 'Test reported status but did not finish', 'name': '1-timeouttest.py:TimeoutTest.test', 'traceback': 'Traceback not available', 'fail_class': 'type', 'timeout': 3, 'paused_msg': ''}
> ```
> 
> Observing the traceback that comes from sigterm_handler, I assume that
> every test that times out should have this.  Is this a correct
> assumption?
> 

Yes, this is correct.

> However during a test that ran in our testsuite last night I see:
> 
> ```
> 2018-11-13 00:48:14,478 varianter        L0112 DEBUG| PARAMS (key=timeout, path=*, default=None) => None
> 2018-11-13 00:48:14,478 test             L0240 INFO | START 3-/var/lib/jenkins/workspace/daos-stack-org_daos_PR-13-centos7/src/tests/ftest/pool/PoolSvc.py:PoolSvc.test_poolsvc;mode_RW-createset-createsize-svc3-test_machine1-server-aa67
> 2018-11-13 00:48:14,618 varianter        L0112 DEBUG| PARAMS (key=server_group, path=/server/, default=daos_server) => 'daos_server'
> 2018-11-13 00:48:14,632 varianter        L0112 DEBUG| PARAMS (key=test_machines, path=/run/hosts/*, default=None) => ['wolf-53vm2', 'wolf-53vm3', 'wolf-53vm4', 'wolf-53vm5']
> ...
> 2018-11-13 00:48:23,420 varianter        L0112 DEBUG| PARAMS (key=mode, path=/run/createtests/createmode/*/, default=None) => 146
> 2018-11-13 00:48:23,420 varianter        L0112 DEBUG| PARAMS (key=setname, path=/run/createtests/createset/, default=None) => 'daos_server'
> 2018-11-13 00:48:23,421 varianter        L0112 DEBUG| PARAMS (key=size, path=/run/createtests/createsize/, default=None) => 1073741
> 2018-11-13 00:48:23,421 varianter        L0112 DEBUG| PARAMS (key=svc, path=/run/createtests/createsvc/*/, default=None) => [3, 'PASS']
> 2018-11-13 00:48:31,701 process          L0389 INFO | Running '/var/lib/jenkins/workspace/daos-stack-org_daos_PR-13-centos7/install/bin/daosctl kill-leader  --uuid=7E461A34-F049-4925-847A-5DA1D9D35530'
> 2018-11-13 00:48:32,095 process          L0499 INFO | Command '/var/lib/jenkins/workspace/daos-stack-org_daos_PR-13-centos7/install/bin/daosctl kill-leader  --uuid=7E461A34-F049-4925-847A-5DA1D9D35530' finished with 0 after 0.216053962708s
> 2018-11-13 00:48:32,104 process          L0488 DEBUG| [stdout] SUCCESS killing pool service leader
> 2018-11-13 01:18:15,511 runner           L0239 DEBUG| Original status: {}
> 2018-11-13 01:18:15,555 runner           L0253 ERROR| ERROR Test died without reporting the status. -> TestAbortedError: 3-/var/lib/jenkins/workspace/daos-stack-org_daos_PR-13-centos7/src/tests/ftest/pool/PoolSvc.py:PoolSvc.test_poolsvc;mode_RW-createset-createsize-svc3-test_machine1-server-aa67.
> 2018-11-13 01:18:15,555 runner           L0255 WARNI| Killing hanged test process 29558
> 2018-11-13 01:18:16,663 runner           L0061 ERROR| Runner error occurred: Timeout reached
> Original status: ERROR
> {'whiteboard': '', 'class_name': 'PoolSvc', 'time_end': -1, 'paused': False, 'time_start': -1, 'early_status': True, 'basedir': '/var/lib/jenkins/workspace/daos-stack-org_daos_PR-13-centos7/src/tests/ftest/pool', 'time_elapsed': 1801.2555010318756, 'params': <AvocadoParams *: * ([]),0: /run/* (['/run/hosts/test_machine1/', '/run/server/', '/run/createtests/createmode/mode_RW/', '/run/createtests/createset/', '/run/createtests/createsize/', '/run/createtests/createsvc/svc3/'])>, 'job_logdir': '/var/lib/jenkins/workspace/daos-stack-org_daos_PR-13-centos7/src/tests/ftest/avocado/job-results/job-2018-11-13T00.47-df25f8c', 'job_unique_id': 'df25f8c77b77367c02120479c795140a2caa59bf', 'srcdir': '/var/tmp/avocado_9uWWRj/3-_var_lib_jenkins_workspace_daos-stack-org_daos_PR-13-centos7_src_tests_ftest_pool_PoolSvc.py_PoolSvc.test_poolsvc_mode_RW-createset-createsize-svc3-test_machine1-server-aa67/src', 'logdir': '/var/lib/jenkins/workspace/daos-stack-org_daos_PR-13-centos7/src/tests/ftest/avocad
>  o/job-results/job-2018-11-13T00.47-df25f8c/test-results/3-_var_lib_jenkins_workspace_daos-stack-org_daos_PR-13-centos7_src_tests_ftest_pool_PoolSvc.py:PoolSvc.test_poolsvc;mode_RW-createset-createsize-svc3-test_machine1-server-aa67', 'status': 'ERROR', 'workdir': '/var/tmp/avocado_9uWWRj/3-_var_lib_jenkins_workspace_daos-stack-org_daos_PR-13-centos7_src_tests_ftest_pool_PoolSvc.py_PoolSvc.test_poolsvc_mode_RW-createset-createsize-svc3-test_machine1-server-aa67', 'text_output': "00:48:14 DEBUG| PARAMS (key=server_group, path=/server/, default=daos_server) => 'daos_server'\n00:48:14 DEBUG| PARAMS (key=test_machines, path=/run/hosts/*, default=None) => ['wolf-53vm2', 'wolf-53vm3', 'wolf-53vm4', 'wolf-53vm5']\n00:48:14 DEBUG| Host file is: /var/lib/jenkins/workspace/daos-stack-org_daos_PR-13-centos7/install/tmp/hostfile91180\n00:48:14 DEBUG| Start CMD>>>>/var/lib/jenkins/workspace/daos-stack-org_daos_PR-13-centos7/install/bin/orterun --np 4 --hostfile /var/lib/jenkins/workspace/daos-sta
>  ck-org_daos_PR-13-centos7/install/tmp/hostfile91180 --enable-recovery -x DD_SUBSYS='' -x OFI_INTERFACE='eth0' -x DD_MASK='all' -x OFI_PORT='23350' -x CRT_ATTACH_INFO_PATH='/var/lib/jenkins/workspace/daos-stack-org_daos_PR-13-centos7/install/tmp' -x D_LOG_MASK='DEBUG,RPC=ERR,MEM=ERR' -x DAOS_SINGLETON_CLI='1' -x ABT_ENV_MAX_NUM_XSTREAMS='64' -x DD_LOG='/tmp/Functional_quick/daos.log' -x CRT_PHY_ADDR_STR='ofi+sockets' -x D_LOG_FILE='/tmp/Functional_quick/daos.log' -x ABT_MAX_NUM_XSTREAMS='64' -x CRT_CTX_SHARE_ADDR='1' /var/lib/jenkins/workspace/daos-stack-org_daos_PR-13-centos7/install/bin/daos_server -g daos_server -c 1  -a /var/lib/jenkins/workspace/daos-stack-org_daos_PR-13-centos7/install/tmp/\n00:48:18 DEBUG| <SERVER>: Starting Servers\n00:48:18 DEBUG| 2018/11/13 00:48:01 DAOS config read from /var/lib/jenkins/workspace/daos-stack-org_daos_PR-13-centos7/install/etc/daos_server.yml\n00:48:18 DEBUG| 2018/11/13 00:48:00 DAOS config read from /var/lib/jenkins/workspace/daos-stack-org
>  _daos_PR-13-centos7/install/etc/daos_server.yml\n00:48:18 DEBUG| 2018/11/13 00:48:03 DAOS config read from /var/lib/jenkins/workspace/daos-stack-org_daos_PR-13-centos7/install/etc/daos_server.yml\n00:48:18 DEBUG| 2018/11/13 00:47:59 DAOS config read from /var/lib/jenkins/workspace/daos-stack-org_daos_PR-13-centos7/install/etc/daos_server.yml\n00:48:18 DEBUG| 2018/11/13 00:48:03 Active config saved to /var/lib/jenkins/workspace/daos-stack-org_daos_PR-13-centos7/install/etc/.daos_server.active.yml (read-only)\n00:48:18 DEBUG| 2018/11/13 00:48:03 Management interface listening on: 0.0.0.0:10000\n00:48:18 DEBUG| 2018/11/13 00:48:00 Active config saved to /var/lib/jenkins/workspace/daos-stack-org_daos_PR-13-centos7/install/etc/.daos_server.active.yml (read-only)\n00:48:18 DEBUG| 2018/11/13 00:48:00 Management interface listening on: 0.0.0.0:10000\n00:48:18 DEBUG| 2018/11/13 00:48:01 Active config saved to /var/lib/jenkins/workspace/daos-stack-org_daos_PR-13-centos7/install/etc/.daos_serv
>  er.active.yml (read-only)\n00:48:18 DEBUG| 2018/11/13 00:48:01 Management interface listening on: 0.0.0.0:10000\n00:48:18 DEBUG| 2018/11/13 00:47:59 Active config saved to /var/lib/jenkins/workspace/daos-stack-org_daos_PR-13-centos7/install/etc/.daos_server.active.yml (read-only)\n00:48:18 DEBUG| 2018/11/13 00:47:59 Management interface listening on: 0.0.0.0:10000\n00:48:18 DEBUG| DAOS server (v0.0.2) process 17338 started on rank 1 (out of 4) with 1 xstream(s)\n00:48:18 DEBUG| DAOS server (v0.0.2) process 25654 started on rank 2 (out of 4) with 1 xstream(s)\n00:48:18 DEBUG| DAOS server (v0.0.2) process 5135 started on rank 3 (out of 4) with 1 xstream(s)\n00:48:18 DEBUG| DAOS server (v0.0.2) process 8903 started on rank 0 (out of 4) with 1 xstream(s)\n00:48:18 DEBUG| \n00:48:18 DEBUG| <SERVER> server started and took 3.10468387604 seconds to start\n00:48:23 DEBUG| PARAMS (key=mode, path=/run/createtests/createmode/*/, default=None) => 146\n00:48:23 DEBUG| PARAMS (key=setname, path=/
>  run/createtests/createset/, default=None) => 'daos_server'\n00:48:23 DEBUG| PARAMS (key=size, path=/run/createtests/createsize/, default=None) => 1073741\n00:48:23 DEBUG| PARAMS (key=svc, path=/run/createtests/createsvc/*/, default=None) => [3, 'PASS']\n00:48:31 INFO | Running '/var/lib/jenkins/workspace/daos-stack-org_daos_PR-13-centos7/install/bin/daosctl kill-leader  --uuid=7E461A34-F049-4925-847A-5DA1D9D35530'\n00:48:32 INFO | Command '/var/lib/jenkins/workspace/daos-stack-org_daos_PR-13-centos7/install/bin/daosctl kill-leader  --uuid=7E461A34-F049-4925-847A-5DA1D9D35530' finished with 0 after 0.216053962708s\n00:48:32 DEBUG| [stdout] SUCCESS killing pool service leader\n", 'running': False, 'logfile': '/var/lib/jenkins/workspace/daos-stack-org_daos_PR-13-centos7/src/tests/ftest/avocado/job-results/job-2018-11-13T00.47-df25f8c/test-results/3-_var_lib_jenkins_workspace_daos-stack-org_daos_PR-13-centos7_src_tests_ftest_pool_PoolSvc.py:PoolSvc.test_poolsvc;mode_RW-createset-creates
>  ize-svc3-test_machine1-server-aa67/debug.log', 'fail_reason': 'Test died without reporting the status.', 'name': '3-/var/lib/jenkins/workspace/daos-stack-org_daos_PR-13-centos7/src/tests/ftest/pool/PoolSvc.py:PoolSvc.test_poolsvc;mode_RW-createset-createsize-svc3-test_machine1-server-aa67', 'traceback': 'Traceback not available', 'fail_class': 'type', 'timeout': 1800, 'paused_msg': ''}
> ```
> 
> Without any of the timeout stack trace nor any evidence that the test's
> tearDown() was run.  Shouldn't I see the sigterm_handler stack trace at
> minimum?
> 

For a really long time I had been neglecting such issues in my CI because it was not critical for me. I suspect the problem is the very short timeout used when we interrupt a test: it's just 1s from the moment we send the interrupt, while it is usually 10s from the time the test reports its status. Simple tests handle the SIGTERM and finish within that window, but tests that utilize block devices extensively might not, resulting either in the test process not even getting to handle the SIGTERM in time (no status reported at all) or in it reporting the status but not finishing in time (process not finished).
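The "reported status but did not finish" mode can be reproduced in miniature. This is a hypothetical sketch, not Avocado's actual runner code: the function names are illustrative, and the 1-second grace period stands in for the real 1s/10s values. The child raises on SIGTERM (like Avocado's sigterm_handler), reports a status, and then does a slow cleanup; the parent kills it if the cleanup overruns the grace period.

```python
import multiprocessing
import queue
import signal
import time

def fake_test(status_q, cleanup_time):
    """Child 'test' process: raises on SIGTERM (like Avocado's
    sigterm_handler), reports a status, then runs a slow cleanup."""
    def sigterm_handler(signum, frame):
        raise RuntimeError("Test interrupted by SIGTERM")
    signal.signal(signal.SIGTERM, sigterm_handler)
    try:
        time.sleep(30)                  # test body, longer than any timeout
    except RuntimeError as exc:
        status_q.put(str(exc))          # "report status" to the runner
        time.sleep(cleanup_time)        # stand-in for tearDown / cleanup

def run_with_grace(cleanup_time, grace=1.0):
    """Parent 'runner': SIGTERM the test, wait `grace` seconds, then SIGKILL."""
    status_q = multiprocessing.Queue()
    proc = multiprocessing.Process(target=fake_test,
                                   args=(status_q, cleanup_time))
    proc.start()
    time.sleep(1.0)                     # let the child install its handler
    proc.terminate()                    # "timeout reached" -> SIGTERM
    proc.join(grace)                    # the short grace period
    killed = proc.is_alive()
    if killed:
        proc.kill()                     # grace expired -> SIGKILL, cleanup lost
        proc.join()
    try:
        status = status_q.get(timeout=1)
    except queue.Empty:
        status = None
    return status, killed

if __name__ == "__main__":
    # Cleanup fits in the grace period: status reported, process exits cleanly.
    print(run_with_grace(cleanup_time=0.1))
    # Cleanup overruns it: status was reported, but the process is SIGKILLed
    # before it finishes -- "Test reported status but did not finish".
    print(run_with_grace(cleanup_time=5))
```

The other failure mode (empty status, "Test died without reporting the status."), as in the PoolSvc run above, corresponds to the SIGTERM handling itself being delayed long enough that the process is killed before it can even put a status on the queue.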

As I mentioned in the `timeout during setUp` thread, I'd suggest increasing that 1s timeout, ideally reusing the standard timeouts, since a failed test may actually need even more time to post-process.

Sorry for not looking into this earlier; I was waiting for someone else to report it...

Regards,
Lukáš

> Cheers,
> b.
> 
> 
> 

