[Avocado-devel] no sigterm_handler stack trace on test timeout?
Lukáš Doktor
ldoktor at redhat.com
Wed Nov 14 12:03:14 UTC 2018
Dne 13. 11. 18 v 16:22 Brian J. Murrell napsal(a):
> When I run a sample test:
>
> ```
> import time
> from avocado import Test
>
> class TimeoutTest(Test):
>
>     timeout = 3
>
>     def setUp(self):
>         print "setUp"
>
>     def tearDown(self):
>         print "tearDown"
>
>     def test(self):
>         try:
>             sleep_time = self.params.get('sleep_time', default=5)
>             self.log.info('Sleeping for %.2f seconds (%.2f more than the timeout)',
>                           sleep_time, sleep_time - self.timeout)
>             time.sleep(sleep_time)
>         except ValueError as e:
>             print e
> ```
>
> to demonstrate what a timeout looks like I get:
>
> ```
> 2018-11-13 10:02:55,588 varianter L0112 DEBUG| PARAMS (key=timeout, path=*, default=3) => 3
> 2018-11-13 10:02:55,588 test L0240 INFO | START 1-timeouttest.py:TimeoutTest.test
> 2018-11-13 10:02:55,725 output L0655 DEBUG| setUp
> 2018-11-13 10:02:55,726 varianter L0112 DEBUG| PARAMS (key=sleep_time, path=*, default=5) => 5
> 2018-11-13 10:02:55,726 timeouttest L0032 INFO | Sleeping for 5.00 seconds (2.00 more than the timeout)
> 2018-11-13 10:02:58,623 stacktrace L0039 ERROR|
> 2018-11-13 10:02:58,624 stacktrace L0042 ERROR| Reproduced traceback from: /usr/lib/python2.7/site-packages/avocado/core/test.py:596
> 2018-11-13 10:02:58,624 stacktrace L0045 ERROR| Traceback (most recent call last):
> 2018-11-13 10:02:58,625 stacktrace L0045 ERROR| File "/home/brian/daos/daos/src/tests/ftest/timeouttest.py", line 33, in test
> 2018-11-13 10:02:58,625 stacktrace L0045 ERROR| time.sleep(sleep_time)
> 2018-11-13 10:02:58,625 stacktrace L0045 ERROR| File "/usr/lib/python2.7/site-packages/avocado/core/runner.py", line 312, in sigterm_handler
> 2018-11-13 10:02:58,625 stacktrace L0045 ERROR| raise RuntimeError("Test interrupted by SIGTERM")
> 2018-11-13 10:02:58,626 stacktrace L0045 ERROR| RuntimeError: Test interrupted by SIGTERM
> 2018-11-13 10:02:58,626 stacktrace L0046 ERROR|
> 2018-11-13 10:02:58,626 test L0601 DEBUG| Local variables:
> 2018-11-13 10:02:58,663 test L0604 DEBUG| -> self <class 'timeouttest.TimeoutTest'>: 1-timeouttest.py:TimeoutTest.test
> 2018-11-13 10:02:58,663 test L0604 DEBUG| -> sleep_time <type 'int'>: 5
> 2018-11-13 10:02:58,663 output L0655 DEBUG| tearDown
> 2018-11-13 10:02:58,664 test L0729 ERROR| Traceback (most recent call last):
>
> 2018-11-13 10:02:58,664 test L0729 ERROR| File "/usr/lib/python2.7/site-packages/avocado/core/test.py", line 667, in _run_avocado
>     raise test_exception
>
> 2018-11-13 10:02:58,664 test L0729 ERROR| RuntimeError: Test interrupted by SIGTERM
>
> 2018-11-13 10:02:58,664 test L0744 ERROR| ERROR 1-timeouttest.py:TimeoutTest.test -> RuntimeError: Test interrupted by SIGTERM
> 2018-11-13 10:02:58,664 test L0733 INFO |
> 2018-11-13 10:02:58,674 runner L0239 DEBUG| Original status: {'traceback': 'Traceback (most recent call last):\n File "/usr/lib/python2.7/site-packages/avocado/core/test.py", line 667, in _run_avocado\n raise test_exception\nRuntimeError: Test interrupted by SIGTERM\n', 'class_name': 'TimeoutTest', 'time_end': 1542121378.664672, 'paused': False, 'time_start': 1542121375.724386, 'basedir': '/home/brian/daos/daos/src/tests/ftest', 'time_elapsed': 2.9402859210968018, 'params': <AvocadoParams *: * ([]),0: / (['//']),1: r ([]),2: u ([]),3: n ([])>, 'job_logdir': '/home/brian/daos/daos/src/tests/ftest/avocado/job-results/job-2018-11-13T10.02-2629ff0', 'job_unique_id': '2629ff06bbecdbf6e3c6f49019d72fad50fdd190', 'srcdir': '/var/tmp/avocado_ncd6mf/1-timeouttest.py_TimeoutTest.test/src', 'logdir': '/home/brian/daos/daos/src/tests/ftest/avocado/job-results/job-2018-11-13T10.02-2629ff0/test-results/1-timeouttest.py:TimeoutTest.test', 'status': 'ERROR', 'workdir': '/var/tmp/avocado
> _ncd6mf/1-timeouttest.py_TimeoutTest.test', 'running': False, 'logfile': '/home/brian/daos/daos/src/tests/ftest/avocado/job-results/job-2018-11-13T10.02-2629ff0/test-results/1-timeouttest.py:TimeoutTest.test/debug.log', 'fail_reason': 'Test interrupted by SIGTERM', 'name': '1-timeouttest.py:TimeoutTest.test', 'whiteboard': '', 'fail_class': 'RuntimeError', 'timeout': 3, 'paused_msg': ''}
> 2018-11-13 10:02:58,674 runner L0253 ERROR| ERROR Test reported status but did not finish -> TestAbortedError: 1-timeouttest.py:TimeoutTest.test.
> 2018-11-13 10:02:58,674 runner L0061 ERROR| Runner error occurred: Timeout reached
> Original status: ERROR
> {'whiteboard': '', 'class_name': 'TimeoutTest', 'time_end': -1, 'paused': False, 'time_start': -1, 'early_status': True, 'basedir': '/home/brian/daos/daos/src/tests/ftest', 'time_elapsed': 3.099815845489502, 'params': <AvocadoParams *: * ([]),0: / (['//']),1: r ([]),2: u ([]),3: n ([])>, 'job_logdir': '/home/brian/daos/daos/src/tests/ftest/avocado/job-results/job-2018-11-13T10.02-2629ff0', 'job_unique_id': '2629ff06bbecdbf6e3c6f49019d72fad50fdd190', 'srcdir': '/var/tmp/avocado_ncd6mf/1-timeouttest.py_TimeoutTest.test/src', 'logdir': '/home/brian/daos/daos/src/tests/ftest/avocado/job-results/job-2018-11-13T10.02-2629ff0/test-results/1-timeouttest.py:TimeoutTest.test', 'status': 'ERROR', 'workdir': '/var/tmp/avocado_ncd6mf/1-timeouttest.py_TimeoutTest.test', 'text_output': '10:02:55 DEBUG| setUp\n10:02:55 DEBUG| PARAMS (key=sleep_time, path=*, default=5) => 5\n10:02:55 INFO | Sleeping for 5.00 seconds (2.00 more than the timeout)\n10:02:58 ERROR| \n10:02:58 ERROR| Reproduced traceback
> from: /usr/lib/python2.7/site-packages/avocado/core/test.py:596\n10:02:58 ERROR| Traceback (most recent call last):\n10:02:58 ERROR| File "/home/brian/daos/daos/src/tests/ftest/timeouttest.py", line 33, in test\n10:02:58 ERROR| time.sleep(sleep_time)\n10:02:58 ERROR| File "/usr/lib/python2.7/site-packages/avocado/core/runner.py", line 312, in sigterm_handler\n10:02:58 ERROR| raise RuntimeError("Test interrupted by SIGTERM")\n10:02:58 ERROR| RuntimeError: Test interrupted by SIGTERM\n10:02:58 ERROR| \n10:02:58 DEBUG| Local variables:\n10:02:58 DEBUG| -> self <class \'timeouttest.TimeoutTest\'>: 1-timeouttest.py:TimeoutTest.test\n10:02:58 DEBUG| -> sleep_time <type \'int\'>: 5\n10:02:58 DEBUG| tearDown\n10:02:58 ERROR| Traceback (most recent call last):\n\n10:02:58 ERROR| File "/usr/lib/python2.7/site-packages/avocado/core/test.py", line 667, in _run_avocado\n raise test_exception\n\n10:02:58 ERROR| RuntimeError: Test interrupted by SIGTERM\n\n10:02:58 ERROR| ERROR
> 1-timeouttest.py:TimeoutTest.test -> RuntimeError: Test interrupted by SIGTERM\n10:02:58 INFO | \n', 'running': False, 'logfile': '/home/brian/daos/daos/src/tests/ftest/avocado/job-results/job-2018-11-13T10.02-2629ff0/test-results/1-timeouttest.py:TimeoutTest.test/debug.log', 'fail_reason': 'Test reported status but did not finish', 'name': '1-timeouttest.py:TimeoutTest.test', 'traceback': 'Traceback not available', 'fail_class': 'type', 'timeout': 3, 'paused_msg': ''}
> ```
>
> Observing the traceback that comes from sigterm_handler, I assume that
> every test that times out should have this. Is this a correct
> assumption?
>
Yes, this is correct.
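For context, the mechanism behind that traceback can be sketched in a few lines. This is a minimal stand-alone mirror of what `runner.sigterm_handler` does, not avocado's actual code; the function names here are illustrative:

```python
import os
import signal

def sigterm_handler(signum, frame):
    # Turning the signal into an exception is what produces the
    # "Reproduced traceback" inside the test's own frames: the
    # RuntimeError surfaces at whatever line the test was executing
    # (here, the os.kill call; in the example above, time.sleep).
    raise RuntimeError("Test interrupted by SIGTERM")

def interrupted_run():
    """Simulate a test body being interrupted by the runner's SIGTERM."""
    signal.signal(signal.SIGTERM, sigterm_handler)
    try:
        os.kill(os.getpid(), signal.SIGTERM)  # stand-in for the runner's kill
    except RuntimeError as exc:
        return str(exc)

print(interrupted_run())
```

So whenever the test process is still alive and able to run Python code when the SIGTERM arrives, you should see this traceback.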
> However during a test that ran in our testsuite last night I see:
>
> ```
> 2018-11-13 00:48:14,478 varianter L0112 DEBUG| PARAMS (key=timeout, path=*, default=None) => None
> 2018-11-13 00:48:14,478 test L0240 INFO | START 3-/var/lib/jenkins/workspace/daos-stack-org_daos_PR-13-centos7/src/tests/ftest/pool/PoolSvc.py:PoolSvc.test_poolsvc;mode_RW-createset-createsize-svc3-test_machine1-server-aa67
> 2018-11-13 00:48:14,618 varianter L0112 DEBUG| PARAMS (key=server_group, path=/server/, default=daos_server) => 'daos_server'
> 2018-11-13 00:48:14,632 varianter L0112 DEBUG| PARAMS (key=test_machines, path=/run/hosts/*, default=None) => ['wolf-53vm2', 'wolf-53vm3', 'wolf-53vm4', 'wolf-53vm5']
> ...
> 2018-11-13 00:48:23,420 varianter L0112 DEBUG| PARAMS (key=mode, path=/run/createtests/createmode/*/, default=None) => 146
> 2018-11-13 00:48:23,420 varianter L0112 DEBUG| PARAMS (key=setname, path=/run/createtests/createset/, default=None) => 'daos_server'
> 2018-11-13 00:48:23,421 varianter L0112 DEBUG| PARAMS (key=size, path=/run/createtests/createsize/, default=None) => 1073741
> 2018-11-13 00:48:23,421 varianter L0112 DEBUG| PARAMS (key=svc, path=/run/createtests/createsvc/*/, default=None) => [3, 'PASS']
> 2018-11-13 00:48:31,701 process L0389 INFO | Running '/var/lib/jenkins/workspace/daos-stack-org_daos_PR-13-centos7/install/bin/daosctl kill-leader --uuid=7E461A34-F049-4925-847A-5DA1D9D35530'
> 2018-11-13 00:48:32,095 process L0499 INFO | Command '/var/lib/jenkins/workspace/daos-stack-org_daos_PR-13-centos7/install/bin/daosctl kill-leader --uuid=7E461A34-F049-4925-847A-5DA1D9D35530' finished with 0 after 0.216053962708s
> 2018-11-13 00:48:32,104 process L0488 DEBUG| [stdout] SUCCESS killing pool service leader
> 2018-11-13 01:18:15,511 runner L0239 DEBUG| Original status: {}
> 2018-11-13 01:18:15,555 runner L0253 ERROR| ERROR Test died without reporting the status. -> TestAbortedError: 3-/var/lib/jenkins/workspace/daos-stack-org_daos_PR-13-centos7/src/tests/ftest/pool/PoolSvc.py:PoolSvc.test_poolsvc;mode_RW-createset-createsize-svc3-test_machine1-server-aa67.
> 2018-11-13 01:18:15,555 runner L0255 WARNI| Killing hanged test process 29558
> 2018-11-13 01:18:16,663 runner L0061 ERROR| Runner error occurred: Timeout reached
> Original status: ERROR
> {'whiteboard': '', 'class_name': 'PoolSvc', 'time_end': -1, 'paused': False, 'time_start': -1, 'early_status': True, 'basedir': '/var/lib/jenkins/workspace/daos-stack-org_daos_PR-13-centos7/src/tests/ftest/pool', 'time_elapsed': 1801.2555010318756, 'params': <AvocadoParams *: * ([]),0: /run/* (['/run/hosts/test_machine1/', '/run/server/', '/run/createtests/createmode/mode_RW/', '/run/createtests/createset/', '/run/createtests/createsize/', '/run/createtests/createsvc/svc3/'])>, 'job_logdir': '/var/lib/jenkins/workspace/daos-stack-org_daos_PR-13-centos7/src/tests/ftest/avocado/job-results/job-2018-11-13T00.47-df25f8c', 'job_unique_id': 'df25f8c77b77367c02120479c795140a2caa59bf', 'srcdir': '/var/tmp/avocado_9uWWRj/3-_var_lib_jenkins_workspace_daos-stack-org_daos_PR-13-centos7_src_tests_ftest_pool_PoolSvc.py_PoolSvc.test_poolsvc_mode_RW-createset-createsize-svc3-test_machine1-server-aa67/src', 'logdir': '/var/lib/jenkins/workspace/daos-stack-org_daos_PR-13-centos7/src/tests/ftest/avocad
> o/job-results/job-2018-11-13T00.47-df25f8c/test-results/3-_var_lib_jenkins_workspace_daos-stack-org_daos_PR-13-centos7_src_tests_ftest_pool_PoolSvc.py:PoolSvc.test_poolsvc;mode_RW-createset-createsize-svc3-test_machine1-server-aa67', 'status': 'ERROR', 'workdir': '/var/tmp/avocado_9uWWRj/3-_var_lib_jenkins_workspace_daos-stack-org_daos_PR-13-centos7_src_tests_ftest_pool_PoolSvc.py_PoolSvc.test_poolsvc_mode_RW-createset-createsize-svc3-test_machine1-server-aa67', 'text_output': "00:48:14 DEBUG| PARAMS (key=server_group, path=/server/, default=daos_server) => 'daos_server'\n00:48:14 DEBUG| PARAMS (key=test_machines, path=/run/hosts/*, default=None) => ['wolf-53vm2', 'wolf-53vm3', 'wolf-53vm4', 'wolf-53vm5']\n00:48:14 DEBUG| Host file is: /var/lib/jenkins/workspace/daos-stack-org_daos_PR-13-centos7/install/tmp/hostfile91180\n00:48:14 DEBUG| Start CMD>>>>/var/lib/jenkins/workspace/daos-stack-org_daos_PR-13-centos7/install/bin/orterun --np 4 --hostfile /var/lib/jenkins/workspace/daos-sta
> ck-org_daos_PR-13-centos7/install/tmp/hostfile91180 --enable-recovery -x DD_SUBSYS='' -x OFI_INTERFACE='eth0' -x DD_MASK='all' -x OFI_PORT='23350' -x CRT_ATTACH_INFO_PATH='/var/lib/jenkins/workspace/daos-stack-org_daos_PR-13-centos7/install/tmp' -x D_LOG_MASK='DEBUG,RPC=ERR,MEM=ERR' -x DAOS_SINGLETON_CLI='1' -x ABT_ENV_MAX_NUM_XSTREAMS='64' -x DD_LOG='/tmp/Functional_quick/daos.log' -x CRT_PHY_ADDR_STR='ofi+sockets' -x D_LOG_FILE='/tmp/Functional_quick/daos.log' -x ABT_MAX_NUM_XSTREAMS='64' -x CRT_CTX_SHARE_ADDR='1' /var/lib/jenkins/workspace/daos-stack-org_daos_PR-13-centos7/install/bin/daos_server -g daos_server -c 1 -a /var/lib/jenkins/workspace/daos-stack-org_daos_PR-13-centos7/install/tmp/\n00:48:18 DEBUG| <SERVER>: Starting Servers\n00:48:18 DEBUG| 2018/11/13 00:48:01 DAOS config read from /var/lib/jenkins/workspace/daos-stack-org_daos_PR-13-centos7/install/etc/daos_server.yml\n00:48:18 DEBUG| 2018/11/13 00:48:00 DAOS config read from /var/lib/jenkins/workspace/daos-stack-org
> _daos_PR-13-centos7/install/etc/daos_server.yml\n00:48:18 DEBUG| 2018/11/13 00:48:03 DAOS config read from /var/lib/jenkins/workspace/daos-stack-org_daos_PR-13-centos7/install/etc/daos_server.yml\n00:48:18 DEBUG| 2018/11/13 00:47:59 DAOS config read from /var/lib/jenkins/workspace/daos-stack-org_daos_PR-13-centos7/install/etc/daos_server.yml\n00:48:18 DEBUG| 2018/11/13 00:48:03 Active config saved to /var/lib/jenkins/workspace/daos-stack-org_daos_PR-13-centos7/install/etc/.daos_server.active.yml (read-only)\n00:48:18 DEBUG| 2018/11/13 00:48:03 Management interface listening on: 0.0.0.0:10000\n00:48:18 DEBUG| 2018/11/13 00:48:00 Active config saved to /var/lib/jenkins/workspace/daos-stack-org_daos_PR-13-centos7/install/etc/.daos_server.active.yml (read-only)\n00:48:18 DEBUG| 2018/11/13 00:48:00 Management interface listening on: 0.0.0.0:10000\n00:48:18 DEBUG| 2018/11/13 00:48:01 Active config saved to /var/lib/jenkins/workspace/daos-stack-org_daos_PR-13-centos7/install/etc/.daos_serv
> er.active.yml (read-only)\n00:48:18 DEBUG| 2018/11/13 00:48:01 Management interface listening on: 0.0.0.0:10000\n00:48:18 DEBUG| 2018/11/13 00:47:59 Active config saved to /var/lib/jenkins/workspace/daos-stack-org_daos_PR-13-centos7/install/etc/.daos_server.active.yml (read-only)\n00:48:18 DEBUG| 2018/11/13 00:47:59 Management interface listening on: 0.0.0.0:10000\n00:48:18 DEBUG| DAOS server (v0.0.2) process 17338 started on rank 1 (out of 4) with 1 xstream(s)\n00:48:18 DEBUG| DAOS server (v0.0.2) process 25654 started on rank 2 (out of 4) with 1 xstream(s)\n00:48:18 DEBUG| DAOS server (v0.0.2) process 5135 started on rank 3 (out of 4) with 1 xstream(s)\n00:48:18 DEBUG| DAOS server (v0.0.2) process 8903 started on rank 0 (out of 4) with 1 xstream(s)\n00:48:18 DEBUG| \n00:48:18 DEBUG| <SERVER> server started and took 3.10468387604 seconds to start\n00:48:23 DEBUG| PARAMS (key=mode, path=/run/createtests/createmode/*/, default=None) => 146\n00:48:23 DEBUG| PARAMS (key=setname, path=/
> run/createtests/createset/, default=None) => 'daos_server'\n00:48:23 DEBUG| PARAMS (key=size, path=/run/createtests/createsize/, default=None) => 1073741\n00:48:23 DEBUG| PARAMS (key=svc, path=/run/createtests/createsvc/*/, default=None) => [3, 'PASS']\n00:48:31 INFO | Running '/var/lib/jenkins/workspace/daos-stack-org_daos_PR-13-centos7/install/bin/daosctl kill-leader --uuid=7E461A34-F049-4925-847A-5DA1D9D35530'\n00:48:32 INFO | Command '/var/lib/jenkins/workspace/daos-stack-org_daos_PR-13-centos7/install/bin/daosctl kill-leader --uuid=7E461A34-F049-4925-847A-5DA1D9D35530' finished with 0 after 0.216053962708s\n00:48:32 DEBUG| [stdout] SUCCESS killing pool service leader\n", 'running': False, 'logfile': '/var/lib/jenkins/workspace/daos-stack-org_daos_PR-13-centos7/src/tests/ftest/avocado/job-results/job-2018-11-13T00.47-df25f8c/test-results/3-_var_lib_jenkins_workspace_daos-stack-org_daos_PR-13-centos7_src_tests_ftest_pool_PoolSvc.py:PoolSvc.test_poolsvc;mode_RW-createset-creates
> ize-svc3-test_machine1-server-aa67/debug.log', 'fail_reason': 'Test died without reporting the status.', 'name': '3-/var/lib/jenkins/workspace/daos-stack-org_daos_PR-13-centos7/src/tests/ftest/pool/PoolSvc.py:PoolSvc.test_poolsvc;mode_RW-createset-createsize-svc3-test_machine1-server-aa67', 'traceback': 'Traceback not available', 'fail_class': 'type', 'timeout': 1800, 'paused_msg': ''}
> ```
>
> Without any of the timeout stack trace nor any evidence that the test's
> tearDown() was run. Shouldn't I see the sigterm_handler stack trace at
> minimum?
>
For a really long time I had been neglecting such issues in my CI because it was not critical for me. I suspect the problem is the very short timeout used when we interrupt a test: it is just 1s from the moment we interrupt the test, while usually it is 10s from the moment the test reports its status. Simple tests do the handling and finish in time, but tests that utilize block devices extensively might not, resulting either in the test process not even getting to handle the SIGTERM in time (no status reported) or in it reporting the status but not finishing in time (process not finished).
As I mentioned in `timeout during setUp`, I'd suggest increasing the 1s timeout, ideally using the standard timeouts, as a failed test may actually need even more time to post-process.
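To make the failure mode concrete, here is a hypothetical sketch of the runner-side kill sequence (illustrative only, not avocado's actual implementation): SIGTERM first, then SIGKILL if the test process has not exited within a grace period. With a 1s grace, a test doing slow cleanup can be SIGKILLed before its handler writes the traceback or the final status, which matches the second log above:

```python
import os
import signal
import time

def terminate_with_grace(pid, grace=10.0):
    """Send SIGTERM, then SIGKILL only if `pid` has not exited within
    `grace` seconds.  A process killed by the fallback SIGKILL never gets
    to run its Python-level signal handler, so no traceback is logged."""
    os.kill(pid, signal.SIGTERM)
    deadline = time.time() + grace
    while time.time() < deadline:
        done, _ = os.waitpid(pid, os.WNOHANG)
        if done == pid:
            return "exited cleanly"
        time.sleep(0.05)
    os.kill(pid, signal.SIGKILL)
    os.waitpid(pid, 0)
    return "killed"

# Demo: the default SIGTERM disposition terminates the child well within
# the grace window.  A child stuck in slow cleanup for longer than `grace`
# would instead be SIGKILLed, leaving no traceback and no status.
pid = os.fork()
if pid == 0:
    time.sleep(60)      # child: pretend to be a long-running test
    os._exit(0)
print(terminate_with_grace(pid, grace=5.0))
```

The demo prints "exited cleanly"; shrinking `grace` below the time the child needs to shut down would flip the result to "killed".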
Sorry for not looking at this earlier; I was waiting for someone else to report it...
Regards,
Lukáš
> Cheers,
> b.
>
>
>