[PATCH 59] make sure SINGLESTEP doesn't miss SYSCALL_EXIT

Wed Oct 7 06:34:33 UTC 2009

Sorry I'm so long in following up on this.  
I know you've hacked a lot more since.

> The vanilla kernel relies on syscall_trace_leave() which does
> send_sigtrap(TRAP_BRKPT). But with utrace-ptrace the tracee sleeps
> in do_notify_resume() path, syscall_trace_leave() won't be called.

It's even weirder than this, in nonobvious ways.  I'm not even sure I'm
saying anything contrary to what you've coded for, but I want to be clear
and explicit about everything.  I'm concerned we might accidentally code to
quirks of the x86 implementation details where we should instead make the
arch tracehook calls' interface contracts more clear and make sure
everything we do in generic utrace and ptrace layers jibes in theory and
not just in observed x86 behavior.

What happens today in the vanilla x86 kernel is this:

1. user_enable_single_step -> TIF_SINGLESTEP done to user state before a
   syscall insn makes sure (if nothing else) we'll hit syscall_trace_leave
   after sys_foo returns.
1b. If done to a thread state of "in the syscall" as in syscall-entry,
    clone, or exec report, that too makes sure we get there even if we
    hadn't been single-stepping into the syscall insn.
2. syscall_trace_leave -> tracehook_report_syscall_exit ->... ptrace_stop
3a. user_enable_single_step -> TIF_SINGLESTEP (was already and still is set)
3b. user_disable_single_step (PTRACE_CONT et al) -> clear TIF_SINGLESTEP
4. tracehook_report_syscall_exit returns to syscall_trace_leave ->
5a. (if 3a) test_thread_flag(TIF_SINGLESTEP)=>1 -> send_sigtrap
5b. (if 3b) test_thread_flag(TIF_SINGLESTEP)=>0 -> no signal, no stop
6. (if 5a) do_notify_resume -> dequeue SIGTRAP, ptrace_signal->ptrace_stop

I found it nonobvious that what we really do is ptrace_stop inside
tracehook_report_syscall_exit and then upon resumption look afresh
at TIF_SINGLESTEP to determine whether we really send a SIGTRAP.

Note that #6 could always be preceded e.g. by a dequeue of SIGHUP,
ptrace_stop for that, handler setup for that with its own ptrace_notify()
stop--not really a signal and so not touching the queued SIGTRAP--inside
tracehook_signal_handler (if used PTRACE_SINGLESTEP,SIGHUP to enter the
handler), etc., so the "stepped" state finally observed at the signal stop
for that SIGTRAP could be wildly different from the post-syscall state.

In comparison, on powerpc it is the same up to #2 and then
do_syscall_trace_leave does:

	step = test_thread_flag(TIF_SINGLESTEP);
	if (step || test_thread_flag(TIF_SYSCALL_TRACE))
		tracehook_report_syscall_exit(regs, step);

And that's it.  (x86 in #2 always calls it with step=0.)

Vanilla tracehook_report_syscall_exit in fact ignores the step flag, though
I thought at some point before it had a "ptrace_notify(SIGTRAP)" there.

So on powerpc, PTRACE_SINGLESTEP done from the syscall-exit stop will not
give an immediate (fake) single-step trap at the immediate post-syscall
insn, but just single-step the next one insn (I think).  I'm sure nobody
intended this to be an arch difference, and I'm not sure how much we are
really stuck with either choice in particular for application compatibility
on one arch or the other.

There is no particular reason for one arch to queue a SIGTRAP and another
to use tracehook_report_syscall_exit for single-step cases, just the taste
of each arch maintainer.  IIRC there were even different opinions about
whether PTRACE_SINGLESTEP producing a SYSGOOD-style SIGTRAP|0x80 when you
happened to step over a syscall insn was a helpful feature giving one more
bit of information or an unhelpful bug making PTRACE_SINGLESTEP behave
inconsistently with one particular sort of user insn vs all others.  I
didn't want to keep arguing it out at the time.

I included the step flag when formulating tracehook_report_syscall_exit
with the hope that arch maintainers could be easily convinced to pass in
that flag so at least future generic code (utrace, whatever) could have
enough information for whatever behavior we wanted without more arch work.
That much seems almost to have panned out (some arch's do pass in that
flag).  (Probably the biggest reason that x86 does what it does and others
don't is just that it has the send_sigtrap() helper so it's a clean-looking
one-line call.)

As to the ptrace behavior, I don't think it much matters one way or the
other about the two stops.  If you just used PTRACE_SYSCALL (twice) and so
are now at the syscall-exit stop, and then use PTRACE_SINGLESTEP, you
really don't need an extra stop with exactly the same register state.  The
only thing that can happen is a signal, and ptrace is always going to give
you an intervening stop for that anyway.  OTOH, I never thought it was
right to give a SYSGOOD status code in response to PTRACE_SINGLESTEP that
happens to be over a syscall insn, so neither the x86 status quo nor the
powerpc status quo really floats my boat.

The tracehook and utrace layers have yet another set of considerations that
never came up with ptrace.  You can't use both PTRACE_SINGLESTEP and
PTRACE_SYSCALL and simultaneously.  But in utrace you can single-step and
get syscall-exit callbacks, either both by one engine or some engines doing
one and some doing the other.

I don't have any conclusions or prescriptions about any of this right at
the moment.  It just seems important to cite that all this is kicking
around to begin with before I threw in the new wrinkle of utrace not
wanting to stop inside tracehook_report_syscall_exit but instead only later
after its caller will have returned.

> 	- perhaps it makes sense to change force_sig_info() to use
> 	  lock_task_sighand(), then we can avoid taking taskist around
> 	  send_sigtrap().

This feels like an indicator that it would be better to call send_sigtrap
in the tracee instead.

> > 	- I guess, we should also send SIGTRAP when SINGLESTEP resumes
> > 	  the tracee from SYSCALL_EXIT stop. This is trivial, but I need
> > 	  to write the test-case first.
> 
> Yes, with this change the kernel passes all tests plus the test-case
> below. I wrote it because I lost the hope to understand the required
> behaviour ;)

We should get a case for this into ptrace-tests and then look into (with
Jan) what we want to do about consistency across arch's.

> Note also the couple of "assert(regs.rax == -ENOSYS)" which were commented
> out, here we differ from vanilla kernel and this can't be "fixed". But,
> as you pointed out previously, we do not need to be bug-compatible here.

Right.  The test can assert(regs.rax == -ENOSYS || regs.rax == child_pid).
We want those sites clearly commented to go along with the explanation to
upstream about the intentional semantics change.

Thanks,
Roland