ptrace cleanup tasks

Thu Apr 16 08:38:16 UTC 2009

Sorry for the delay in replying today.  I fell into a deep pit of DWARF
arcana all day and let your message sit (and not even the regularly
scheduled DWARF arcana I had intended to work on this week!).

> Yes sure. But I think this part is most important and most hard, and
> other changes may depend on locking very much.

True enough.  Off hand I don't think most of the individual data structure
cleanup items I suggested differ much from one kind of locking to another.
But if you are actively busy cogitating on the locking, no reason not to
keep at that first.  (I just didn't want anything easier to get blocked
while waiting for feedback from me.)

> Sure! The problem is I still can't see how can we do this without
> complicating the current code "too much". I spent several hours doing
> nothing except thinking about these changes, but all I can say now is:
> I need to think more ;) And of course I appreciate your ideas.

I figured it might be tough, but more or less all the big hopes depend on
it being possible to make it cleaner by going this way.

> Just for example, it would be really nice to avoid taking current->ptrace_mutex
> in exit_ptrace() when there are no tracees, but I don't see how can we do this
> without races with ptrace_traceme(). But this is a minor detail.

Right.

> Consider 2 threads, T1 and T2. T2 forks the child C.
> 
> 	T1 calls do_wait(). It scans T->children and finds nothing.

Scans which?  T1->children scan will see nothing.  So say it has
__WNOTHREAD, so then T1 does not also scan T2->children later.

> 	Then it calls ptrace_do_wait() which temporary drops tasklist.
> 
> 	T2 exits, reparents C to T1.

If C is zombie that does do_notify_parent->wake_up_parent.  
If C is stopped or running, no wakeup.

> 	T1 finds nothing interesting in ->ptraced (not that it matters, but
> 	it is possible the list was not empty when we enter ptrace_do_wait,
> 	but it is empty when we take ->ptrace_mutex).

Right, fine.  BTW, we need to remember to __set_current_state back to
TASK_INTERRUPTIBLE before taking the tasklist_lock again (after the mutex
et al presumably changed ->state).

> 	T1 takes tasklist again. We check tsk == T1 has not exited (or we
> 	some other check) and continue.
> 
> 	do_wait() returns -ECHILD.

With __WNOTHREAD, yes.  And that's fine.  It was true when the wait call
was made, and it didn't block.  From the user's perspective, the whole wait
call happened "before" T2 finished exiting.

Without __WNOTHREAD, then T1 scanned T2->children (and T2->ptraced) after
the "T1 takes tasklist again" step in your scenario, but C was already
gone.  If C was already zombie when reparented, T2 did wake_up_parent while
T1 had released tasklist_lock.  But if there were no other eligible
children, then it returns -ECHILD before even trying to block.

> We need something more clever here.

Yes.  I think we can do it with two hacks, which means no more than four or
five actual hacks.  First, make reparenting always do a wake_up_parent, not
just for zombies.  Second, upon retaking tasklist_lock, ensure that if
we've been woken, we reset *notask_error = 0 so we'll schedule() and not
really block, and then restart; or if WNOHANG, just return -ERESTARTNOINTR
(or a chosen retval to goto repeat without actual syscall return/restart,
e.g. -EAGAIN and do_wait checks for that so we don't set TIF_SIGPENDING).

@@ -1611,6 +1611,13 @@ repeat:
 	} while (tsk != current);
 	read_unlock(&tasklist_lock);
 
+	if (retval == -EAGAIN) {
+		if (signal_pending(current))
+			retval = -ERESTARTSYS;
+		else
+			goto repeat;
+	}
+
 	if (!retval && !(options & WNOHANG)) {
 		retval = -ERESTARTSYS;
 		if (!signal_pending(current)) {

For "if we've been woken", it might make sense to roll this in with dusting
off and finishing up the patch that optimizes do_wait() with __WNOTHREAD or
pid constraints.  It uses a custom wake_queue.func function so that a
blocked do_wait() uninterested in the dying/stopping waker does not get
woken up just to spin around the list and block.  

The custom wake function could also do something like use a
container_of(wait_queue,,) to find the notask_error ("retval" in do_wait)
inside a larger struct on the do_wait stack, and poke it to -EAGAIN.  That
happens under the wait_queue_head lock, so it's synchronized finished after
remove_wait_queue().  Maybe also needs:

@@ -1622,6 +1629,11 @@ repeat:
 end:
 	current->state = TASK_RUNNING;
 	remove_wait_queue(&current->signal->wait_chldexit,&wait);
+	if ((!retval || retval == -ECHILD) && wait.notask_error == -EAGAIN) {
+		if (signal_pending())
+			return -ERESTARTSYS;
+		goto start_wait_queue;
+	}
 	if (infop) {
 		if (retval > 0)
 			retval = 0;


Then every wake_up_parent, including the new one for all reparentings,
either makes schedule() return, or makes a dry scan after retaking
tasklist_lock do a repeat from the top.

> Yes... but we can have some nasty corener cases which are not bugs, but
> still not good.

There's probably only "acceptable" and "wrong" for these corners.  (If it's
really "not good" so as to care about, it's close enough to call it a bug.)

> Two threads T1 and T2. T1 has a TASK_STOPPED child C. T2 in the middle
> of sys_ptrace(PTRACE_ATTACH, C).

You are a bad, bad man.

> T1 does do_wait(WSTOPPED). It is possible that we already see C->ptrace
> (so do_wait_thread(T1->children) just clears *notask_error), but we
> don't see C in T2->ptraced list.

"In the middle" of PTRACE_ATTACH means T2 holds T2->ptrace_mutex.  Before
it takes the mutex, C->ptrace is clear (unless racing just after D does
PTRACE_DETACH to make T2's PTRACE_ATTACH possible).

T1 takes T2->ptrace_mutex to examine T2->ptraced.  If T2 has just set
C->ptrace, it both did so and put C on T2->ptraced while holding the mutex.
T1 will see C there.

Before that, say C->ptrace was set from prior attach by D.  T1 saw
C->ptrace in do_wait_thread and did not report C.  Now, D does
sys_ptrace(PTRACE_DETACH, C), so clears C->ptrace while holding
D->ptrace_mutex.  Next T1 gets T2->ptrace_mutex before T2's racing
PTRACE_ATTACH to C.  It does not see C there.  Now T1 blocks.  

This feels similar to the first scenario above.  wake_up_parent in
PTRACE_DETACH might do it similar to the scheme above.  Then T1 restarts
(even for WNOHANG).  On the second pass, either it sees !C->ptrace in
do_wait_thread and reports it stopped as natural parent (correct for it
winning the race with PTRACE_ATTACH), or it sees C->ptrace already set
again by T2 and then finds C on T2->ptraced.

> Oh. Will try to think more ;)

I keep tryin' to think but nothin' happens!

The original reorganization of do_wait into do_wait_thread/ptrace_do_wait
was motivated by eventually doing this conversion.  If there is a different
way to wholly reorganize it again that makes this cleaner, we can consider
that too.


Thanks,
Roland