Utrace .report_jctl serialization issue ?

dave baukus dbaukus at conveycomputer.com
Thu Nov 11 20:38:25 UTC 2010


Is a utrace engine with .report_jctl enabled supposed to handle
do_notify_parent_cldstop(current, notify) processing for the last
stopping task?  Or should it muck with task->ptrace to force
tracehook_notify_jctl() to return a non-zero value?

I ask because I have a simple multi-threaded process with a utrace engine
attached to the process group leader; .report_jctl is enabled.
If I SIGTSTP the process, occasionally control is not returned to the
shell.
On my 2.6.32-44.2.el6 kernel this happens because utrace_report_jctl()
drops task->sighand->siglock with spin_unlock_irq(), which breaks the
serialization on sig->group_stop_count that do_signal_stop() relies on
for its do_notify_parent_cldstop(current, notify) processing.

Let me explain:
Consider the following code fragment from do_signal_stop() in
kernel/signal.c, as released in RHEL's 6x beta2 2.6.32-44.2.el6 kernel:
1707         /*
1708          * If there are no other threads in the group, or if there is
1709          * a group stop in progress and we are the last to stop, report
1710          * to the parent.  When ptraced, every thread reports itself.
1711          */
1712         notify = sig->group_stop_count == 1 ? CLD_STOPPED : 0;
1713         notify = tracehook_notify_jctl(notify, CLD_STOPPED);
1714         /*
1715          * tracehook_notify_jctl() can drop and reacquire siglock, so
1716          * we keep ->group_stop_count != 0 before the call. If SIGCONT
1717          * or SIGKILL comes in between ->group_stop_count == 0.
1718          */
1719         if (sig->group_stop_count) {
1720                 if (!--sig->group_stop_count)
1721                         sig->flags = SIGNAL_STOP_STOPPED;
1722                 current->exit_code = sig->group_exit_code;
1723                 __set_current_state(TASK_STOPPED);
1724         }
1725         spin_unlock_irq(&current->sighand->siglock);
1726
1727         if (notify) {
1728                 read_lock(&tasklist_lock);
1729                 do_notify_parent_cldstop(current, notify);
1730                 read_unlock(&tasklist_lock);
1731         }
1732
1733         /* Now we don't run again until woken by SIGCONT or SIGKILL */
1734         do {
1735                 schedule();
1736         } while (try_to_freeze());
1737
1738         tracehook_finish_jctl();
1739         current->exit_code = 0;
1740
1741         return 1;

For the sake of discussion:

* Let the task group have 2 tasks;
   therefore initially sig->group_stop_count == 2

* For both tasks, task_ptrace(current) returns zero
   (see tracehook_notify_jctl() for why this matters)

* Let task1 be the process group leader and let it be the first task to
   execute do_signal_stop()

* Let task1 have a utrace engine attached with .report_jctl enabled and let
   all engine ops be no-ops that simply return UTRACE_RESUME
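
Why task_ptrace() matters in the assumptions above can be sketched with a
small user-space model. This only illustrates the return-value behaviour
described in this message and is not the kernel source; model_notify_jctl(),
its "ptraced" parameter and MODEL_CLD_STOPPED are made-up names, and the
no-op engine callback is reduced to a comment.

```c
/* Hypothetical user-space model, not kernel code. */
#define MODEL_CLD_STOPPED 5   /* placeholder marker for "stopped" */

/*
 * Models the value returned by tracehook_notify_jctl(notify, why):
 * a non-zero notify is passed through unchanged, and a zero notify
 * becomes "why" only when the task is ptraced.  A utrace engine whose
 * .report_jctl is a no-op does not alter the result.
 */
int model_notify_jctl(int notify, int why, int ptraced)
{
    if (notify)
        return notify;
    return ptraced ? why : 0;
}
```

With ptraced == 0 and notify == 0, which is the case assumed for both tasks,
the model returns 0, so only the group_stop_count test can produce a report.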

Now when I send a SIGTSTP via ctrl-z on the terminal of this
multi-threaded process, the following can happen:

* at line 1713 task1 calls tracehook_notify_jctl() with notify == 0
   because sig->group_stop_count == 2

* Because task1 has a utrace engine with .report_jctl, it releases
   task->sighand->siglock in utrace_report_jctl()

* Now task2 may enter do_signal_stop() with the task->sighand->siglock held.

* For task2 sig->group_stop_count == 2 is still true because task1 is
   either off executing utrace code or waiting on task->sighand->siglock
   held by task2; task1 has not executed line 1720

* For task2 because sig->group_stop_count == 2 and because
   tracehook_notify_jctl(notify, CLD_STOPPED) returns zero, notify == 0

* Therefore when task2 executes line 1727 do_notify_parent_cldstop() is
   not executed.

* After task2 releases the lock, task1 continues, but unfortunately its
   "notify" cookie was set while sig->group_stop_count == 2, and
   tracehook_notify_jctl(notify, CLD_STOPPED) returned zero because
   notify was initially zero and task_ptrace(current) returned zero.

* Therefore for task1, after tracehook_notify_jctl(), notify == 0

* Finally, when task1 executes line 1727 do_notify_parent_cldstop() is not
   executed.
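
The sequence above can be replayed as a single-threaded user-space model.
This is a rough sketch under the stated assumptions (two tasks, neither
ptraced, no SIGCONT or SIGKILL arriving in between), not kernel code; every
model_* name and MODEL_CLD_STOPPED are invented for the illustration, and
the siglock drop is represented only by the order of the statements.

```c
/* Hypothetical single-threaded model of the interleaving; not kernel code. */
#define MODEL_CLD_STOPPED 5   /* placeholder marker for "stopped" */

struct model_signal {
    int group_stop_count;
};

/* Lines 1712-1713 with task_ptrace(current) == 0: the hook leaves notify as is. */
static int model_pick_notify(const struct model_signal *sig)
{
    return sig->group_stop_count == 1 ? MODEL_CLD_STOPPED : 0;
}

/* Lines 1719-1724 reduced to the counter update. */
static void model_commit_stop(struct model_signal *sig)
{
    if (sig->group_stop_count)
        --sig->group_stop_count;
}

/* siglock is never dropped: each task runs the whole section in turn. */
int model_serialized(void)
{
    struct model_signal sig = { 2 };
    int reports = 0;

    int notify1 = model_pick_notify(&sig);  /* count == 2, notify1 == 0 */
    model_commit_stop(&sig);                /* count 2 -> 1 */
    reports += notify1 != 0;

    int notify2 = model_pick_notify(&sig);  /* count == 1, notify2 set */
    model_commit_stop(&sig);                /* count 1 -> 0 */
    reports += notify2 != 0;
    return reports;                         /* number of parent reports */
}

/* task1 loses siglock between picking notify and updating the counter. */
int model_interleaved(void)
{
    struct model_signal sig = { 2 };
    int reports = 0;

    int notify1 = model_pick_notify(&sig);  /* task1: count == 2, so 0 */
    /* task1 drops siglock inside utrace_report_jctl() at this point */

    int notify2 = model_pick_notify(&sig);  /* task2: count still 2, so 0 */
    model_commit_stop(&sig);                /* count 2 -> 1 */
    reports += notify2 != 0;

    model_commit_stop(&sig);                /* task1 resumes: count 1 -> 0 */
    reports += notify1 != 0;                /* stale notify1 is still 0 */
    return reports;
}
```

In this model model_serialized() counts one report to the parent, while
model_interleaved() counts none even though both tasks have stopped.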

The result is a control-z that does not return control to the parent
because line 1729 was never executed.  One possible fix is to re-examine
sig->group_stop_count after tracehook_notify_jctl() with something like:

         notify = notify ?: sig->group_stop_count == 1 ? CLD_STOPPED : 0;
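
To see what that re-check would change, here is a small user-space model of
the same two-task sequence with and without it. It is a sketch only, not a
tested kernel patch: the model_* names and MODEL_CLD_STOPPED are invented,
neither task is ptraced, no SIGCONT or SIGKILL arrives in between, and the
GNU "?:" shorthand is spelled out in plain C.

```c
#include <stdbool.h>

/* Hypothetical model of the suggested re-check; not kernel code. */
#define MODEL_CLD_STOPPED 5   /* placeholder marker for "stopped" */

struct model_signal {
    int group_stop_count;
};

/* Lines 1712-1713 with task_ptrace(current) == 0. */
static int model_pick_notify(const struct model_signal *sig)
{
    return sig->group_stop_count == 1 ? MODEL_CLD_STOPPED : 0;
}

/* The suggested line: keep a non-zero notify, else look at the count again. */
static int model_recheck(int notify, const struct model_signal *sig)
{
    if (notify)
        return notify;
    return sig->group_stop_count == 1 ? MODEL_CLD_STOPPED : 0;
}

/* Lines 1719-1724 reduced to the counter update. */
static void model_commit_stop(struct model_signal *sig)
{
    if (sig->group_stop_count)
        --sig->group_stop_count;
}

/* Replays the problematic interleaving; returns the number of parent reports. */
int model_reports(bool recheck)
{
    struct model_signal sig = { 2 };
    int reports = 0;

    /* task1 picks notify, then loses siglock inside the utrace hook */
    int notify1 = model_pick_notify(&sig);

    /* task2 runs the whole section while task1 is away */
    int notify2 = model_pick_notify(&sig);
    if (recheck)
        notify2 = model_recheck(notify2, &sig);  /* count is still 2 */
    model_commit_stop(&sig);                     /* count 2 -> 1 */
    reports += notify2 != 0;

    /* task1 resumes with siglock held again */
    if (recheck)
        notify1 = model_recheck(notify1, &sig);  /* count is now 1 */
    model_commit_stop(&sig);                     /* count 1 -> 0 */
    reports += notify1 != 0;
    return reports;
}
```

Under these assumptions model_reports(false) counts no report, matching the
hang, and model_reports(true) counts exactly one, delivered by task1 once it
sees the updated count.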



