Q: ->attaching && REPORT_CALLBACKS()

Roland McGrath roland at redhat.com
Wed Mar 11 00:11:37 UTC 2009


> The comment:
> 
> 	* When target == current, it would be safe just to call
> 	* splice_attaching() right here.  But if we're inside a
> 	* callback,
> 
> just to clarify, "inside a callback" means inside utrace_report_xxx(),
> not only inside utrace_engine_ops->report_xxx(), right?

Certainly what I mean when I say "a callback" is one of the functions whose
pointer lives in struct utrace_engine_ops.  But I don't see how the
distinction you make could even be meaningful here.  A utrace_attach_task()
call "inside utrace_report_foo()" could only possibly mean one made by a
->report_foo() function utrace_report_foo() calls, since obviously there
are no hard-wired utrace_attach_task() calls in utrace.c itself.

> 	            that would mean the new engine also gets
> 	* notified about the event that precipitated its own
> 	* creation.
> 
> engine->flags == 0, so it should not be notified until the caller
> does utrace_set_events() later, right?

Right.  The case in question is a callback doing:

	new_engine = utrace_attach_task(current, ...);
	utrace_set_events(new_engine, <includes this event>);

Then new_engine would get the "this event" callback at the end of the very
same reporting loop containing its creation.  (This is what happened before
I changed the code as the comment describes.)
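
To make that concrete, here is a rough sketch of why an immediate splice
misbehaves.  This is not the real REPORT_CALLBACKS() macro, and the
struct member names are from memory, so treat the details as
illustrative only:

	/* Sketch of a reporting pass walking ->attached. */
	struct utrace_engine *engine;

	list_for_each_entry(engine, &utrace->attached, entry) {
		if (!(engine->flags & UTRACE_EVENT(SYSCALL_ENTRY)))
			continue;	/* engine didn't ask for this event */

		/*
		 * If this callback does the attach + set_events sequence
		 * above and the new engine were spliced onto ->attached
		 * right away, this same loop would reach the new engine
		 * a few iterations later, see the event bit it just set,
		 * and report the event that precipitated its creation.
		 */
		engine->ops->report_syscall_entry(/* args elided */);
	}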

> 	             This is not what the user wants.
> 
> It is not clear to me why the user doesn't want this.

Jim Keniston is the user who doesn't want this.
https://www.redhat.com/archives/utrace-devel/2008-December/msg00051.html

> I understand this as follows. If we add the new engine to the ->attached
> list, and if the target is inside a callback, the target can later race
> with (say) utrace_set_events(). The target can see "engine->flags & event"
> and call start_callback/finish_callback before utrace_set_events() completes.

It's not a race question.  There are no guarantees for such races.  (That
is, utrace_set_events() calls on target!=current make no guarantee about
reporting an event that might already have started.  Only if the target was
already stopped by your engine when you made the call can you be sure that
no such event can be in progress.)

The scenario we are talking about here is fully synchronous.  The target
itself is inside a callback, calling utrace_set_events() on itself.
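
For contrast, the only pattern that does give you a guarantee for
target != current is roughly the following.  (A sketch: the
utrace_control()/utrace_set_events() signatures and return conventions
are as I recall them, and new_events is just a placeholder, so take the
details as assumptions.)

	int ret;

	/* Quiesce the target through your own engine first. */
	ret = utrace_control(target, engine, UTRACE_STOP);
	if (ret == 0) {
		/*
		 * Already stopped for us: none of our callbacks can be
		 * in flight, so the new mask cannot race with a report
		 * that has already started.
		 */
		ret = utrace_set_events(target, engine, new_events);
	} else {
		/* e.g. -EINPROGRESS: the stop is still pending; wait
		   for our quiescence callback before relying on it. */
	}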

> Another question. In any case I don't understand why we really need
> two lists.

We want the common case of a reporting pass to take no locks.  The
->attached list is never touched when the target is not quiescent.  (The
target uses the lock to synchronize when it transitions between being
quiescent and not.)  Once you believe the quiescence logic, this makes
it easy to be confident about the unlocked use of that list in reporting
passes.  It's used in totally vanilla ways, and modified in totally
vanilla ways.
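
A minimal sketch of that division of labor as just described
(simplified, not the literal utrace.c code; member names from memory):

	/* Attach side: may run at any time, so it takes the lock and
	   touches only ->attaching. */
	static void add_engine(struct utrace *utrace,
			       struct utrace_engine *engine)
	{
		spin_lock(&utrace->lock);
		list_add_tail(&engine->entry, &utrace->attaching);
		spin_unlock(&utrace->lock);
	}

	/* Runs only with utrace->lock held and the target known to be
	   quiescent, at its transition back toward running. */
	static void splice_attaching(struct utrace *utrace)
	{
		list_splice_tail_init(&utrace->attaching, &utrace->attached);
	}

	/* The reporting pass runs on the target itself while it is not
	   quiescent; it walks only ->attached and so takes no lock. */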

> Let's suppose we implement the new trivial helper,
> 
> 	list_for_each_entry_xxx(pos, head, tail, member)
> 
> it stops when "pos" reaches "tail", not "head". Then REPORT_CALLBACKS()
> can just read "tail = utrace->attached->prev" (under ->lock, or
> utrace_add_engine() can use list_add_rcu) before list_for_each_entry_xxx.
> 
> This way we can kill ->attaching, no?

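For concreteness, my reading of the helper you're proposing is something
like the following; whether "tail" itself should still be visited is my
guess at the intended semantics:

	/* Visit entries from the first through "tail" inclusive, so
	   anything appended after "tail" was sampled is never seen. */
	#define list_for_each_entry_xxx(pos, head, tail, member)	       \
		for (pos = list_entry((head)->next, typeof(*pos), member);     \
		     &pos->member != (head);				       \
		     pos = (&pos->member == (tail))			       \
			 ? list_entry((head), typeof(*pos), member)	       \
			 : list_entry(pos->member.next, typeof(*pos), member))
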
This is a lot like what the old utrace code did before the introduction
of the two lists (when the engine struct was managed using RCU).  This
is just an optimization over what we have now.  It saves the ->attaching
space (i.e. two words in struct utrace), the splice_attaching() logic
(pretty cheap), and the sometimes-superfluous resume report after an
attach.  The cost for this is some very touchy fine-grained complexity
in convincing ourselves (and reviewers) that the list traversal and
modification is always correct.

I've already implied that anything taking any locks for every vanilla
reporting pass is a non-starter.  I'm asserting a preemptive optimization
here because this is the case that most matters to optimize.  The
overhead of a reporting pass applies to situations like every system
call entry, with an engine callback that quickly filters out the vast
majority of calls (i.e. "if (regs->foo != __NR_bar) return;" or
something about that cheap).  So we think about the reporting pass
overhead as something that might be done a million times a second, and
accordingly think carefully about that hot path.  In contrast, we are
talking here about optimizing attach, and saving a couple of words of
data structure space that will already be cache-hot.
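
For illustration, the kind of quickly-filtering callback I mean looks
roughly like this.  (The report_syscall_entry() signature has varied
across utrace revisions and the syscall-number register is
arch-specific, so the names here are assumptions, not the real API.)

	static u32 my_report_syscall_entry(u32 action,
					   struct utrace_engine *engine,
					   struct pt_regs *regs)
	{
		/* Hot path: almost every syscall bails out right here. */
		if (regs->orig_ax != __NR_open)
			return UTRACE_RESUME;

		/* The rare interesting case does its real work here. */
		return UTRACE_RESUME;
	}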

We are not actually following any RCU rules at all, so to use
list_add_tail_rcu would really just mean that we are relying on our own
fancy special list mutation scheme and proving/documenting that it is
correct.  It just happens to have the same implementation details as
list_add_tail_rcu, and we must either copy those innards and document
why they are right in our uses, or document how the list_add_tail_rcu
innards happen to match what is right for our uses and keep track of any
future implementation changes in rculist.h that might diverge from what
we rely on.  That proof and documentation entails hairy logic about SMP
ordering and memory barriers and so forth.  Frankly, that all seems like
much more touchy hair than the utrace-indirect logic (for less benefit),
and we've already decided to avoid that for the first cut because LKML
reviewers found it too hairy to contemplate.
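
To give a flavor of it, the sort of thing we would have to spell out and
defend is roughly the following (simplified pseudo-kernel code, not
anything we actually have):

	/* Writer (attach) publishing a new engine at the tail of
	   ->attached while a lockless reporting pass may be walking it: */
	engine->entry.next = &utrace->attached;
	engine->entry.prev = utrace->attached.prev;
	smp_wmb();		/* new node's own pointers visible first */
	utrace->attached.prev->next = &engine->entry;
	utrace->attached.prev = &engine->entry;

	/*
	 * The reader (reporting pass) follows ->next with no lock, so
	 * each load needs rcu_dereference()-style ordering, and we would
	 * have to prove that a node is never freed or reused while any
	 * reader can still reach it.  That is the proof burden the
	 * ->attaching list lets us skip.
	 */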

Next, consider that e.g. Renzo Davoli has proposed reversing the engine
order used for certain reporting passes (syscall entry vs exit having
inverse order).  (I'm not discussing the merits of that change, it's
just an example.)  Right now, a change like that would be a simple
choice about the desirable API, with no implementation complexity to
worry about at all, just s/list_for_each/&_reverse/.  In the same vein,
we contemplate for the future having engines on a priority list or some
other means to add new engines somewhere other than at the end of the
list.  When we've resolved what interfaces we want for that, it will be
straightforward to implement whatever it is using normal list.h calls
(or even to use a different kind of list data structure).  Many choices
like those are likely to conflict with whatever clever SMP-safe list
magic we commit ourselves to if we start on that sort of optimization now.
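
With plain list.h iteration, the entire difference between the two
orders is schematically just:

	list_for_each_entry(engine, &utrace->attached, entry)
		/* report syscall entry to each engine, front to back */;

	list_for_each_entry_reverse(engine, &utrace->attached, entry)
		/* report syscall exit to each engine, back to front */;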

I could easily be quite wrong about the performance trade-offs of a lock
vs splice_attaching() and cache effects, etc.  But before we get to
worrying about that performance in great detail, the complexity argument
stands.  This is an optimization to consider later on, both after the
upstream review has accepted the simpler code into the kernel to begin
with, and after we have gotten a more mature set of uses of the API and
refined the details of the API semantics on merits broader than such
micro-optimization.


Thanks,
Roland



