From gaspa at yattaweb.it  Thu Jul 26 09:34:44 2007
From: gaspa at yattaweb.it (Andrea Gasparini)
Date: Thu, 26 Jul 2007 11:34:44 +0200
Subject: First post?
Message-ID: <200707261134.44922.gaspa@yattaweb.it>

Hi, 
i'm glad to be the first post in this list.... :-D

Thanks roland.
I'll ask you a bunch of questions soon.

Bye!
-- 
-gaspa-
-----------------------------------------------
--------- Powered by Debian GNU/Linux ---------
------ HomePage: iogaspa.altervista.org -------
-Il lunedi'dell'arrampicatore: www.lunedi.org -


From renzo at cs.unibo.it  Thu Jul 26 09:42:12 2007
From: renzo at cs.unibo.it (Renzo Davoli)
Date: Thu, 26 Jul 2007 11:42:12 +0200
Subject: the first question (about the ML)
Message-ID: <20070726094212.GB9546@cs.unibo.it>

Hi Roland,

it is possible to put a tag on the ML messages? (I mean [utrace-devel]
in the subject)?
I receive a huge amount of mail... I can miss messages without the tag
(I can miss them even with the tag, but at a lower probability ;-)

tnx
	renzo


From roland at redhat.com  Thu Jul 26 09:48:27 2007
From: roland at redhat.com (Roland McGrath)
Date: Thu, 26 Jul 2007 02:48:27 -0700 (PDT)
Subject: the first question (about the ML)
In-Reply-To: Renzo Davoli's message of  Thursday,
	26 July 2007 11:42:12 +0200 <20070726094212.GB9546@cs.unibo.it>
Message-ID: <20070726094827.29E2E4D058D@magilla.localdomain>

I personally despise list tags and they are not the practice on mosts lists
I'm on.  There are the various List- headers and X-loop to make filtering easy.


Thanks,
Roland


From roland at redhat.com  Thu Jul 26 21:08:40 2007
From: roland at redhat.com (Roland McGrath)
Date: Thu, 26 Jul 2007 14:08:40 -0700 (PDT)
Subject: welcome to utrace-devel
Message-ID: <20070726210840.B2B5C4D058D@magilla.localdomain>

Now that several people have joined the list, I'd like to kick
things off with some pointers.

Some past discussions about utrace details have taken place on the
systemtap at sourceware.org and frysk at sourceware.org mailing lists.
If some posts here read like someone started in the middle of an
ongoing discussion, it's because they did.  If you are missing
some context, please look for it in those lists' archives at
http://sourceware.org/ml/systemtap/ and http://sourceware.org/ml/frysk/.

Next, I call your attention to http://sourceware.org/systemtap/wiki/utrace
This is an empty placeholder page now, but can be a place to hang
informal stuff about utrace.  I hope that folks on this list will
want either to edit that wiki or to house a utrace wiki somewhere
else if that's more convenient.  I'm lousy at writing up web and
wiki pages about stuff, but I'm always droning on with details in email.  
It would be great if people cull info from threads here for wiki
pages about the state of things.

Here are a few subjects I imagine people might want to start
talking about.  We have a whole list here, so please start new,
narrow threads with appropriate Subject lines, don't just discuss
everything in followups to this message.


* future work
  http://people.redhat.com/roland/utrace/TODO is a list in terms
  so vague they probably only make sense to me.  That can be
  elaborated, wikified, etc.

* known bugs
  I have a few, I will put together a list.

* arch porting status
  I'm not sure anyone has read utrace-arch-porting-howto.txt lately.  
  Perhaps I should be posting to linux-arch more trying to drum up
  interest in doing ports.

* utrace back into -mm?  (upstream integration status generally)
  I don't really know anything you don't know, but maybe we could know more.


Thanks,
Roland


From gaspa at yattaweb.it  Sat Jul 28 14:51:50 2007
From: gaspa at yattaweb.it (Andrea Gasparini)
Date: Sat, 28 Jul 2007 16:51:50 +0200
Subject: my support in utrace.
Message-ID: <200707281651.50750.gaspa@yattaweb.it>

Hi roland, hi all.

I'm finishing another work that has nothing to do with linux, and in a few 
days i'll be ready to give some contribution.

So, 
- now, i'm searching all discussions about utrace, and i'll add to a wiki 
page. If someone could give link via mailing list, i'll put it on the 
wiki. (disclaimer: i'm quite "untidy", so perhaps someone would sort the 
information... :-P )

- after that: i would take under my hands the uml port of utrace, do you 
know if some work has already done? 

In order to take a look i just imported roland git repository. I found that 
there isn't any master branch (which in all tutorial/howto is the default 
when you clone a repository).
It don't give to me problem, just curious.

Bye!
-- 
-gaspa-
-----------------------------------------------
--------- Powered by Debian GNU/Linux ---------
------ HomePage: iogaspa.altervista.org -------
-Il lunedi'dell'arrampicatore: www.lunedi.org -


From gaspa at yattaweb.it  Sat Jul 28 17:23:27 2007
From: gaspa at yattaweb.it (Andrea Gasparini)
Date: Sat, 28 Jul 2007 19:23:27 +0200
Subject: my support in utrace.
In-Reply-To: <200707281651.50750.gaspa@yattaweb.it>
References: <200707281651.50750.gaspa@yattaweb.it>
Message-ID: <200707281923.27489.gaspa@yattaweb.it>


> - after that: i would take under my hands the uml port of utrace, do you
> know if some work has already done?

I saw that Dike already done some patch in order to make uml work with 
utrace.
( from http://lkml.org/lkml/2007/3/16/407 and the file 
http://people.redhat.com/roland/utrace/2.6-current/utrace-tracehook-um.patch )

But from what I read, it seems that is doens't work (or compile).

So, what's the state of art? 

Thanks, bye!
-- 
-gaspa-
-----------------------------------------------
--------- Powered by Debian GNU/Linux ---------
------ HomePage: iogaspa.altervista.org -------
-Il lunedi'dell'arrampicatore: www.lunedi.org -


From renzo at cs.unibo.it  Sat Jul 28 20:06:04 2007
From: renzo at cs.unibo.it (Renzo Davoli)
Date: Sat, 28 Jul 2007 22:06:04 +0200
Subject: my support in utrace.
In-Reply-To: <200707281923.27489.gaspa@yattaweb.it>
References: <200707281651.50750.gaspa@yattaweb.it>
	<200707281923.27489.gaspa@yattaweb.it>
Message-ID: <20070728200604.GB24588@cs.unibo.it>

> ( from http://lkml.org/lkml/2007/3/16/407 and the file 
> http://people.redhat.com/roland/utrace/2.6-current/utrace-tracehook-um.patch )

AFAIK Jeff's patch just eliminates ptrace code from um arch.
I tried it some time ago, it worked but ptrace was missing.
I don't know if something got broken from the time of the test.

	renzo


From roland at redhat.com  Mon Jul 30 03:22:00 2007
From: roland at redhat.com (Roland McGrath)
Date: Sun, 29 Jul 2007 20:22:00 -0700 (PDT)
Subject: my support in utrace.
In-Reply-To: Andrea Gasparini's message of  Saturday,
	28 July 2007 16:51:50 +0200 <200707281651.50750.gaspa@yattaweb.it>
Message-ID: <20070730032200.609934D058B@magilla.localdomain>

> - after that: i would take under my hands the uml port of utrace, do you 
> know if some work has already done? 

Jeff Dike expressed interest in doing it, but I don't think he has done any
of the regset work.  What's there now (in utrace-tracehook-um.patch) is
just the tracehook branch part of the work, which is enough simply to get
UML to compile and work again with CONFIG_UTRACE=n (and hence, no ptrace).

UML is different from a normal port, because it is not an architecture
itself, but is a different variant port of each other architecture.  The
regset details are already chosen for each arch, and just need to be
implemented compatibly in the UML context.

> In order to take a look i just imported roland git repository. I found that 
> there isn't any master branch (which in all tutorial/howto is the default 
> when you clone a repository).
> It don't give to me problem, just curious.

I am no great expert on git, and even less when I started with it.  The
git.fedoraproject.org servers don't let me do much other than push heads,
so I cannot easily clean up the repository there like I clean up my own
local repositories.  The right thing would be to have a refs/heads/master
there is a symbolic ref for refs/heads/utrace-ptrace-compat.


Thanks,
Roland


From jdike at linux.intel.com  Mon Jul 30 13:48:26 2007
From: jdike at linux.intel.com (jdike)
Date: Mon, 30 Jul 2007 09:48:26 -0400
Subject: my support in utrace.
In-Reply-To: <20070730032200.609934D058B@magilla.localdomain>
Message-ID: <000401c7d2b0$4e332440$be347f0a@amr.corp.intel.com>

> Jeff Dike expressed interest in doing it, but I don't think he has done 
> any of the regset work.  What's there now (in utrace-tracehook-um.patch) 
> is just the tracehook branch part of the work, which is enough simply to 
> get UML to compile and work again with CONFIG_UTRACE=n (and hence, no 
> ptrace).

Correct.

And I welcome anyone doing the UML utrace work :-)  I'll be happy to provide
whatever help is needed.

> UML is different from a normal port, because it is not an architecture
> itself, but is a different variant port of each other architecture.  The
> regset details are already chosen for each arch, and just need to be
> implemented compatibly in the UML context.

UML is its own architecture.  It is definitely not a variant of the
underlying arch.  Probably 90% is the same on all arches, as they all have
the same system call interface.  The remaining part is the architecture
stuff which shows through, like the register set (which ptrace/utrace has to
deal with more than the other subsystems).

				Jeff


From cmoller at redhat.com  Wed Aug  1 04:15:13 2007
From: cmoller at redhat.com (Chris Moller)
Date: Wed, 01 Aug 2007 00:15:13 -0400
Subject: A kernel-module interface to utrace.
Message-ID: <46B008D1.60308@redhat.com>

In case anyone's interested, I've written a fairly substantial beginning
of a module that provides a user-space interface to utrace.  The
interface to the module consists of a collection of /proc/utrace entries
that can be accessed via read(), write(), and ioctl().  The module
supports, via a registration procedure, any number of concurrent clients
such as frysk or any other debugger that needs the capabilities utrace
offers, and for each registered client supports any number of attached
processes.

A present, the module provides a number of basic services, including:

    * Registering and unregistering clients.
    * Attaching and detaching processes for registered clients.
    * Running and quiescing attached processes.
    * Extracting the contents of the register sets supported by utrace
      under the existing architecture.
    * Reading process memory.
    * Extracting the environment strings in effect for attached tasks.
    * Extracting the process memory map

(Some of these services are provided courtesy of utrace, some via other
means available within the kernel.)

For every client process, two /proc entries are created,
/proc/utrace/$PID/cmd, and /proc/utrace/$PID/resp (where $PID is that of
the client process).  The first of these, via write() (or pwrite()) and
ioctl() are used by clients to control the behavior of the module and to
synchronously extract information.  The "resp" entry is intended to be
the file associated with a blocking read() and provides a means by which
asynchronous utrace report_* callbacks can notify the client that
various events (such as syscalls, execs, clones, exits, signals, etc.)
have occurred.  (Use of this capability requires two client threads, one
for control operations, the other to wait for report_* events.)

There are a lot of obvious capabilities that need implementing: writing
regs, writing memory, possibly other capabilities usually provided by
various /proc/$PID entries but more efficiently provided by the module,
plus whatever else people want.  A fairly brief look at uprobes suggests
that those capabilities could be given a user-space interface through
the module.

Source code for the module is available via cvs at
sourceware.org:/cvs/frysk, checkout frysk-utrace.  The structure of the
code is still in flux, but at the moment it's:

    <top>/udb-*.[ch]        source for a fake client to test the module
    <top>/utracer/          source tree for the module and its userspace i/f
    <top>/utracer/module    source for the kernel module itself
    <top>/utracer/utracer   source for the userspace i/f to the module
    <top>/utracer/include   headers for use by by the client

The whole lot, including the fake client, the module, and the module
i/f, can be built by pounding in "./do_make"--I haven't gotten around to
doing anything with automake and the like.  You'll need to have the
kernel headers installed.

There are two ways to access the kernel: directly via read()s, write()s,
and ioctl()s to the module (in the utracer/module directory), or through
the interface library in utracer/utracer.  The i/f lib source can be
mined for examples on how to use the direct fileio i/f; the top-level
udb-*.[ch] (mostly--there's still some direct fileio there, but I'm
slowly migrating away from it) offers examples of the use of the lib i/f.

Keep in mind that this is pre-alpha level code that, being kernel stuff,
may slag your computer completely.  (Though the worst it's ever done to
my machines is lock the kernel up, and even that hasn't happened in a
couple of weeks.)

Another pending i/f I'm planning, and the major reason I implemented the
userspace i/f hiding the fileio, is a Java wrapper around the userspace
i/f--the fileio passes data in C structs which I'm fairly sure Java
doesn't grok.  (Caveat: I'm not a Java guy; see the .sig below.)

At the moment, the fake client is i386-only, but all the arch-specific
stuff is (or should be...) contained in the files udb-i386.[ch].   With
a bit of porting and a brighter configure/build process, it shouldn't be
hard to support other archs.  So far as I know, the module and the
interface lib are arch-independent (at least, any arch-dependent bits
are taken care of by the kernel config of the kernel headers).

Have fun; comments welcome,
Chris Moller.

-- 
Chris Moller

  Java: the blunt scissors of programming languages.
      -- Dave Thomas

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/utrace-devel/attachments/20070801/03244b67/attachment.htm>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 252 bytes
Desc: OpenPGP digital signature
URL: <http://listman.redhat.com/archives/utrace-devel/attachments/20070801/03244b67/attachment.sig>

From roland at redhat.com  Thu Aug  2 05:55:39 2007
From: roland at redhat.com (Roland McGrath)
Date: Wed,  1 Aug 2007 22:55:39 -0700 (PDT)
Subject: utrace known bugs
Message-ID: <20070802055540.037754D04B9@magilla.localdomain>

Here are outstanding problems (aside from TODO items) I know of off hand.
These range from bug reports I haven't looked into, to subtle old XXX
comments for which I'll have to dredge up memories of the details.
I hope someone likes to wikify or otherwise organize this list somewhere.

* detach vs report race
  Alexey Dobriyan <adobriyan at sw.ru> identified a race viz
  dead_engine_ops vs engine->flags
  Race explanation makes sense, but so far never been reproduced.
  Needs a little thought.

* https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=243534
  Unknown wedge, maybe stale report

* https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=248532
  "tkill(SIGCONT) is not reported by waitpid()"
  Unexamined

* ia64 RBS scheme
  This is tortured IA64-specific issue that was known from the start but
  never properly handled.  The original contributors of the IA64 port
  dropped the ball on this part of the implementation.  It needs someone
  with coherent understanding of the ia64 RBS hardware and how the kernel
  uses it, to consult.

* https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=207002
  crash, needs investigation

* utrace_inject_signal ENOSYS for non-utrace_get_signal case
  Needs implementation with careful synchronization.
  Probably interface details change for "engine interaction" TODO item,
  making implementation story here different.

* ptrace race conditions
  See XXX in kernel/ptrace.c; need to rethink synchronization/life-cycle
  for ptrace_state.


Thanks,
Roland


From adobriyan at sw.ru  Thu Aug  2 12:31:12 2007
From: adobriyan at sw.ru (Alexey Dobriyan)
Date: Thu, 2 Aug 2007 16:31:12 +0400
Subject: utrace known bugs
In-Reply-To: <20070802055540.037754D04B9@magilla.localdomain>
References: <20070802055540.037754D04B9@magilla.localdomain>
Message-ID: <20070802123112.GD6553@localhost.sw.ru>

On Wed, Aug 01, 2007 at 10:55:39PM -0700, Roland McGrath wrote:
> Here are outstanding problems (aside from TODO items) I know of off hand.
> These range from bug reports I haven't looked into, to subtle old XXX
> comments for which I'll have to dredge up memories of the details.
> I hope someone likes to wikify or otherwise organize this list somewhere.
> 
> * detach vs report race
>   Alexey Dobriyan <adobriyan at sw.ru> identified a race viz
>   dead_engine_ops vs engine->flags
>   Race explanation makes sense, but so far never been reproduced.
>   Needs a little thought.
> 
> * https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=243534
>   Unknown wedge, maybe stale report
> 
> * https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=248532
>   "tkill(SIGCONT) is not reported by waitpid()"
>   Unexamined
> 
> * ia64 RBS scheme
>   This is tortured IA64-specific issue that was known from the start but
>   never properly handled.  The original contributors of the IA64 port
>   dropped the ball on this part of the implementation.  It needs someone
>   with coherent understanding of the ia64 RBS hardware and how the kernel
>   uses it, to consult.
> 
> * https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=207002
>   crash, needs investigation
> 
> * utrace_inject_signal ENOSYS for non-utrace_get_signal case
>   Needs implementation with careful synchronization.
>   Probably interface details change for "engine interaction" TODO item,
>   making implementation story here different.
> 
> * ptrace race conditions
>   See XXX in kernel/ptrace.c; need to rethink synchronization/life-cycle
>   for ptrace_state.

  * unbounded utrace_engine_cache growth
    started from 31a9ef5cfcdbae804e3e180c158bf2352728765a,
    nobody knows why
    testcase: at the end of  http://marc.info/?l=linux-kernel&m=117128445312243&w=2

  * _pointer_ to struct utrace, which I personally count as design bug.
  
    Rationale to fold struct utrace into task_struct is that lifetime
    rules of task_struct are well established, well tested and so on. As
    was demonstrated it also removes much complexity from attaching logic.

There is one more quick crash in rh bugzilla, but I'll post patch here very soon.


From adobriyan at sw.ru  Thu Aug  2 13:08:41 2007
From: adobriyan at sw.ru (Alexey Dobriyan)
Date: Thu, 2 Aug 2007 17:08:41 +0400
Subject: [PATCH] late checking of permissions during PTRACE_ATTACH
Message-ID: <20070802130841.GE6553@localhost.sw.ru>

ptrace_attach() does permissions check _after_ actual attaching. Given
that utrace_attach is quite non-trivial operation there is no way such
ordering should be allowed -- the following program should crash the box
in less than second.

Something like: ./a.out $(pidof syslogd)

#include <stdlib.h>
#include <sys/ptrace.h>

int main(int argc, char *argv[])
{
	int pid = atoi(argv[1]);

	while (1)
		ptrace(PTRACE_ATTACH, pid, NULL, NULL);
	return 0;
}

Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP: 
 [<0000000000000000>]
PGD 17f3ed067 PUD 17ec80067 PMD 0 
Oops: 0010 [1] PREEMPT SMP 
CPU 1 
Modules linked in: rtc
Pid: 400, comm: udevd Not tainted 2.6.23-rc1-utrace #1
RIP: 0010:[<0000000000000000>]  [<0000000000000000>]
RSP: 0000:ffff81017ee4dcb0  EFLAGS: 00010202
RAX: ffffffff803cdbe0 RBX: ffff81017dfd6350 RCX: ffff81017f7a3080
RDX: 0000000000000000 RSI: ffff81017f7a3080 RDI: ffff81017dfd6350
RBP: 0000000000000021 R08: 0000000000000038 R09: ffffffff8025145e
R10: 0000000000000007 R11: 0000000000000246 R12: 0000000000000020
R13: ffff81017f7a3080 R14: ffff81017ee4def8 R15: ffff81017ecca6e0
FS:  00002b98567836d0(0000) GS:ffff81017fc017c0(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 000000017f1e9000 CR4: 00000000000006a0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process udevd (pid: 400, threadinfo ffff81017ee4c000, task ffff81017f7a3080)
Stack:  ffffffff8025058d ffff81017ecca6f0 ffff81017f7a3080 ffff81017ecca6e0
 0000000000000007 ffff81017ee4dd58 ffff81017ee4def8 ffff81017ee4de78
 ffffffff802506b5 ffff81017ecca700 ffff81017f7a3080 0000000000000007
Call Trace:
 [<ffffffff8025058d>] report_quiescent+0x36/0x13c
 [<ffffffff802506b5>] utrace_quiescent+0x22/0x215
 [<ffffffff80251882>] utrace_get_signal+0x513/0x576
 [<ffffffff802347b1>] get_signal_to_deliver+0x10c/0x3d3
 [<ffffffff8020abe5>] do_notify_resume+0x9c/0x728
 [<ffffffff803b96e0>] _spin_unlock_irqrestore+0x3d/0x69
 [<ffffffff80242d7f>] trace_hardirqs_on+0x116/0x13a
 [<ffffffff803b96ec>] _spin_unlock_irqrestore+0x49/0x69
 [<ffffffff803b8d35>] trace_hardirqs_on_thunk+0x35/0x37
 [<ffffffff80242d7f>] trace_hardirqs_on+0x116/0x13a
 [<ffffffff8020bca6>] retint_signal+0x46/0x90


Code:  Bad RIP value.
RIP  [<0000000000000000>]
 RSP <ffff81017ee4dcb0>
CR2: 0000000000000000
Kernel panic - not syncing: Fatal exception


NOTE, NOTE, NOTE: this is exactly same backtrace as in
"dead_engine_ops vs engine->flags" race, so it's reproducable :)

If by some miracle RHEL5 will also be patched, closes
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=245735

Signed-off-by: Alexey Dobriyan <adobriyan at sw.ru>
---

 kernel/ptrace.c |   42 +++++++++++++++++++++---------------------
 1 file changed, 21 insertions(+), 21 deletions(-)

--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -327,6 +327,8 @@ static int ptrace_attach(struct task_struct *task)
 		goto bad;
 	if (!task->mm)		/* kernel threads */
 		goto bad;
+	if (!ptrace_may_attach(task))
+		goto bad;
 
 	pr_debug("%d ptrace_attach %d state %lu exit_code %x\n",
 		 current->pid, task->pid, task->state, task->exit_code);
@@ -346,29 +348,27 @@ static int ptrace_attach(struct task_struct *task)
 		 current->pid, task->pid, task->state, task->exit_code);
 
 	NO_LOCKS;
-	if (ptrace_may_attach(task)) {
-		state = ptrace_setup(task, engine, current, 0,
-				     capable(CAP_SYS_PTRACE));
-		if (IS_ERR(state))
-			retval = PTR_ERR(state);
-		else {
-			retval = ptrace_setup_finish(task, state);
+	state = ptrace_setup(task, engine, current, 0,
+			capable(CAP_SYS_PTRACE));
+	if (IS_ERR(state))
+		retval = PTR_ERR(state);
+	else {
+		retval = ptrace_setup_finish(task, state);
 
-			pr_debug("%d ptrace_attach %d after ptrace_update (%d)"
-				 " %lu exit_code %x\n",
-				 current->pid, task->pid, retval,
-				 task->state, task->exit_code);
+		pr_debug("%d ptrace_attach %d after ptrace_update (%d)"
+				" %lu exit_code %x\n",
+				current->pid, task->pid, retval,
+				task->state, task->exit_code);
 
-			if (retval) {
-				/*
-				 * It died before we enabled any callbacks.
-				 */
-				if (retval == -EALREADY)
-					retval = -ESRCH;
-				BUG_ON(retval != -ESRCH);
-				ptrace_state_unlink(state);
-				ptrace_done(state);
-			}
+		if (retval) {
+			/*
+			 * It died before we enabled any callbacks.
+			 */
+			if (retval == -EALREADY)
+				retval = -ESRCH;
+			BUG_ON(retval != -ESRCH);
+			ptrace_state_unlink(state);
+			ptrace_done(state);
 		}
 	}
 	NO_LOCKS;


From ananth at in.ibm.com  Thu Aug  2 13:20:35 2007
From: ananth at in.ibm.com (Ananth N Mavinakayanahalli)
Date: Thu, 2 Aug 2007 18:50:35 +0530
Subject: utrace known bugs
In-Reply-To: <20070802123112.GD6553@localhost.sw.ru>
References: <20070802055540.037754D04B9@magilla.localdomain>
	<20070802123112.GD6553@localhost.sw.ru>
Message-ID: <20070802132035.GA13158@in.ibm.com>

On Thu, Aug 02, 2007 at 04:31:12PM +0400, Alexey Dobriyan wrote:
> On Wed, Aug 01, 2007 at 10:55:39PM -0700, Roland McGrath wrote:
> > Here are outstanding problems (aside from TODO items) I know of off hand.
> > These range from bug reports I haven't looked into, to subtle old XXX
> > comments for which I'll have to dredge up memories of the details.
> > I hope someone likes to wikify or otherwise organize this list somewhere.
> > 
> > * detach vs report race
> >   Alexey Dobriyan <adobriyan at sw.ru> identified a race viz
> >   dead_engine_ops vs engine->flags
> >   Race explanation makes sense, but so far never been reproduced.
> >   Needs a little thought.
> > 
> > * https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=243534
> >   Unknown wedge, maybe stale report
> > 
> > * https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=248532
> >   "tkill(SIGCONT) is not reported by waitpid()"
> >   Unexamined
> > 
> > * ia64 RBS scheme
> >   This is tortured IA64-specific issue that was known from the start but
> >   never properly handled.  The original contributors of the IA64 port
> >   dropped the ball on this part of the implementation.  It needs someone
> >   with coherent understanding of the ia64 RBS hardware and how the kernel
> >   uses it, to consult.
> > 
> > * https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=207002
> >   crash, needs investigation
> > 
> > * utrace_inject_signal ENOSYS for non-utrace_get_signal case
> >   Needs implementation with careful synchronization.
> >   Probably interface details change for "engine interaction" TODO item,
> >   making implementation story here different.
> > 
> > * ptrace race conditions
> >   See XXX in kernel/ptrace.c; need to rethink synchronization/life-cycle
> >   for ptrace_state.
> 
>   * unbounded utrace_engine_cache growth
>     started from 31a9ef5cfcdbae804e3e180c158bf2352728765a,
>     nobody knows why
>     testcase: at the end of  http://marc.info/?l=linux-kernel&m=117128445312243&w=2
> 
>   * _pointer_ to struct utrace, which I personally count as design bug.
>   
>     Rationale to fold struct utrace into task_struct is that lifetime
>     rules of task_struct are well established, well tested and so on. As
>     was demonstrated it also removes much complexity from attaching logic.
> 
> There is one more quick crash in rh bugzilla, but I'll post patch here very soon.

Have you tested this with the latest utrace bits? AFAICS this specific
issue was fixed in June, with this one-liner:

--- a/include/linux/tracehook.h
+++ b/include/linux/tracehook.h
@@ -584,7 +584,7 @@ static inline void tracehook_report_deat
 {
        smp_mb();
        if (tsk_utrace_flags(tsk) & (UTRACE_EVENT(DEATH)
-                                    | UTRACE_ACTION_QUIESCE))
+                                    | UTRACE_EVENT(QUIESCE)))
                utrace_report_death(tsk, death_cookie);
 }

Ananth


From adobriyan at sw.ru  Thu Aug  2 14:31:43 2007
From: adobriyan at sw.ru (Alexey Dobriyan)
Date: Thu, 2 Aug 2007 18:31:43 +0400
Subject: utrace known bugs
In-Reply-To: <20070802132035.GA13158@in.ibm.com>
References: <20070802055540.037754D04B9@magilla.localdomain>
	<20070802123112.GD6553@localhost.sw.ru>
	<20070802132035.GA13158@in.ibm.com>
Message-ID: <20070802143143.GF6553@localhost.sw.ru>

On Thu, Aug 02, 2007 at 06:50:35PM +0530, Ananth N Mavinakayanahalli wrote:
> On Thu, Aug 02, 2007 at 04:31:12PM +0400, Alexey Dobriyan wrote:
> > On Wed, Aug 01, 2007 at 10:55:39PM -0700, Roland McGrath wrote:
> > > Here are outstanding problems (aside from TODO items) I know of off hand.
> > > These range from bug reports I haven't looked into, to subtle old XXX
> > > comments for which I'll have to dredge up memories of the details.
> > > I hope someone likes to wikify or otherwise organize this list somewhere.
> > > 
> > > * detach vs report race
> > >   Alexey Dobriyan <adobriyan at sw.ru> identified a race viz
> > >   dead_engine_ops vs engine->flags
> > >   Race explanation makes sense, but so far never been reproduced.
> > >   Needs a little thought.
> > > 
> > > * https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=243534
> > >   Unknown wedge, maybe stale report
> > > 
> > > * https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=248532
> > >   "tkill(SIGCONT) is not reported by waitpid()"
> > >   Unexamined
> > > 
> > > * ia64 RBS scheme
> > >   This is tortured IA64-specific issue that was known from the start but
> > >   never properly handled.  The original contributors of the IA64 port
> > >   dropped the ball on this part of the implementation.  It needs someone
> > >   with coherent understanding of the ia64 RBS hardware and how the kernel
> > >   uses it, to consult.
> > > 
> > > * https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=207002
> > >   crash, needs investigation
> > > 
> > > * utrace_inject_signal ENOSYS for non-utrace_get_signal case
> > >   Needs implementation with careful synchronization.
> > >   Probably interface details change for "engine interaction" TODO item,
> > >   making implementation story here different.
> > > 
> > > * ptrace race conditions
> > >   See XXX in kernel/ptrace.c; need to rethink synchronization/life-cycle
> > >   for ptrace_state.
> > 
> >   * unbounded utrace_engine_cache growth
> >     started from 31a9ef5cfcdbae804e3e180c158bf2352728765a,
> >     nobody knows why
> >     testcase: at the end of  http://marc.info/?l=linux-kernel&m=117128445312243&w=2
> > 
> >   * _pointer_ to struct utrace, which I personally count as design bug.
> >   
> >     Rationale to fold struct utrace into task_struct is that lifetime
> >     rules of task_struct are well established, well tested and so on. As
> >     was demonstrated it also removes much complexity from attaching logic.
> > 
> > There is one more quick crash in rh bugzilla, but I'll post patch here very soon.
> 
> Have you tested this with the latest utrace bits?

struct utrace double free? I can't because of "unbounded utrace_engine_cache growth".
"while (1) ptrace(PTRACE_ATTACH," crash -- yes it was tested with latest utrace.

> AFAICS this specific
> issue was fixed in June, with this one-liner:
> 
> --- a/include/linux/tracehook.h
> +++ b/include/linux/tracehook.h
> @@ -584,7 +584,7 @@ static inline void tracehook_report_deat
>  {
>         smp_mb();
>         if (tsk_utrace_flags(tsk) & (UTRACE_EVENT(DEATH)
> -                                    | UTRACE_ACTION_QUIESCE))
> +                                    | UTRACE_EVENT(QUIESCE)))
>                 utrace_report_death(tsk, death_cookie);
>  }


From roland at redhat.com  Fri Aug  3 04:16:52 2007
From: roland at redhat.com (Roland McGrath)
Date: Thu,  2 Aug 2007 21:16:52 -0700 (PDT)
Subject: uprobes: keep thread blocked after the handler returns
In-Reply-To: Jim Keniston's message of  Thursday, 2 August 2007 13:57:05 -0700
	<1186088225.4521.52.camel@dyn9047018076.beaverton.ibm.com>
Message-ID: <20070803041652.ECF254D04B5@magilla.localdomain>

[This discussion started on the systemtap at sourceware.org mailing list.]

> On Thu, 2007-08-02 at 14:19 -0700, Roland McGrath wrote:
> > > Why not just prevent the handler from returning until you want the 
> > > thread to unblock? The handler could sleep on something trigger from 
> > > user space.
> > 
> > That is a no-no for a utrace callback.  It prevents other tracing
> > facilities from doing anything useful.
> 
> Yeah, you're pretty adamant about that in Documentation/utrace.txt, but
> I always wondered why.  Using gdb, I can stop a traced app and keep it
> stopped all weekend.  If nobody else is using the app, why should
> anybody care?

You can certainly keep the application stopped all year.
You just have to do it the right way, by setting QUIESCE.

What you can't do is hijack a user thread's scheduling context for
your own purposes.  That is not friendly.  It prevents any other
tracing engine (that is following the rules) from examining the
thread, because it is not properly quiescent, just blocked somewhere
in the kernel.  If other tracing engines detach while you block, it
prevents the data structures being cleaned up, for the same reason.
It prevents anyone poking around the system with ps or whatnot from
seeing a proper "T (tracing stop)" status to indicate why that
process has not progressed all weekend.  It breaks SIGKILL.  If you
do much work, other than block, it charges all those resources to the
user thread instead of explicitly to instrumentation code, which may
or may not matter, but is always relevant to the general issue.

One of the key purposes of utrace is to make it easier not to screw
up in all these ways.  Look, bottom line, you are writing kernel
module code, in an enforcement sense you already have carte blanche.
But if you want me to change my definition of "well-behaved", it
ain't gonna happen.

> Sure, utrace allows multiple facilities to trace the same app, and we
> may see circumstances where multiple users skew each others' results by
> tracing the same app with blocking handlers.  That doesn't seem like a
> very good reason for preventing utrace handlers from doing useful stuff
> that can't be accomplished without blocking.

This misses most of the reasons well-behaved callbacks are important.
(For the executive summary, "breaks SIGKILL" is all anyone really
needs to know.  Any debugging facility on Linux is broken by
definition if it can ever prevent the timely completion of death by SIGKILL.)

It also gives the impression that you think this API discipline for
kernel modules in some way constrains what control you can exercise
over user threads.  That is not so (except you can't break SIGKILL).

The utrace callbacks are the moral equivalent of interrupt handlers,
ones that run with all other interrupts disabled.  The handler utterly
monopolizes the resource in question until it returns, in this case
the thread rather than the CPU.  They are not literally interrupt
handlers, and by design run in a context that is as "safe" as it gets
in kernel mode.  But by analogy, considering the potential to break
the system's behavior overall, they are in that vein.  Think of
writing a tracing engine as like writing a device driver.  The devices
you manipulate are user threads.  You can put the device into a
"frozen" state so it doesn't run off changing its state, but you don't
do it by blocking in the interrupt handler.

If instead of utrace callbacks, I had given you only a queued-event
interface via something like an in-kernel socket, with "keep the
thread quiescent until I dequeue this event and reply" being the only
"immediate action" option, I suspect you would not be complaining.
In either case, you have to structure your code in the same way.  The
main control flow of *your* logic takes place in a context *you*
provide, either the calling thread from the userland debugger doing
some control operation, or a kernel service thread you create.  In
this case, you also have to write the utrace callbacks that queue
events or post wakeups for your main control code, as in a device
interrupt handler.

utrace is the *lowest layer*, it's not *the* interface.  Perhaps most
uses want something somewhere on the spectrum closer to the
queued-event model, i.e. an interface giving a small set of things to
do immediately on an event (that can be implemented in well-behaved
ways) and making higher-level synchronization easy with your main
control interface and with simple things like quiescing multiple threads.
(I intend to produce a higher layer doing those things, but that is
not the utrace layer.)

My analogy to interrupt handlers is slightly dramatic.  In fact,
you can freely do all sorts of things in utrace callbacks.  It's
fine to do small blocks for memory allocation, even paging to
access user memory, maybe even some i/o if you're sure it's quick.
Anything that the thread could do itself from user mode, or in
kernel code that is "instantaneous" (or at least uninterruptible)
from the user-mode perspective.  You can wait for a mutex/sem for
some subsystem data or some lock of your own.  If you have a
locking bug and block forever there, it's not the end of the world.
(It will break SIGKILL, and make the system take forever trying to
do a graceful shutdown when you try to reboot.)  Just no blocking
until Tuesday, nor anything that is not close-by in your direct
control with straightforward reasons to believe it won't ever block
arbitrarily.  Any kind of blocking as an explicit synchronization
mechanism for the thread itself does not qualify.

These are called rules for well-behaved tracing engines.  They are
there to help you mitigate the impact of bugs you can write, be
kind to the environment, play well with others, etc.  If you break
the rules in a prototype, noone will send you to the stockade.


Please help me with suggestions on how Documentation/utrace.txt can
present this issue better.  It seems it's managing to communicate
"Roland doesn't want you to do it", but not why it's good for you,
and for motherhood and apple pie, national security, and global
child welfare, to refrain.


Thanks,
Roland


From shaohua.li at intel.com  Tue Aug  7 05:23:49 2007
From: shaohua.li at intel.com (Shaohua Li)
Date: Tue, 07 Aug 2007 13:23:49 +0800
Subject: [PATCH]utrace: IA64 RSE bug
Message-ID: <1186464229.784.9.camel@sli10-conroe.sh.intel.com>

In ptrace case, user space RSE might be newer than kernel RSE. To avoid
stale RSE is used when return to userspace, this patch synchronize user
space RSE to kernel RSE.
Also, as TIF_ALLWORK_MASK bits are limited, TIF_NOTIFY_RESUME is
overrided.

patch is against latest utrace source (2.6.22 base + utrace patches)

Signed-off-by: Bibo Mao<bibo.mao at intel.com>
Signed-off-by: Shaohua Li<shaohua.li at intel.com> 

==============================================================
 arch/ia64/kernel/perfmon.c     |   21 ++--------------
 arch/ia64/kernel/process.c     |   14 +++++++++++
 arch/ia64/kernel/ptrace.c      |   52 +++++++++++++++++++++++++++++++++++++++++
 include/asm-ia64/ptrace.h      |    1 
 include/asm-ia64/thread_info.h |    4 +++
 5 files changed, 74 insertions(+), 18 deletions(-)

Index: 2.6.22/arch/ia64/kernel/process.c
===================================================================
--- 2.6.22.orig/arch/ia64/kernel/process.c	2007-08-07 10:48:04.000000000 +0800
+++ 2.6.22/arch/ia64/kernel/process.c	2007-08-07 11:40:42.000000000 +0800
@@ -154,6 +154,17 @@ show_regs (struct pt_regs *regs)
 		show_stack(NULL, NULL);
 }
 
+void tsk_clear_notify_resume(struct task_struct *tsk)
+{
+#ifdef CONFIG_PERFMON
+	if (tsk->thread.pfm_needs_checking)
+		return;
+#endif
+	if (test_ti_thread_flag(task_thread_info(tsk), TIF_RESTORE_RSE))
+		return;
+	clear_ti_thread_flag(task_thread_info(tsk), TIF_NOTIFY_RESUME);
+}
+
 void
 do_notify_resume_user (sigset_t *unused, struct sigscratch *scr, long in_syscall)
 {
@@ -172,6 +183,9 @@ do_notify_resume_user (sigset_t *unused,
 	/* deal with pending signal delivery */
 	if (test_thread_flag(TIF_SIGPENDING)||test_thread_flag(TIF_RESTORE_SIGMASK))
 		ia64_do_signal(scr, in_syscall);
+	/* copy user rbs to kernel rbs */
+	if (unlikely(test_thread_flag(TIF_RESTORE_RSE)))
+		ia64_sync_krbs(current);
 }
 
 static int pal_halt        = 1;
Index: 2.6.22/arch/ia64/kernel/ptrace.c
===================================================================
--- 2.6.22.orig/arch/ia64/kernel/ptrace.c	2007-08-07 10:48:04.000000000 +0800
+++ 2.6.22/arch/ia64/kernel/ptrace.c	2007-08-07 12:54:26.000000000 +0800
@@ -554,6 +554,25 @@ ia64_sync_user_rbs (struct task_struct *
 	return 0;
 }
 
+long
+ia64_sync_kernel_rbs (struct task_struct *child, struct switch_stack *sw,
+                    unsigned long user_rbs_start, unsigned long user_rbs_end)
+{
+        unsigned long addr, val;
+        long ret;
+
+        /* now copy word for word from user rbs to kernel rbs: */
+        for (addr = user_rbs_start; addr < user_rbs_end; addr += 8) {
+                if (access_process_vm(child, addr, &val, sizeof(val), 0)
+                    != sizeof(val))
+                        return -EIO;
+                ret = ia64_poke(child, sw, user_rbs_end, addr, val);
+                if (ret < 0)
+                        return ret;
+        }
+        return 0;
+}
+
 /*
  * Write f32-f127 back to task->thread.fph if it has been modified.
  */
@@ -728,6 +747,10 @@ syscall_trace_enter (long arg0, long arg
 	if (test_thread_flag(TIF_SYSCALL_TRACE))
 		tracehook_report_syscall(&regs, 0);
 
+	/* copy user rbs to kernel rbs */
+        if (test_thread_flag(TIF_RESTORE_RSE))
+                ia64_sync_krbs(current);
+
 	if (unlikely(current->audit_context)) {
 		long syscall;
 		int arch;
@@ -764,6 +787,10 @@ syscall_trace_leave (long arg0, long arg
 	if (test_thread_flag(TIF_SYSCALL_TRACE))
 		tracehook_report_syscall(&regs, 1);
 
+	/* copy user rbs to kernel rbs */
+        if (test_thread_flag(TIF_RESTORE_RSE))
+                ia64_sync_krbs(current);
+
 	if (test_thread_flag(TIF_SINGLESTEP)) {
 		force_sig(SIGTRAP, current); /* XXX */
 		tracehook_report_syscall_step(&regs);
@@ -1416,9 +1443,34 @@ gpregs_writeback(struct task_struct *tar
 		 const struct utrace_regset *regset,
 		 int now)
 {
+	if (test_and_set_tsk_thread_flag(target, TIF_RESTORE_RSE))
+		return 0;
+	tsk_set_notify_resume(target);
 	return do_regset_call(do_gpregs_writeback, target, regset, 0, 0, NULL, NULL);
 }
 
+static void do_gpregs_readback(struct unw_frame_info *info, void *arg)
+{
+	struct pt_regs *pt;
+	utrace_getset_t *dst = arg;
+	unsigned long urbs_end;
+
+	if (unw_unwind_to_user(info) < 0)
+		return;
+	pt = task_pt_regs(dst->target);
+	urbs_end = ia64_get_user_rbs_end(dst->target, pt, NULL);
+	dst->ret = ia64_sync_kernel_rbs(dst->target, info->sw, pt->ar_bspstore, urbs_end);
+}
+/*
+ * This is called to read back the register backing store.
+ */
+long ia64_sync_krbs(struct task_struct *target)
+{
+	clear_tsk_thread_flag(target, TIF_RESTORE_RSE);
+	tsk_clear_notify_resume(target);
+	return do_regset_call(do_gpregs_readback, target, NULL, 0, 0, NULL, NULL);
+}
+
 static int
 fpregs_active(struct task_struct *target, const struct utrace_regset *regset)
 {
Index: 2.6.22/include/asm-ia64/ptrace.h
===================================================================
--- 2.6.22.orig/include/asm-ia64/ptrace.h	2007-08-07 10:48:04.000000000 +0800
+++ 2.6.22/include/asm-ia64/ptrace.h	2007-08-07 10:48:21.000000000 +0800
@@ -292,6 +292,7 @@ struct switch_stack {
 			 unsigned long, long);
   extern void ia64_flush_fph (struct task_struct *);
   extern void ia64_sync_fph (struct task_struct *);
+  extern long ia64_sync_krbs(struct task_struct *);
   extern long ia64_sync_user_rbs (struct task_struct *, struct switch_stack *,
 				  unsigned long, unsigned long);
 
Index: 2.6.22/include/asm-ia64/thread_info.h
===================================================================
--- 2.6.22.orig/include/asm-ia64/thread_info.h	2007-08-07 10:48:04.000000000 +0800
+++ 2.6.22/include/asm-ia64/thread_info.h	2007-08-07 11:08:45.000000000 +0800
@@ -71,6 +71,9 @@ struct thread_info {
 #define alloc_task_struct()	((struct task_struct *)__get_free_pages(GFP_KERNEL | __GFP_COMP, KERNEL_STACK_SIZE_ORDER))
 #define free_task_struct(tsk)	free_pages((unsigned long) (tsk), KERNEL_STACK_SIZE_ORDER)
 
+#define tsk_set_notify_resume(tsk) \
+	set_ti_thread_flag(task_thread_info(tsk), TIF_NOTIFY_RESUME)
+extern void tsk_clear_notify_resume(struct task_struct *tsk);
 #endif /* !__ASSEMBLY */
 
 /*
@@ -86,6 +89,7 @@ struct thread_info {
 #define TIF_SYSCALL_AUDIT	4	/* syscall auditing active */
 #define TIF_SINGLESTEP		5	/* restore singlestep on return to user mode */
 #define TIF_RESTORE_SIGMASK	6	/* restore signal mask in do_signal() */
+#define TIF_RESTORE_RSE		9	/* task need RSE synchronization */
 #define TIF_POLLING_NRFLAG	16	/* true if poll_idle() is polling TIF_NEED_RESCHED */
 #define TIF_MEMDIE		17
 #define TIF_MCA_INIT		18	/* this task is processing MCA or INIT */
Index: 2.6.22/arch/ia64/kernel/perfmon.c
===================================================================
--- 2.6.22.orig/arch/ia64/kernel/perfmon.c	2007-08-07 10:48:04.000000000 +0800
+++ 2.6.22/arch/ia64/kernel/perfmon.c	2007-08-07 11:01:07.000000000 +0800
@@ -586,21 +586,6 @@ pfm_put_task(struct task_struct *task)
 }
 
 static inline void
-pfm_set_task_notify(struct task_struct *task)
-{
-	struct thread_info *info;
-
-	info = (struct thread_info *) ((char *) task + IA64_TASK_SIZE);
-	set_bit(TIF_NOTIFY_RESUME, &info->flags);
-}
-
-static inline void
-pfm_clear_task_notify(void)
-{
-	clear_thread_flag(TIF_NOTIFY_RESUME);
-}
-
-static inline void
 pfm_reserve_page(unsigned long a)
 {
 	SetPageReserved(vmalloc_to_page((void *)a));
@@ -3730,7 +3715,7 @@ pfm_restart(pfm_context_t *ctx, void *ar
 
 		PFM_SET_WORK_PENDING(task, 1);
 
-		pfm_set_task_notify(task);
+		tsk_set_notify_resume(task);
 
 		/*
 		 * XXX: send reschedule if task runs on another CPU
@@ -5087,7 +5072,7 @@ pfm_handle_work(void)
 
 	PFM_SET_WORK_PENDING(current, 0);
 
-	pfm_clear_task_notify();
+	tsk_clear_notify_resume(current);
 
 	regs = task_pt_regs(current);
 
@@ -5455,7 +5440,7 @@ pfm_overflow_handler(struct task_struct 
 			 * when coming from ctxsw, current still points to the
 			 * previous task, therefore we must work with task and not current.
 			 */
-			pfm_set_task_notify(task);
+			tsk_set_notify_resume(task);
 		}
 		/*
 		 * defer until state is changed (shorten spin window). the context is locked


From roland at redhat.com  Tue Aug  7 07:13:16 2007
From: roland at redhat.com (Roland McGrath)
Date: Tue,  7 Aug 2007 00:13:16 -0700 (PDT)
Subject: [PATCH]utrace: IA64 RSE bug
In-Reply-To: Shaohua Li's message of  Tuesday, 7 August 2007 13:23:49 +0800
	<1186464229.784.9.camel@sli10-conroe.sh.intel.com>
Message-ID: <20070807071316.801494D04C4@magilla.localdomain>

That looks reasonable to me, not being expert on the ia64 details.  
I rolled that into utrace-regset-ia64.patch in the 2.6.22 backport version.

> patch is against latest utrace source (2.6.22 base + utrace patches)

The latest utrace source is based on 2.6.23-rc2, and your patch does not
apply there.  TIF_NOTIFY_RESUME has been renamed to TIF_PERFMON_WORK.
So you need to rename it back (or to something else) to overload it.

If the gist of your patch here seems right to ia64 folks, then I'd suggest
you do a patch adding ia64_sync_krbs to the upstream kernel.  That would be
the necessary prelude to doing the arch_ptrace_stop plan, which I can help
ia64 folks with later.  


Thanks,
Roland


From eranian at hpl.hp.com  Tue Aug  7 21:55:31 2007
From: eranian at hpl.hp.com (Stephane Eranian)
Date: Tue, 7 Aug 2007 14:55:31 -0700
Subject: [PATCH]utrace: IA64 RSE bug
In-Reply-To: <20070807071316.801494D04C4@magilla.localdomain>
References: <1186464229.784.9.camel@sli10-conroe.sh.intel.com>
	<20070807071316.801494D04C4@magilla.localdomain>
Message-ID: <20070807215531.GA23261@frankl.hpl.hp.com>

Hello,

On Tue, Aug 07, 2007 at 12:13:16AM -0700, Roland McGrath wrote:
> That looks reasonable to me, not being expert on the ia64 details.  
> I rolled that into utrace-regset-ia64.patch in the 2.6.22 backport version.
> 
> > patch is against latest utrace source (2.6.22 base + utrace patches)
> 
> The latest utrace source is based on 2.6.23-rc2, and your patch does not
> apply there.  TIF_NOTIFY_RESUME has been renamed to TIF_PERFMON_WORK.
> So you need to rename it back (or to something else) to overload it.
> 

Yes, in 2.6.23, TIF_NOTIFY_RESUME has been removed because it was not used
except for perfmon, thus I renamed it. The problem is that you cannot just
override it because you would need a way to further demultiplex the
reason for call do_notify_resume(). Perfmon not simply relies  on the
TIF_PERFMON_WORK bit being set. As you realize now this flag would
mean you have some RSE work or some perfmon work. 


-- 
-Stephane


From shaohua.li at intel.com  Wed Aug  8 02:30:37 2007
From: shaohua.li at intel.com (Shaohua Li)
Date: Wed, 08 Aug 2007 10:30:37 +0800
Subject: [PATCH]utrace: IA64 RSE bug
In-Reply-To: <20070807071316.801494D04C4@magilla.localdomain>
References: <20070807071316.801494D04C4@magilla.localdomain>
Message-ID: <1186540237.16482.2.camel@sli10-conroe.sh.intel.com>

On Tue, 2007-08-07 at 15:13 +0800, Roland McGrath wrote:
> That looks reasonable to me, not being expert on the ia64 details. 
> I rolled that into utrace-regset-ia64.patch in the 2.6.22 backport
> version.
> 
> > patch is against latest utrace source (2.6.22 base + utrace patches)
> 
> The latest utrace source is based on 2.6.23-rc2, and your patch does
> not
> apply there.  TIF_NOTIFY_RESUME has been renamed to TIF_PERFMON_WORK.
> So you need to rename it back (or to something else) to overload it.
> 
> If the gist of your patch here seems right to ia64 folks, then I'd
> suggest
> you do a patch adding ia64_sync_krbs to the upstream kernel.  That
> would be
> the necessary prelude to doing the arch_ptrace_stop plan, which I can
> help
> ia64 folks with later. 
Ok, this is the patch against 2.6.23-rc2


In ptrace case, user space RSE might be newer than kernel RSE. To avoid
stale RSE is used when return to userspace, this patch synchronize user
space RSE to kernel RSE.
Also, as TIF_ALLWORK_MASK bits are limited, TIF_NOTIFY_RESUME is
overrided.

Signed-off-by: Bibo Mao<bibo.mao at intel.com>
Signed-off-by: Shaohua Li<shaohua.li at intel.com>

==============================================================
 arch/ia64/kernel/perfmon.c     |   21 ++--------------
 arch/ia64/kernel/process.c     |   14 +++++++++++
 arch/ia64/kernel/ptrace.c      |   52 +++++++++++++++++++++++++++++++++++++++++
 include/asm-ia64/ptrace.h      |    1 
 include/asm-ia64/thread_info.h |   11 ++++++--
 5 files changed, 78 insertions(+), 21 deletions(-)

Index: 2.6.23-rc2/arch/ia64/kernel/process.c
===================================================================
--- 2.6.23-rc2.orig/arch/ia64/kernel/process.c	2007-08-04 10:49:55.000000000 +0800
+++ 2.6.23-rc2/arch/ia64/kernel/process.c	2007-08-08 09:27:18.000000000 +0800
@@ -154,6 +154,17 @@ show_regs (struct pt_regs *regs)
 		show_stack(NULL, NULL);
 }
 
+void tsk_clear_notify_resume(struct task_struct *tsk)
+{
+#ifdef CONFIG_PERFMON
+	if (tsk->thread.pfm_needs_checking)
+		return;
+#endif
+	if (test_ti_thread_flag(task_thread_info(tsk), TIF_RESTORE_RSE))
+		return;
+	clear_ti_thread_flag(task_thread_info(tsk), TIF_NOTIFY_RESUME);
+}
+
 void
 do_notify_resume_user (sigset_t *unused, struct sigscratch *scr, long in_syscall)
 {
@@ -172,6 +183,9 @@ do_notify_resume_user (sigset_t *unused,
 	/* deal with pending signal delivery */
 	if (test_thread_flag(TIF_SIGPENDING)||test_thread_flag(TIF_RESTORE_SIGMASK))
 		ia64_do_signal(scr, in_syscall);
+	/* copy user rbs to kernel rbs */
+	if (unlikely(test_thread_flag(TIF_RESTORE_RSE)))
+		ia64_sync_krbs(current);
 }
 
 static int pal_halt        = 1;
Index: 2.6.23-rc2/arch/ia64/kernel/ptrace.c
===================================================================
--- 2.6.23-rc2.orig/arch/ia64/kernel/ptrace.c	2007-08-08 09:00:28.000000000 +0800
+++ 2.6.23-rc2/arch/ia64/kernel/ptrace.c	2007-08-08 09:27:18.000000000 +0800
@@ -554,6 +554,25 @@ ia64_sync_user_rbs (struct task_struct *
 	return 0;
 }
 
+long
+ia64_sync_kernel_rbs (struct task_struct *child, struct switch_stack *sw,
+                    unsigned long user_rbs_start, unsigned long user_rbs_end)
+{
+        unsigned long addr, val;
+        long ret;
+
+        /* now copy word for word from user rbs to kernel rbs: */
+        for (addr = user_rbs_start; addr < user_rbs_end; addr += 8) {
+                if (access_process_vm(child, addr, &val, sizeof(val), 0)
+                    != sizeof(val))
+                        return -EIO;
+                ret = ia64_poke(child, sw, user_rbs_end, addr, val);
+                if (ret < 0)
+                        return ret;
+        }
+        return 0;
+}
+
 /*
  * Write f32-f127 back to task->thread.fph if it has been modified.
  */
@@ -728,6 +747,10 @@ syscall_trace_enter (long arg0, long arg
 	if (test_thread_flag(TIF_SYSCALL_TRACE))
 		tracehook_report_syscall(&regs, 0);
 
+	/* copy user rbs to kernel rbs */
+        if (test_thread_flag(TIF_RESTORE_RSE))
+                ia64_sync_krbs(current);
+
 	if (unlikely(current->audit_context)) {
 		long syscall;
 		int arch;
@@ -764,6 +787,10 @@ syscall_trace_leave (long arg0, long arg
 	if (test_thread_flag(TIF_SYSCALL_TRACE))
 		tracehook_report_syscall(&regs, 1);
 
+	/* copy user rbs to kernel rbs */
+        if (test_thread_flag(TIF_RESTORE_RSE))
+                ia64_sync_krbs(current);
+
 	if (test_thread_flag(TIF_SINGLESTEP)) {
 		force_sig(SIGTRAP, current); /* XXX */
 		tracehook_report_syscall_step(&regs);
@@ -1416,9 +1443,34 @@ gpregs_writeback(struct task_struct *tar
 		 const struct utrace_regset *regset,
 		 int now)
 {
+	if (test_and_set_tsk_thread_flag(target, TIF_RESTORE_RSE))
+		return 0;
+	tsk_set_notify_resume(target);
 	return do_regset_call(do_gpregs_writeback, target, regset, 0, 0, NULL, NULL);
 }
 
+static void do_gpregs_readback(struct unw_frame_info *info, void *arg)
+{
+	struct pt_regs *pt;
+	utrace_getset_t *dst = arg;
+	unsigned long urbs_end;
+
+	if (unw_unwind_to_user(info) < 0)
+		return;
+	pt = task_pt_regs(dst->target);
+	urbs_end = ia64_get_user_rbs_end(dst->target, pt, NULL);
+	dst->ret = ia64_sync_kernel_rbs(dst->target, info->sw, pt->ar_bspstore, urbs_end);
+}
+/*
+ * This is called to read back the register backing store.
+ */
+long ia64_sync_krbs(struct task_struct *target)
+{
+	clear_tsk_thread_flag(target, TIF_RESTORE_RSE);
+	tsk_clear_notify_resume(target);
+	return do_regset_call(do_gpregs_readback, target, NULL, 0, 0, NULL, NULL);
+}
+
 static int
 fpregs_active(struct task_struct *target, const struct utrace_regset *regset)
 {
Index: 2.6.23-rc2/include/asm-ia64/ptrace.h
===================================================================
--- 2.6.23-rc2.orig/include/asm-ia64/ptrace.h	2007-08-04 10:49:55.000000000 +0800
+++ 2.6.23-rc2/include/asm-ia64/ptrace.h	2007-08-08 09:27:18.000000000 +0800
@@ -292,6 +292,7 @@ struct switch_stack {
 			 unsigned long, long);
   extern void ia64_flush_fph (struct task_struct *);
   extern void ia64_sync_fph (struct task_struct *);
+  extern long ia64_sync_krbs(struct task_struct *);
   extern long ia64_sync_user_rbs (struct task_struct *, struct switch_stack *,
 				  unsigned long, unsigned long);
 
Index: 2.6.23-rc2/include/asm-ia64/thread_info.h
===================================================================
--- 2.6.23-rc2.orig/include/asm-ia64/thread_info.h	2007-08-04 10:49:55.000000000 +0800
+++ 2.6.23-rc2/include/asm-ia64/thread_info.h	2007-08-08 09:35:08.000000000 +0800
@@ -71,6 +71,9 @@ struct thread_info {
 #define alloc_task_struct()	((struct task_struct *)__get_free_pages(GFP_KERNEL | __GFP_COMP, KERNEL_STACK_SIZE_ORDER))
 #define free_task_struct(tsk)	free_pages((unsigned long) (tsk), KERNEL_STACK_SIZE_ORDER)
 
+#define tsk_set_notify_resume(tsk) \
+	set_ti_thread_flag(task_thread_info(tsk), TIF_NOTIFY_RESUME)
+extern void tsk_clear_notify_resume(struct task_struct *tsk);
 #endif /* !__ASSEMBLY */
 
 /*
@@ -85,28 +88,30 @@ struct thread_info {
 #define TIF_SYSCALL_AUDIT	3	/* syscall auditing active */
 #define TIF_SINGLESTEP		4	/* restore singlestep on return to user mode */
 #define TIF_RESTORE_SIGMASK	5	/* restore signal mask in do_signal() */
-#define TIF_PERFMON_WORK	6	/* work for pfm_handle_work() */
+#define TIF_NOTIFY_RESUME	6	/* resumption notification requested */
 #define TIF_POLLING_NRFLAG	16	/* true if poll_idle() is polling TIF_NEED_RESCHED */
 #define TIF_MEMDIE		17
 #define TIF_MCA_INIT		18	/* this task is processing MCA or INIT */
 #define TIF_DB_DISABLED		19	/* debug trap disabled for fsyscall */
 #define TIF_FREEZE		20	/* is freezing for suspend */
+#define TIF_RESTORE_RSE		21	/* task need RSE synchronization */
 
 #define _TIF_SYSCALL_TRACE	(1 << TIF_SYSCALL_TRACE)
 #define _TIF_SYSCALL_AUDIT	(1 << TIF_SYSCALL_AUDIT)
 #define _TIF_SINGLESTEP		(1 << TIF_SINGLESTEP)
 #define _TIF_SYSCALL_TRACEAUDIT	(_TIF_SYSCALL_TRACE|_TIF_SYSCALL_AUDIT|_TIF_SINGLESTEP)
 #define _TIF_RESTORE_SIGMASK	(1 << TIF_RESTORE_SIGMASK)
-#define _TIF_PERFMON_WORK	(1 << TIF_PERFMON_WORK)
+#define _TIF_NOTIFY_RESUME	(1 << TIF_NOTIFY_RESUME)
 #define _TIF_SIGPENDING		(1 << TIF_SIGPENDING)
 #define _TIF_NEED_RESCHED	(1 << TIF_NEED_RESCHED)
 #define _TIF_POLLING_NRFLAG	(1 << TIF_POLLING_NRFLAG)
 #define _TIF_MCA_INIT		(1 << TIF_MCA_INIT)
 #define _TIF_DB_DISABLED	(1 << TIF_DB_DISABLED)
 #define _TIF_FREEZE		(1 << TIF_FREEZE)
+#define _TIF_RESTORE_RSE	(1 << TIF_RESTORE_RSE)
 
 /* "work to do on user-return" bits */
-#define TIF_ALLWORK_MASK	(_TIF_SIGPENDING|_TIF_PERFMON_WORK|_TIF_SYSCALL_AUDIT|\
+#define TIF_ALLWORK_MASK	(_TIF_SIGPENDING|_TIF_NOTIFY_RESUME|_TIF_SYSCALL_AUDIT|\
 				 _TIF_NEED_RESCHED| _TIF_SYSCALL_TRACE|\
 				 _TIF_RESTORE_SIGMASK)
 /* like TIF_ALLWORK_BITS but sans TIF_SYSCALL_TRACE or TIF_SYSCALL_AUDIT */
Index: 2.6.23-rc2/arch/ia64/kernel/perfmon.c
===================================================================
--- 2.6.23-rc2.orig/arch/ia64/kernel/perfmon.c	2007-08-04 10:49:55.000000000 +0800
+++ 2.6.23-rc2/arch/ia64/kernel/perfmon.c	2007-08-08 09:32:41.000000000 +0800
@@ -586,21 +586,6 @@ pfm_put_task(struct task_struct *task)
 }
 
 static inline void
-pfm_set_task_notify(struct task_struct *task)
-{
-	struct thread_info *info;
-
-	info = (struct thread_info *) ((char *) task + IA64_TASK_SIZE);
-	set_bit(TIF_PERFMON_WORK, &info->flags);
-}
-
-static inline void
-pfm_clear_task_notify(void)
-{
-	clear_thread_flag(TIF_PERFMON_WORK);
-}
-
-static inline void
 pfm_reserve_page(unsigned long a)
 {
 	SetPageReserved(vmalloc_to_page((void *)a));
@@ -3730,7 +3715,7 @@ pfm_restart(pfm_context_t *ctx, void *ar
 
 		PFM_SET_WORK_PENDING(task, 1);
 
-		pfm_set_task_notify(task);
+		tsk_set_notify_resume(task);
 
 		/*
 		 * XXX: send reschedule if task runs on another CPU
@@ -5087,7 +5072,7 @@ pfm_handle_work(void)
 
 	PFM_SET_WORK_PENDING(current, 0);
 
-	pfm_clear_task_notify();
+	tsk_clear_notify_resume(current);
 
 	regs = task_pt_regs(current);
 
@@ -5455,7 +5440,7 @@ pfm_overflow_handler(struct task_struct 
 			 * when coming from ctxsw, current still points to the
 			 * previous task, therefore we must work with task and not current.
 			 */
-			pfm_set_task_notify(task);
+			tsk_set_notify_resume(task);
 		}
 		/*
 		 * defer until state is changed (shorten spin window). the context is locked


From eranian at hpl.hp.com  Wed Aug  8 23:10:59 2007
From: eranian at hpl.hp.com (Stephane Eranian)
Date: Wed, 8 Aug 2007 16:10:59 -0700
Subject: [PATCH]utrace: IA64 RSE bug
In-Reply-To: <1186540237.16482.2.camel@sli10-conroe.sh.intel.com>
References: <20070807071316.801494D04C4@magilla.localdomain>
	<1186540237.16482.2.camel@sli10-conroe.sh.intel.com>
Message-ID: <20070808231059.GC25934@frankl.hpl.hp.com>

Hi,

Well, looks like we need to resurect TIF_NOTIFY_RESUME now
that we do have some actual need for it. This implies that
I need to rework my perfmon new code base to take this into 
account. I don't want to use pfm_needs_checking. So maybe
it is okay to use a higher order TIF bit for perfmon on IA-64
combined with TIF_NOTIFY_RESUME.

For the RSE part, I am not sure I understand the logic behind this
change. I think more explanantions maybe needed to calrify this
a bit more.


On Wed, Aug 08, 2007 at 10:30:37AM +0800, Shaohua Li wrote:
> On Tue, 2007-08-07 at 15:13 +0800, Roland McGrath wrote:
> > That looks reasonable to me, not being expert on the ia64 details. 
> > I rolled that into utrace-regset-ia64.patch in the 2.6.22 backport
> > version.
> > 
> > > patch is against latest utrace source (2.6.22 base + utrace patches)
> > 
> > The latest utrace source is based on 2.6.23-rc2, and your patch does
> > not
> > apply there.  TIF_NOTIFY_RESUME has been renamed to TIF_PERFMON_WORK.
> > So you need to rename it back (or to something else) to overload it.
> > 
> > If the gist of your patch here seems right to ia64 folks, then I'd
> > suggest
> > you do a patch adding ia64_sync_krbs to the upstream kernel.  That
> > would be
> > the necessary prelude to doing the arch_ptrace_stop plan, which I can
> > help
> > ia64 folks with later. 
> Ok, this is the patch against 2.6.23-rc2
> 
> 
> In ptrace case, user space RSE might be newer than kernel RSE. To avoid
> stale RSE is used when return to userspace, this patch synchronize user
> space RSE to kernel RSE.
> Also, as TIF_ALLWORK_MASK bits are limited, TIF_NOTIFY_RESUME is
> overrided.
> 
> Signed-off-by: Bibo Mao<bibo.mao at intel.com>
> Signed-off-by: Shaohua Li<shaohua.li at intel.com>
> 
> ==============================================================
>  arch/ia64/kernel/perfmon.c     |   21 ++--------------
>  arch/ia64/kernel/process.c     |   14 +++++++++++
>  arch/ia64/kernel/ptrace.c      |   52 +++++++++++++++++++++++++++++++++++++++++
>  include/asm-ia64/ptrace.h      |    1 
>  include/asm-ia64/thread_info.h |   11 ++++++--
>  5 files changed, 78 insertions(+), 21 deletions(-)
> 
> Index: 2.6.23-rc2/arch/ia64/kernel/process.c
> ===================================================================
> --- 2.6.23-rc2.orig/arch/ia64/kernel/process.c	2007-08-04 10:49:55.000000000 +0800
> +++ 2.6.23-rc2/arch/ia64/kernel/process.c	2007-08-08 09:27:18.000000000 +0800
> @@ -154,6 +154,17 @@ show_regs (struct pt_regs *regs)
>  		show_stack(NULL, NULL);
>  }
>  
> +void tsk_clear_notify_resume(struct task_struct *tsk)
> +{
> +#ifdef CONFIG_PERFMON
> +	if (tsk->thread.pfm_needs_checking)
> +		return;
> +#endif
> +	if (test_ti_thread_flag(task_thread_info(tsk), TIF_RESTORE_RSE))
> +		return;
> +	clear_ti_thread_flag(task_thread_info(tsk), TIF_NOTIFY_RESUME);
> +}
> +
>  void
>  do_notify_resume_user (sigset_t *unused, struct sigscratch *scr, long in_syscall)
>  {
> @@ -172,6 +183,9 @@ do_notify_resume_user (sigset_t *unused,
>  	/* deal with pending signal delivery */
>  	if (test_thread_flag(TIF_SIGPENDING)||test_thread_flag(TIF_RESTORE_SIGMASK))
>  		ia64_do_signal(scr, in_syscall);
> +	/* copy user rbs to kernel rbs */
> +	if (unlikely(test_thread_flag(TIF_RESTORE_RSE)))
> +		ia64_sync_krbs(current);
>  }
>  
>  static int pal_halt        = 1;
> Index: 2.6.23-rc2/arch/ia64/kernel/ptrace.c
> ===================================================================
> --- 2.6.23-rc2.orig/arch/ia64/kernel/ptrace.c	2007-08-08 09:00:28.000000000 +0800
> +++ 2.6.23-rc2/arch/ia64/kernel/ptrace.c	2007-08-08 09:27:18.000000000 +0800
> @@ -554,6 +554,25 @@ ia64_sync_user_rbs (struct task_struct *
>  	return 0;
>  }
>  
> +long
> +ia64_sync_kernel_rbs (struct task_struct *child, struct switch_stack *sw,
> +                    unsigned long user_rbs_start, unsigned long user_rbs_end)
> +{
> +        unsigned long addr, val;
> +        long ret;
> +
> +        /* now copy word for word from user rbs to kernel rbs: */
> +        for (addr = user_rbs_start; addr < user_rbs_end; addr += 8) {
> +                if (access_process_vm(child, addr, &val, sizeof(val), 0)
> +                    != sizeof(val))
> +                        return -EIO;
> +                ret = ia64_poke(child, sw, user_rbs_end, addr, val);
> +                if (ret < 0)
> +                        return ret;
> +        }
> +        return 0;
> +}
> +
>  /*
>   * Write f32-f127 back to task->thread.fph if it has been modified.
>   */
> @@ -728,6 +747,10 @@ syscall_trace_enter (long arg0, long arg
>  	if (test_thread_flag(TIF_SYSCALL_TRACE))
>  		tracehook_report_syscall(&regs, 0);
>  
> +	/* copy user rbs to kernel rbs */
> +        if (test_thread_flag(TIF_RESTORE_RSE))
> +                ia64_sync_krbs(current);
> +
>  	if (unlikely(current->audit_context)) {
>  		long syscall;
>  		int arch;
> @@ -764,6 +787,10 @@ syscall_trace_leave (long arg0, long arg
>  	if (test_thread_flag(TIF_SYSCALL_TRACE))
>  		tracehook_report_syscall(&regs, 1);
>  
> +	/* copy user rbs to kernel rbs */
> +        if (test_thread_flag(TIF_RESTORE_RSE))
> +                ia64_sync_krbs(current);
> +
>  	if (test_thread_flag(TIF_SINGLESTEP)) {
>  		force_sig(SIGTRAP, current); /* XXX */
>  		tracehook_report_syscall_step(&regs);
> @@ -1416,9 +1443,34 @@ gpregs_writeback(struct task_struct *tar
>  		 const struct utrace_regset *regset,
>  		 int now)
>  {
> +	if (test_and_set_tsk_thread_flag(target, TIF_RESTORE_RSE))
> +		return 0;
> +	tsk_set_notify_resume(target);
>  	return do_regset_call(do_gpregs_writeback, target, regset, 0, 0, NULL, NULL);
>  }
>  
> +static void do_gpregs_readback(struct unw_frame_info *info, void *arg)
> +{
> +	struct pt_regs *pt;
> +	utrace_getset_t *dst = arg;
> +	unsigned long urbs_end;
> +
> +	if (unw_unwind_to_user(info) < 0)
> +		return;
> +	pt = task_pt_regs(dst->target);
> +	urbs_end = ia64_get_user_rbs_end(dst->target, pt, NULL);
> +	dst->ret = ia64_sync_kernel_rbs(dst->target, info->sw, pt->ar_bspstore, urbs_end);
> +}
> +/*
> + * This is called to read back the register backing store.
> + */
> +long ia64_sync_krbs(struct task_struct *target)
> +{
> +	clear_tsk_thread_flag(target, TIF_RESTORE_RSE);
> +	tsk_clear_notify_resume(target);
> +	return do_regset_call(do_gpregs_readback, target, NULL, 0, 0, NULL, NULL);
> +}
> +
>  static int
>  fpregs_active(struct task_struct *target, const struct utrace_regset *regset)
>  {
> Index: 2.6.23-rc2/include/asm-ia64/ptrace.h
> ===================================================================
> --- 2.6.23-rc2.orig/include/asm-ia64/ptrace.h	2007-08-04 10:49:55.000000000 +0800
> +++ 2.6.23-rc2/include/asm-ia64/ptrace.h	2007-08-08 09:27:18.000000000 +0800
> @@ -292,6 +292,7 @@ struct switch_stack {
>  			 unsigned long, long);
>    extern void ia64_flush_fph (struct task_struct *);
>    extern void ia64_sync_fph (struct task_struct *);
> +  extern long ia64_sync_krbs(struct task_struct *);
>    extern long ia64_sync_user_rbs (struct task_struct *, struct switch_stack *,
>  				  unsigned long, unsigned long);
>  
> Index: 2.6.23-rc2/include/asm-ia64/thread_info.h
> ===================================================================
> --- 2.6.23-rc2.orig/include/asm-ia64/thread_info.h	2007-08-04 10:49:55.000000000 +0800
> +++ 2.6.23-rc2/include/asm-ia64/thread_info.h	2007-08-08 09:35:08.000000000 +0800
> @@ -71,6 +71,9 @@ struct thread_info {
>  #define alloc_task_struct()	((struct task_struct *)__get_free_pages(GFP_KERNEL | __GFP_COMP, KERNEL_STACK_SIZE_ORDER))
>  #define free_task_struct(tsk)	free_pages((unsigned long) (tsk), KERNEL_STACK_SIZE_ORDER)
>  
> +#define tsk_set_notify_resume(tsk) \
> +	set_ti_thread_flag(task_thread_info(tsk), TIF_NOTIFY_RESUME)
> +extern void tsk_clear_notify_resume(struct task_struct *tsk);
>  #endif /* !__ASSEMBLY */
>  
>  /*
> @@ -85,28 +88,30 @@ struct thread_info {
>  #define TIF_SYSCALL_AUDIT	3	/* syscall auditing active */
>  #define TIF_SINGLESTEP		4	/* restore singlestep on return to user mode */
>  #define TIF_RESTORE_SIGMASK	5	/* restore signal mask in do_signal() */
> -#define TIF_PERFMON_WORK	6	/* work for pfm_handle_work() */
> +#define TIF_NOTIFY_RESUME	6	/* resumption notification requested */
>  #define TIF_POLLING_NRFLAG	16	/* true if poll_idle() is polling TIF_NEED_RESCHED */
>  #define TIF_MEMDIE		17
>  #define TIF_MCA_INIT		18	/* this task is processing MCA or INIT */
>  #define TIF_DB_DISABLED		19	/* debug trap disabled for fsyscall */
>  #define TIF_FREEZE		20	/* is freezing for suspend */
> +#define TIF_RESTORE_RSE		21	/* task need RSE synchronization */
>  
>  #define _TIF_SYSCALL_TRACE	(1 << TIF_SYSCALL_TRACE)
>  #define _TIF_SYSCALL_AUDIT	(1 << TIF_SYSCALL_AUDIT)
>  #define _TIF_SINGLESTEP		(1 << TIF_SINGLESTEP)
>  #define _TIF_SYSCALL_TRACEAUDIT	(_TIF_SYSCALL_TRACE|_TIF_SYSCALL_AUDIT|_TIF_SINGLESTEP)
>  #define _TIF_RESTORE_SIGMASK	(1 << TIF_RESTORE_SIGMASK)
> -#define _TIF_PERFMON_WORK	(1 << TIF_PERFMON_WORK)
> +#define _TIF_NOTIFY_RESUME	(1 << TIF_NOTIFY_RESUME)
>  #define _TIF_SIGPENDING		(1 << TIF_SIGPENDING)
>  #define _TIF_NEED_RESCHED	(1 << TIF_NEED_RESCHED)
>  #define _TIF_POLLING_NRFLAG	(1 << TIF_POLLING_NRFLAG)
>  #define _TIF_MCA_INIT		(1 << TIF_MCA_INIT)
>  #define _TIF_DB_DISABLED	(1 << TIF_DB_DISABLED)
>  #define _TIF_FREEZE		(1 << TIF_FREEZE)
> +#define _TIF_RESTORE_RSE	(1 << TIF_RESTORE_RSE)
>  
>  /* "work to do on user-return" bits */
> -#define TIF_ALLWORK_MASK	(_TIF_SIGPENDING|_TIF_PERFMON_WORK|_TIF_SYSCALL_AUDIT|\
> +#define TIF_ALLWORK_MASK	(_TIF_SIGPENDING|_TIF_NOTIFY_RESUME|_TIF_SYSCALL_AUDIT|\
>  				 _TIF_NEED_RESCHED| _TIF_SYSCALL_TRACE|\
>  				 _TIF_RESTORE_SIGMASK)
>  /* like TIF_ALLWORK_BITS but sans TIF_SYSCALL_TRACE or TIF_SYSCALL_AUDIT */
> Index: 2.6.23-rc2/arch/ia64/kernel/perfmon.c
> ===================================================================
> --- 2.6.23-rc2.orig/arch/ia64/kernel/perfmon.c	2007-08-04 10:49:55.000000000 +0800
> +++ 2.6.23-rc2/arch/ia64/kernel/perfmon.c	2007-08-08 09:32:41.000000000 +0800
> @@ -586,21 +586,6 @@ pfm_put_task(struct task_struct *task)
>  }
>  
>  static inline void
> -pfm_set_task_notify(struct task_struct *task)
> -{
> -	struct thread_info *info;
> -
> -	info = (struct thread_info *) ((char *) task + IA64_TASK_SIZE);
> -	set_bit(TIF_PERFMON_WORK, &info->flags);
> -}
> -
> -static inline void
> -pfm_clear_task_notify(void)
> -{
> -	clear_thread_flag(TIF_PERFMON_WORK);
> -}
> -
> -static inline void
>  pfm_reserve_page(unsigned long a)
>  {
>  	SetPageReserved(vmalloc_to_page((void *)a));
> @@ -3730,7 +3715,7 @@ pfm_restart(pfm_context_t *ctx, void *ar
>  
>  		PFM_SET_WORK_PENDING(task, 1);
>  
> -		pfm_set_task_notify(task);
> +		tsk_set_notify_resume(task);
>  
>  		/*
>  		 * XXX: send reschedule if task runs on another CPU
> @@ -5087,7 +5072,7 @@ pfm_handle_work(void)
>  
>  	PFM_SET_WORK_PENDING(current, 0);
>  
> -	pfm_clear_task_notify();
> +	tsk_clear_notify_resume(current);
>  
>  	regs = task_pt_regs(current);
>  
> @@ -5455,7 +5440,7 @@ pfm_overflow_handler(struct task_struct 
>  			 * when coming from ctxsw, current still points to the
>  			 * previous task, therefore we must work with task and not current.
>  			 */
> -			pfm_set_task_notify(task);
> +			tsk_set_notify_resume(task);
>  		}
>  		/*
>  		 * defer until state is changed (shorten spin window). the context is locked
> -
> To unsubscribe from this list: send the line "unsubscribe linux-ia64" in
> the body of a message to majordomo at vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 

-Stephane


From shaohua.li at intel.com  Thu Aug  9 02:35:44 2007
From: shaohua.li at intel.com (Shaohua Li)
Date: Thu, 09 Aug 2007 10:35:44 +0800
Subject: [PATCH]utrace: IA64 RSE bug
In-Reply-To: <20070808231059.GC25934@frankl.hpl.hp.com>
References: <20070807071316.801494D04C4@magilla.localdomain>
	<1186540237.16482.2.camel@sli10-conroe.sh.intel.com>
	<20070808231059.GC25934@frankl.hpl.hp.com>
Message-ID: <1186626944.29125.8.camel@sli10-conroe.sh.intel.com>

On Thu, 2007-08-09 at 07:10 +0800, Stephane Eranian wrote:
> Well, looks like we need to resurect TIF_NOTIFY_RESUME now
> that we do have some actual need for it. This implies that
> I need to rework my perfmon new code base to take this into
> account. I don't want to use pfm_needs_checking. So maybe
> it is okay to use a higher order TIF bit for perfmon on IA-64
> combined with TIF_NOTIFY_RESUME.
Fine.

> For the RSE part, I am not sure I understand the logic behind this
> change. I think more explanantions maybe needed to calrify this
> a bit more.
Basically for a ptraced thread, the patch will copy user RSE to kernel
RSE every time the task will be switched to user space. The reason is
when the thread is stopped (ptraced), debugger might change thread's
user stack, and we must avoid the RSE stored in kernel to override user
stack (user space's RSE is newer in the case), so we copy user RSE to
kernel first and kernel will user the newer RSE to return to user. While
I said we do the synchronization when return to user space, there is an
exception, which is syscall trace enter. debugger might change some
syscall parameters in syscall trace enter, and kernel will use the
parameters, so we do an extra synchronization for the case.

Thanks,
Shaohua

> 
> On Wed, Aug 08, 2007 at 10:30:37AM +0800, Shaohua Li wrote:
> > On Tue, 2007-08-07 at 15:13 +0800, Roland McGrath wrote:
> > > That looks reasonable to me, not being expert on the ia64 details.
> > > I rolled that into utrace-regset-ia64.patch in the 2.6.22 backport
> > > version.
> > >
> > > > patch is against latest utrace source (2.6.22 base + utrace
> patches)
> > >
> > > The latest utrace source is based on 2.6.23-rc2, and your patch
> does
> > > not
> > > apply there.  TIF_NOTIFY_RESUME has been renamed to
> TIF_PERFMON_WORK.
> > > So you need to rename it back (or to something else) to overload
> it.
> > >
> > > If the gist of your patch here seems right to ia64 folks, then I'd
> > > suggest
> > > you do a patch adding ia64_sync_krbs to the upstream kernel.  That
> > > would be
> > > the necessary prelude to doing the arch_ptrace_stop plan, which I
> can
> > > help
> > > ia64 folks with later.
> > Ok, this is the patch against 2.6.23-rc2
> >
> >
> > In ptrace case, user space RSE might be newer than kernel RSE. To
> avoid
> > stale RSE is used when return to userspace, this patch synchronize
> user
> > space RSE to kernel RSE.
> > Also, as TIF_ALLWORK_MASK bits are limited, TIF_NOTIFY_RESUME is
> > overrided.
> >
> > Signed-off-by: Bibo Mao<bibo.mao at intel.com>
> > Signed-off-by: Shaohua Li<shaohua.li at intel.com>
> >
> > ==============================================================
> >  arch/ia64/kernel/perfmon.c     |   21 ++--------------
> >  arch/ia64/kernel/process.c     |   14 +++++++++++
> >  arch/ia64/kernel/ptrace.c      |   52
> +++++++++++++++++++++++++++++++++++++++++
> >  include/asm-ia64/ptrace.h      |    1
> >  include/asm-ia64/thread_info.h |   11 ++++++--
> >  5 files changed, 78 insertions(+), 21 deletions(-)
> >
> > Index: 2.6.23-rc2/arch/ia64/kernel/process.c
> > ===================================================================
> > --- 2.6.23-rc2.orig/arch/ia64/kernel/process.c        2007-08-04
> 10:49:55.000000000 +0800
> > +++ 2.6.23-rc2/arch/ia64/kernel/process.c     2007-08-08
> 09:27:18.000000000 +0800
> > @@ -154,6 +154,17 @@ show_regs (struct pt_regs *regs)
> >               show_stack(NULL, NULL);
> >  }
> > 
> > +void tsk_clear_notify_resume(struct task_struct *tsk)
> > +{
> > +#ifdef CONFIG_PERFMON
> > +     if (tsk->thread.pfm_needs_checking)
> > +             return;
> > +#endif
> > +     if (test_ti_thread_flag(task_thread_info(tsk),
> TIF_RESTORE_RSE))
> > +             return;
> > +     clear_ti_thread_flag(task_thread_info(tsk),
> TIF_NOTIFY_RESUME);
> > +}
> > +
> >  void
> >  do_notify_resume_user (sigset_t *unused, struct sigscratch *scr,
> long in_syscall)
> >  {
> > @@ -172,6 +183,9 @@ do_notify_resume_user (sigset_t *unused,
> >       /* deal with pending signal delivery */
> >       if (test_thread_flag(TIF_SIGPENDING)||
> test_thread_flag(TIF_RESTORE_SIGMASK))
> >               ia64_do_signal(scr, in_syscall);
> > +     /* copy user rbs to kernel rbs */
> > +     if (unlikely(test_thread_flag(TIF_RESTORE_RSE)))
> > +             ia64_sync_krbs(current);
> >  }
> > 
> >  static int pal_halt        = 1;
> > Index: 2.6.23-rc2/arch/ia64/kernel/ptrace.c
> > ===================================================================
> > --- 2.6.23-rc2.orig/arch/ia64/kernel/ptrace.c 2007-08-08
> 09:00:28.000000000 +0800
> > +++ 2.6.23-rc2/arch/ia64/kernel/ptrace.c      2007-08-08
> 09:27:18.000000000 +0800
> > @@ -554,6 +554,25 @@ ia64_sync_user_rbs (struct task_struct *
> >       return 0;
> >  }
> > 
> > +long
> > +ia64_sync_kernel_rbs (struct task_struct *child, struct
> switch_stack *sw,
> > +                    unsigned long user_rbs_start, unsigned long
> user_rbs_end)
> > +{
> > +        unsigned long addr, val;
> > +        long ret;
> > +
> > +        /* now copy word for word from user rbs to kernel rbs: */
> > +        for (addr = user_rbs_start; addr < user_rbs_end; addr += 8)
> {
> > +                if (access_process_vm(child, addr, &val,
> sizeof(val), 0)
> > +                    != sizeof(val))
> > +                        return -EIO;
> > +                ret = ia64_poke(child, sw, user_rbs_end, addr,
> val);
> > +                if (ret < 0)
> > +                        return ret;
> > +        }
> > +        return 0;
> > +}
> > +
> >  /*
> >   * Write f32-f127 back to task->thread.fph if it has been modified.
> >   */
> > @@ -728,6 +747,10 @@ syscall_trace_enter (long arg0, long arg
> >       if (test_thread_flag(TIF_SYSCALL_TRACE))
> >               tracehook_report_syscall(&regs, 0);
> > 
> > +     /* copy user rbs to kernel rbs */
> > +        if (test_thread_flag(TIF_RESTORE_RSE))
> > +                ia64_sync_krbs(current);
> > +
> >       if (unlikely(current->audit_context)) {
> >               long syscall;
> >               int arch;
> > @@ -764,6 +787,10 @@ syscall_trace_leave (long arg0, long arg
> >       if (test_thread_flag(TIF_SYSCALL_TRACE))
> >               tracehook_report_syscall(&regs, 1);
> > 
> > +     /* copy user rbs to kernel rbs */
> > +        if (test_thread_flag(TIF_RESTORE_RSE))
> > +                ia64_sync_krbs(current);
> > +
> >       if (test_thread_flag(TIF_SINGLESTEP)) {
> >               force_sig(SIGTRAP, current); /* XXX */
> >               tracehook_report_syscall_step(&regs);
> > @@ -1416,9 +1443,34 @@ gpregs_writeback(struct task_struct *tar
> >                const struct utrace_regset *regset,
> >                int now)
> >  {
> > +     if (test_and_set_tsk_thread_flag(target, TIF_RESTORE_RSE))
> > +             return 0;
> > +     tsk_set_notify_resume(target);
> >       return do_regset_call(do_gpregs_writeback, target, regset, 0,
> 0, NULL, NULL);
> >  }
> > 
> > +static void do_gpregs_readback(struct unw_frame_info *info, void
> *arg)
> > +{
> > +     struct pt_regs *pt;
> > +     utrace_getset_t *dst = arg;
> > +     unsigned long urbs_end;
> > +
> > +     if (unw_unwind_to_user(info) < 0)
> > +             return;
> > +     pt = task_pt_regs(dst->target);
> > +     urbs_end = ia64_get_user_rbs_end(dst->target, pt, NULL);
> > +     dst->ret = ia64_sync_kernel_rbs(dst->target, info->sw,
> pt->ar_bspstore, urbs_end);
> > +}
> > +/*
> > + * This is called to read back the register backing store.
> > + */
> > +long ia64_sync_krbs(struct task_struct *target)
> > +{
> > +     clear_tsk_thread_flag(target, TIF_RESTORE_RSE);
> > +     tsk_clear_notify_resume(target);
> > +     return do_regset_call(do_gpregs_readback, target, NULL, 0, 0,
> NULL, NULL);
> > +}
> > +
> >  static int
> >  fpregs_active(struct task_struct *target, const struct
> utrace_regset *regset)
> >  {
> > Index: 2.6.23-rc2/include/asm-ia64/ptrace.h
> > ===================================================================
> > --- 2.6.23-rc2.orig/include/asm-ia64/ptrace.h 2007-08-04
> 10:49:55.000000000 +0800
> > +++ 2.6.23-rc2/include/asm-ia64/ptrace.h      2007-08-08
> 09:27:18.000000000 +0800
> > @@ -292,6 +292,7 @@ struct switch_stack {
> >                        unsigned long, long);
> >    extern void ia64_flush_fph (struct task_struct *);
> >    extern void ia64_sync_fph (struct task_struct *);
> > +  extern long ia64_sync_krbs(struct task_struct *);
> >    extern long ia64_sync_user_rbs (struct task_struct *, struct
> switch_stack *,
> >                                 unsigned long, unsigned long);
> > 
> > Index: 2.6.23-rc2/include/asm-ia64/thread_info.h
> > ===================================================================
> > --- 2.6.23-rc2.orig/include/asm-ia64/thread_info.h    2007-08-04
> 10:49:55.000000000 +0800
> > +++ 2.6.23-rc2/include/asm-ia64/thread_info.h 2007-08-08
> 09:35:08.000000000 +0800
> > @@ -71,6 +71,9 @@ struct thread_info {
> >  #define alloc_task_struct()  ((struct task_struct
> *)__get_free_pages(GFP_KERNEL | __GFP_COMP, KERNEL_STACK_SIZE_ORDER))
> >  #define free_task_struct(tsk)        free_pages((unsigned long)
> (tsk), KERNEL_STACK_SIZE_ORDER)
> > 
> > +#define tsk_set_notify_resume(tsk) \
> > +     set_ti_thread_flag(task_thread_info(tsk), TIF_NOTIFY_RESUME)
> > +extern void tsk_clear_notify_resume(struct task_struct *tsk);
> >  #endif /* !__ASSEMBLY */
> > 
> >  /*
> > @@ -85,28 +88,30 @@ struct thread_info {
> >  #define TIF_SYSCALL_AUDIT    3       /* syscall auditing active */
> >  #define TIF_SINGLESTEP               4       /* restore singlestep
> on return to user mode */
> >  #define TIF_RESTORE_SIGMASK  5       /* restore signal mask in
> do_signal() */
> > -#define TIF_PERFMON_WORK     6       /* work for pfm_handle_work()
> */
> > +#define TIF_NOTIFY_RESUME    6       /* resumption notification
> requested */
> >  #define TIF_POLLING_NRFLAG   16      /* true if poll_idle() is
> polling TIF_NEED_RESCHED */
> >  #define TIF_MEMDIE           17
> >  #define TIF_MCA_INIT         18      /* this task is processing MCA
> or INIT */
> >  #define TIF_DB_DISABLED              19      /* debug trap disabled
> for fsyscall */
> >  #define TIF_FREEZE           20      /* is freezing for suspend */
> > +#define TIF_RESTORE_RSE              21      /* task need RSE
> synchronization */
> > 
> >  #define _TIF_SYSCALL_TRACE   (1 << TIF_SYSCALL_TRACE)
> >  #define _TIF_SYSCALL_AUDIT   (1 << TIF_SYSCALL_AUDIT)
> >  #define _TIF_SINGLESTEP              (1 << TIF_SINGLESTEP)
> >  #define _TIF_SYSCALL_TRACEAUDIT      (_TIF_SYSCALL_TRACE|
> _TIF_SYSCALL_AUDIT|_TIF_SINGLESTEP)
> >  #define _TIF_RESTORE_SIGMASK (1 << TIF_RESTORE_SIGMASK)
> > -#define _TIF_PERFMON_WORK    (1 << TIF_PERFMON_WORK)
> > +#define _TIF_NOTIFY_RESUME   (1 << TIF_NOTIFY_RESUME)
> >  #define _TIF_SIGPENDING              (1 << TIF_SIGPENDING)
> >  #define _TIF_NEED_RESCHED    (1 << TIF_NEED_RESCHED)
> >  #define _TIF_POLLING_NRFLAG  (1 << TIF_POLLING_NRFLAG)
> >  #define _TIF_MCA_INIT                (1 << TIF_MCA_INIT)
> >  #define _TIF_DB_DISABLED     (1 << TIF_DB_DISABLED)
> >  #define _TIF_FREEZE          (1 << TIF_FREEZE)
> > +#define _TIF_RESTORE_RSE     (1 << TIF_RESTORE_RSE)
> > 
> >  /* "work to do on user-return" bits */
> > -#define TIF_ALLWORK_MASK     (_TIF_SIGPENDING|_TIF_PERFMON_WORK|
> _TIF_SYSCALL_AUDIT|\
> > +#define TIF_ALLWORK_MASK     (_TIF_SIGPENDING|_TIF_NOTIFY_RESUME|
> _TIF_SYSCALL_AUDIT|\
> >                                _TIF_NEED_RESCHED|
> _TIF_SYSCALL_TRACE|\
> >                                _TIF_RESTORE_SIGMASK)
> >  /* like TIF_ALLWORK_BITS but sans TIF_SYSCALL_TRACE or
> TIF_SYSCALL_AUDIT */
> > Index: 2.6.23-rc2/arch/ia64/kernel/perfmon.c
> > ===================================================================
> > --- 2.6.23-rc2.orig/arch/ia64/kernel/perfmon.c        2007-08-04
> 10:49:55.000000000 +0800
> > +++ 2.6.23-rc2/arch/ia64/kernel/perfmon.c     2007-08-08
> 09:32:41.000000000 +0800
> > @@ -586,21 +586,6 @@ pfm_put_task(struct task_struct *task)
> >  }
> > 
> >  static inline void
> > -pfm_set_task_notify(struct task_struct *task)
> > -{
> > -     struct thread_info *info;
> > -
> > -     info = (struct thread_info *) ((char *) task +
> IA64_TASK_SIZE);
> > -     set_bit(TIF_PERFMON_WORK, &info->flags);
> > -}
> > -
> > -static inline void
> > -pfm_clear_task_notify(void)
> > -{
> > -     clear_thread_flag(TIF_PERFMON_WORK);
> > -}
> > -
> > -static inline void
> >  pfm_reserve_page(unsigned long a)
> >  {
> >       SetPageReserved(vmalloc_to_page((void *)a));
> > @@ -3730,7 +3715,7 @@ pfm_restart(pfm_context_t *ctx, void *ar
> > 
> >               PFM_SET_WORK_PENDING(task, 1);
> > 
> > -             pfm_set_task_notify(task);
> > +             tsk_set_notify_resume(task);
> > 
> >               /*
> >                * XXX: send reschedule if task runs on another CPU
> > @@ -5087,7 +5072,7 @@ pfm_handle_work(void)
> > 
> >       PFM_SET_WORK_PENDING(current, 0);
> > 
> > -     pfm_clear_task_notify();
> > +     tsk_clear_notify_resume(current);
> > 
> >       regs = task_pt_regs(current);
> > 
> > @@ -5455,7 +5440,7 @@ pfm_overflow_handler(struct task_struct
> >                        * when coming from ctxsw, current still
> points to the
> >                        * previous task, therefore we must work with
> task and not current.
> >                        */
> > -                     pfm_set_task_notify(task);
> > +                     tsk_set_notify_resume(task);
> >               }
> >               /*
> >                * defer until state is changed (shorten spin window).
> the context is locked
> > -
> > To unsubscribe from this list: send the line "unsubscribe
> linux-ia64" in
> > the body of a message to majordomo at vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> --
> 
> -Stephane
> 
> 


From renzo at cs.unibo.it  Mon Aug 13 21:02:50 2007
From: renzo at cs.unibo.it (Renzo Davoli)
Date: Mon, 13 Aug 2007 23:02:50 +0200
Subject: [PATCH] some optimizations for Virtual Machines
Message-ID: <20070813210250.GA8985@cs.unibo.it>

Roland (and utrace-devel community),

	I have just completed, together with Andrea Gasparini, a first 
implementation of a kernel module based on utrace as a fast support for our 
virtualization environment (view-os/umview). The name of the module is "kmview" 
kernel-mode-view-os, and the user level tool will have the same name.
We will (GPL) release both module and user level program as soon as possible.

utrace is a wondeful and well designed tool. However, IMHO, during the
implementation of kmview we have found that there are some
improvements that can be done (and that we have already implemented)
for a better support of virtual machines.

Here are some comments, I hope you'll share our ideas and you'll insert 
our improvements soon in utrace's mainstream code.

1- Order of callbacks

You say: Engines are called in the order they attached.
It is meaningful for kernel generated events but unfortunately it
does not provide a significant semantics for engine nesting when
applied to report_syscall_entry.

When dealing with several tracing/virtual machine tools the 
report_syscall_entry callbacks must be evaluated in the reverse way.

As an example I tried to use strace on a view-os like virtual machine 
(some syscalls get virtualized).
strace works but being the last engine it shows the modified calls, but the
return values of the original calls.

Wrong order:
syscall enter: call -> VM (modification) -> strace -> kernel
syscall exit: call -> VM (restore) -> strace -> kernel

Right order:
syscall enter: call -> strace -> VM (modification) -> kernel
syscall exit: call -> VM (restore) -> strace -> kernel

Reversing the attached engine list traversal for syscall_entry solves the
problem.

2- Access to traced process vm.

Your interface provides the call utrace_access_process_vm: it allows tracer 
processes to use /dev/*/mem.
Unfortunately write access is denied (as stated in fs/proc/base.c):
> #define mem_write NULL
> #ifndef mem_write
> /* This is a security hazard */
....
The /dev/*/mem way to access process vm's would be useless anyway.
When I write a virtual machine support for hundreds of processes I cannot
keep hundreds of open files. On the other hand I cannot open and close file
for each memory access: we need fast access!

I propose a new call:
int utrace_access_process_vm(struct task_struct *tsk, unsigned long addr, char __user *ubuf, int len, int write, int string);
which give I-O access to the memory of the process.
It has about the same interface of access_process_vm (mm/memory.c) with
the extra "string" option (significative only when write==0).
Sometimes a read buffer can be significantly larger than the actual field
used for a string. If string==1 the transfer terminates at '\0' avoiding
the memory error that could arise for unallocated memory after the string 
(and a slight increase in performance).
Prior to give access to the process vm, utrace_access_process_vm check
the rights to do so using utrace_allow_access_process_vm (it has the same
degree of protection of your access to /dev/*/mem).

3- In the patch I have also implemented the support for PTRACE_MULTI 
and PTRACE_SYSVM.
These two extra features provide:
-- PTRACE_MULTI: multiple PTRACE operation using one call, including
data transfer of chunks of memory and registers.
(it would speed up many commands, have a look of "strace strace ls",
to see how many bursts of prace could collapse!).
I designed this call for virtual machine support.
-- PTRACE_SYSVM: can be used instead of PTRACE_SYSCALL or SYSEMU.
At the end of the pre-syscall protocol it is possible to choose among three
different behavior:
   i- call againg after the syscall (maybe some parameters gets modified by
	    the virtualization. (like PTRACE_SYSCALL)
   ii- skip the upcall after the syscall but do perform the syscall 
	    (for a non virtualized call)
   iii- skip both the system call and the second upcall event
	    (for a completely virtualized call).
PTRACE_SYSVM almost half the number of context switches for Virtual Machines.
(SYSEMU works just for total Virtual Machines, while SYSVM works also
for partial Virtual Machines)
There is a extensive description of SYSVM in some messages I sent some 
time ago on KDML. We already implemented these features on vanilla kernel, 
this verstion based on utrace is architecture independent.

-------
THe complete patch is here:
http://www.cs.unibo.it/~renzo/utrace/
Unfortunately it is against 2.4.22. I have a very slow connection to the
Internet here, I'll try to update the patch to the latest kernel as
soon as I return home.

ciao
	renzo
-- 
============================================================================
Renzo Davoli				| Dept. of Computer Science
(NIC rd235, HAM IZ4DJE)                 | University of Bologna	
Tel. +39 051 2094501			| Mura Anteo Zamboni, 7
Fax. +39 051 2094510			| I-40127 Bologna  ITALY
Key fingerprint = A019 17E2 5562 06F6 77BB  2E93 1A01 F646 30EA B487
============================================================================


From renzo at cs.unibo.it  Mon Aug 13 21:09:29 2007
From: renzo at cs.unibo.it (Renzo Davoli)
Date: Mon, 13 Aug 2007 23:09:29 +0200
Subject: [PATCH] some optimizations for Virtual Machines
In-Reply-To: <20070813210250.GA8985@cs.unibo.it>
References: <20070813210250.GA8985@cs.unibo.it>
Message-ID: <20070813210929.GB8985@cs.unibo.it>

A bugfix for my ASCII ART ;-)

Wrong order (utrace behavior now):
syscall enter: process -VM (modification) -> strace -> kernel
syscall exit: kernel -VM (restore) -> strace -> process

Right order (proposed):
syscall enter: process -> strace -> VM (modification) -> kernel
syscall exit: kernel -> VM (restore) -> strace -> process

Sorry for this trailing errata message.

	renzo


From roland at redhat.com  Tue Aug 14 09:13:15 2007
From: roland at redhat.com (Roland McGrath)
Date: Tue, 14 Aug 2007 02:13:15 -0700 (PDT)
Subject: [PATCH]utrace: IA64 RSE bug
In-Reply-To: Shaohua Li's message of  Wednesday, 8 August 2007 10:30:37 +0800
	<1186540237.16482.2.camel@sli10-conroe.sh.intel.com>
Message-ID: <20070814091315.E199E4D057D@magilla.localdomain>

> > If the gist of your patch here seems right to ia64 folks, then I'd
> > suggest you do a patch adding ia64_sync_krbs to the upstream kernel.
> > That would be the necessary prelude to doing the arch_ptrace_stop plan,
> > which I can help ia64 folks with later.
> Ok, this is the patch against 2.6.23-rc2

That patch is for a kernel that has "tracehook" calls, which is not the
upstream kernel.  If it's for 2.6.23-rc2+linux-2.6.23-rc2-utrace.patch,
that is not what was requested in the bit you just quoted.

To reiterate, I think ia64 folks should consider a version of your patch
for vanilla 2.6.23-rc3.  It sounded like there was some agreement on
re-integrating the flag bit conflicts.  The flag bit integration and the
meat of the patch is not particular to utrace.  If you do the upstream
version of this patch and ia64 folks consider it reasonable, then it can go
into the vanilla ia64 kernel soon.  Can you do that?  As I said before,
that patch would be the precursor to the arch_ptrace_stop support that
makes more cleanup of the vanilla kernel's ia64 ptrace code possible.
(Since this would now be a patch for the vanilla ia64 kernel, it would be
misleading to keep it in a thread with "utrace" in the Subject: line.)

If the low-level parts of the RSE-vs-debugger scheme will be going into the
vanilla kernel, then I would much prefer to have that happen and then have
the utrace ia64 patch be relative to that.


Thanks,
Roland


From renzo at cs.unibo.it  Tue Aug 14 17:41:19 2007
From: renzo at cs.unibo.it (Renzo Davoli)
Date: Tue, 14 Aug 2007 19:41:19 +0200
Subject: [PATCH] update: some optimizations for Virtual Machines
In-Reply-To: <20070813210929.GB8985@cs.unibo.it>
References: <20070813210250.GA8985@cs.unibo.it>
	<20070813210929.GB8985@cs.unibo.it>
Message-ID: <20070814174119.GA8806@cs.unibo.it>

Just a quick note to say that I have updated my patch.

More precisely I have refined the virtualized syscall nesting.

When there are more engines for a task and report-callbacks can
change the status the quiescent state must be managed for each engine.

syscall enter: process -> VM0 -> VM1 (modify) -> VM2 (second
modification) -> kernel
syscall exit: kernel -> VM2 (restore) -> VM1 (restore) -> VM0 -> process

Both for syscall enter and exit the modification of VM2 must take place
after VM1 has completed its job.
If VM1 requires the quiescent state to compute its modification of the state
VM2 report_syscall_entry has to wait for VM1 to finish its job.
The same for restore in the opposite way.

One more idea:
The entry.S code provides the feature to skip the system call (by setting
the syscall number to -1).
This feature must be provided to VM nesting.

The new patch provide the correct nesting and if VM1 (in the example)
sets the syscall number to -1 it skips also VM2 report syscall report.
In this case during syscall exit skips VM2, restore the syscall number
and calls the report_syscall_exit from VM1 and then VM0.
Maybe it is useless to call VM1 too (starting from VM0 given that the 
call has been skipped) but for now I do in this way for the sake of
simmetry.

The new patch implements this policy.

Maybe the same idea (wait for quiescent state at each engine) must be 
applied to all the other report* that can change the status (e.g. 
report_signal?).

renzo


From renzo at cs.unibo.it  Mon Aug 20 13:12:00 2007
From: renzo at cs.unibo.it (Renzo Davoli)
Date: Mon, 20 Aug 2007 15:12:00 +0200
Subject: [PATCH] some optimizations for Virtual Machines
In-Reply-To: <20070813210250.GA8985@cs.unibo.it>
References: <20070813210250.GA8985@cs.unibo.it>
Message-ID: <20070820131200.GC25747@cs.unibo.it>

We have updated the patch to the latest kernel.
It is here:
http://www.cs.unibo.it/~renzo/utrace.

> 1- Order of callbacks
This patch is crucial. Without this change no virtual machines can be
nested if based on utrace.
> 
> 2- Access to traced process vm.
> 
This patch is very important: with this change VM hypervisors can access
their process memory efficiently. 
> 
> I propose a new call:
> int utrace_access_process_vm(struct task_struct *tsk, unsigned long addr, char __user *ubuf, int len, int write, int string);
> which give I-O access to the memory of the process.
I have seen that access_process_vm has been exported to modules: it is a
change included in mainstream 2.6.23 from the first rc.
so I have exported access_process_vm_user too.
access_process_vm_user has two main differences with
access_process_vm:
- it copies a memory area directly from a process vm to the user space
  of current and viceversa (it uses an internal one page buffer, it does
not need extra buffers or extra code loops).
- it supports the "string" flag for reading: no useless copies of data
  after the end of string, no memory errors due to short string read
into large buffers.

utrace_access_process_vm can be kept or not: modules can call it instead
of directly accessing access_process_vm_user when they need to check
that the requesting process has the right to access the other process
vm.
> 
> 3- In the patch I have also implemented the support for PTRACE_MULTI 
> and PTRACE_SYSVM.
This patch is just useful. We'll use it to compare the performance
between umview and kmview. Several ptrace based application could
benefit from these features (e.g. when they need to load chunks of
memory or chunks of registers, burst of ptrace calls could be sent as a
single call reducing the number of mode switches).

That's all for now.


	renzo


From adobriyan at sw.ru  Tue Aug 21 09:00:44 2007
From: adobriyan at sw.ru (Alexey Dobriyan)
Date: Tue, 21 Aug 2007 13:00:44 +0400
Subject: [PATCH] some optimizations for Virtual Machines
In-Reply-To: <20070820131200.GC25747@cs.unibo.it>
References: <20070813210250.GA8985@cs.unibo.it>
	<20070820131200.GC25747@cs.unibo.it>
Message-ID: <20070821090044.GB17761@localhost.sw.ru>

On Mon, Aug 20, 2007 at 03:12:00PM +0200, Renzo Davoli wrote:
> http://www.cs.unibo.it/~renzo/utrace.
> > 2- Access to traced process vm.
> > 
> This patch is very important: with this change VM hypervisors can access
> their process memory efficiently. 
> > 
> > I propose a new call:
> > int utrace_access_process_vm(struct task_struct *tsk, unsigned long addr, char __user *ubuf, int len, int write, int string);

"string" smells like a hack, someone will come up with his favourite
structure and ask for flag for copying, say, single-linked lists. :(

> > which give I-O access to the memory of the process.
> I have seen that access_process_vm has been exported to modules: it is a
> change included in mainstream 2.6.23 from the first rc.
> so I have exported access_process_vm_user too.
> access_process_vm_user has two main differences with
> access_process_vm:
> - it copies a memory area directly from a process vm to the user space
>   of current and viceversa (it uses an internal one page buffer,

check for allocation failure

> it does
> not need extra buffers or extra code loops).
> - it supports the "string" flag for reading: no useless copies of data
>   after the end of string, no memory errors due to short string read
> into large buffers.
> 
> utrace_access_process_vm can be kept or not: modules can call it instead
> of directly accessing access_process_vm_user when they need to check
> that the requesting process has the right to access the other process
> vm.
> > 
> > 3- In the patch I have also implemented the support for PTRACE_MULTI 
> > and PTRACE_SYSVM.

PTRACE_MULTI is horrible, it is asking for pain with compat version.

>                            Several ptrace based application could
> benefit from these features (e.g. when they need to load chunks of
> memory or chunks of registers, burst of ptrace calls could be sent as a
> single call reducing the number of mode switches).

Mode switches are fast. This is a reason why read(2) doesn't have
batched version, and read is called waaay more often than ptrace.

I also wonder if this was tested with list debugging on: iterating over
RCU protected list backwards when prev pointers are poisoned shouldn't
work.


From renzo at cs.unibo.it  Tue Aug 21 10:52:04 2007
From: renzo at cs.unibo.it (Renzo Davoli)
Date: Tue, 21 Aug 2007 12:52:04 +0200
Subject: [PATCH] some optimizations for Virtual Machines
In-Reply-To: <20070821090044.GB17761@localhost.sw.ru>
References: <20070813210250.GA8985@cs.unibo.it>
	<20070820131200.GC25747@cs.unibo.it>
	<20070821090044.GB17761@localhost.sw.ru>
Message-ID: <20070821105204.GC28503@cs.unibo.it>

> > > int utrace_access_process_vm(struct task_struct *tsk, unsigned long addr, char __user *ubuf, int len, int write, int string);
> 
> "string" smells like a hack, someone will come up with his favourite
> structure and ask for flag for copying, say, single-linked lists. :(

There are no system calls asking for linked lists as parameters.
Instead there are many having string! All those having a pathname.

Look at this:
char *s=strdup(x);
fd=open (s,O_RDONLY);

THis is a quiet, safe chunk of user code.
What do you do to grab the value of s (from a virtual machine monitor)?
With ptrace you can do a loop of PEEK_DATA one for each word of memory,
when there is a NULL byte you leave the loop.
IF you are designing a Virtual Machine this is just a performance
suicide.
You could open /proc/<pid>/mem, in this case either you keep 
one descriptor opened for each controlled process or you  open
/read/close the proc file per each access.
Either a scalability or a performance nightmare.
if you need to go fast something like:
access_process_vm(....,address_of_s, PATH_MAX ...)
can fail because the s could be in the unluky position at the end of
an allocated partition.

> > - it copies a memory area directly from a process vm to the user space
> >   of current and viceversa (it uses an internal one page buffer,
> 
> check for allocation failure
You mean:
actual>  mm = get_task_mm(tsk);
actual>  if (!mm)
actual>    return 0;
actual>
actual>  buf=kmalloc(PAGE_SIZE, GFP_KERNEL);
must be changed as:
updated>  mm = get_task_mm(tsk);
updated>  buf=kmalloc(PAGE_SIZE, GFP_KERNEL);
updated>  if (!mm || !buf)
updated>    return 0;

Okay, you're right, I'll update the patch.

> 
> > > 3- In the patch I have also implemented the support for PTRACE_MULTI 
> > > and PTRACE_SYSVM.
> 
> PTRACE_MULTI is horrible, it is asking for pain with compat version.

I cannot understand. If you do not use PTRACE_MULTI, ptrace works as
usual.
Old ptrace do not use PTRACE_MULTI so you do not need to support
PTRACE_MULTI for backward compatibility with old versions.

> 
> Mode switches are fast. This is a reason why read(2) doesn't have
> batched version, and read is called waaay more often than ptrace.

readv do exist.
Mode switches are fast, but having less mode switches is even faster.
> 
> I also wonder if this was tested with list debugging on: iterating over
> RCU protected list backwards when prev pointers are poisoned shouldn't
> work.
I do not know if the reverse scan of the list can be done better
or maybe my implementation is buggy.

I say that we do need that reverse traversal for SYSCALL_ENTRY otherwise
it is not possible to implement nested services based on utrace.

Regarding the 3 points of my original message:
1- order of calls: the patch (or a different patch implementing the same
idea) is needed, otherwise the support for nested engines 
become meaningless when dealing with system calls virtualization.
2- access to process vm: some solution is needed for a fast access to a
utraced process vm.
3- I need this patch for my project, I feel that it could speed up some
other programs, but this is not so crucial as 1 and 2.

renzo


From roland at redhat.com  Tue Aug 28 07:07:20 2007
From: roland at redhat.com (Roland McGrath)
Date: Tue, 28 Aug 2007 00:07:20 -0700 (PDT)
Subject: utrace known bugs
In-Reply-To: Alexey Dobriyan's message of  Thursday,
	2 August 2007 16:31:12 +0400
	<20070802123112.GD6553@localhost.sw.ru>
Message-ID: <20070828070720.CCF414D04CC@magilla.localdomain>

>   * unbounded utrace_engine_cache growth
>     started from 31a9ef5cfcdbae804e3e180c158bf2352728765a,
>     nobody knows why
>     testcase: at the end of  http://marc.info/?l=linux-kernel&m=117128445312243&w=2

Is it correct that this only happens with CONFIG_PREEMPT=y?

>   * _pointer_ to struct utrace, which I personally count as design bug.
>   
>     Rationale to fold struct utrace into task_struct is that lifetime
>     rules of task_struct are well established, well tested and so on. As
>     was demonstrated it also removes much complexity from attaching logic.

Ok, I'd be happy to discuss that in a separate thread.  By "known bugs" I
mean the symptoms and specific holes in the current implementation.  An
opinion about organizing the code is a fine thing, but not itself an item
for that list.


Thanks,
Roland


From adobriyan at sw.ru  Tue Aug 28 08:48:15 2007
From: adobriyan at sw.ru (Alexey Dobriyan)
Date: Tue, 28 Aug 2007 12:48:15 +0400
Subject: utrace known bugs
In-Reply-To: <20070828070720.CCF414D04CC@magilla.localdomain>
References: <20070802123112.GD6553@localhost.sw.ru>
	<20070828070720.CCF414D04CC@magilla.localdomain>
Message-ID: <20070828084815.GB22417@localhost.sw.ru>

On Tue, Aug 28, 2007 at 12:07:20AM -0700, Roland McGrath wrote:
> >   * unbounded utrace_engine_cache growth
> >     started from 31a9ef5cfcdbae804e3e180c158bf2352728765a,
> >     nobody knows why
> >     testcase: at the end of  http://marc.info/?l=linux-kernel&m=117128445312243&w=2
> 
> Is it correct that this only happens with CONFIG_PREEMPT=y?

I rechecked, and it appears that one needs 2 cpus for this, preemption
doesn't matter. In UP case there is no bug regardless of PREEMPT and in
SMP case there is bug, again, regardless of PREEMPT.


From roland at redhat.com  Tue Aug 28 10:47:40 2007
From: roland at redhat.com (Roland McGrath)
Date: Tue, 28 Aug 2007 03:47:40 -0700 (PDT)
Subject: utrace known bugs
In-Reply-To: Alexey Dobriyan's message of  Tuesday,
	28 August 2007 12:48:15 +0400
	<20070828084815.GB22417@localhost.sw.ru>
Message-ID: <20070828104740.847E84D04CC@magilla.localdomain>

Please edit http://sourceware.org/systemtap/wiki/utrace/bugs with any
updates or additions to the list of specific misbehaviors of the
implementation.  (Feel free to add other pages under wiki/utrace/ to keep
track of any other items of interest, such as implementation design ideas.)


Thanks,
Roland


From roland at redhat.com  Wed Aug 29 10:03:23 2007
From: roland at redhat.com (Roland McGrath)
Date: Wed, 29 Aug 2007 03:03:23 -0700 (PDT)
Subject: [PATCH] some optimizations for Virtual Machines
In-Reply-To: Renzo Davoli's message of  Monday,
	13 August 2007 23:02:50 +0200 <20070813210250.GA8985@cs.unibo.it>
Message-ID: <20070829100323.0D51E4D04D3@magilla.localdomain>

Thanks very much for your interest in utrace!  

> 1- Order of callbacks

This subject is on the TODO list, part of "Engine interaction issues".
You are not the first to bring it up.  I'd like to discuss the question
more broadly before considering a particular change to the behavior.
I'll do this in a separate thread.  I'd also encourage you to write up
thoughts about things like this on the wiki (make new pages under
http://sourceware.org/systemtap/wiki/utrace/), if you are so inclined.

> 2- Access to traced process vm.

I didn't include any interfaces for this in the utrace core, because
there isn't really any need.  That is, there is no direct interaction
with utrace internals required.  Some utrace-using kernel code can just
do the memory access itself, perhaps after using utrace to get threads
quiescent (and call writeback, for ia64).  You can use access_process_vm,
or do something in more specific code that does what it does, i.e. use
get_user_pages.  In recent versions, access_process_vm is no longer
exported to modules (though maybe it will be again).  But get_user_pages
is, and access_process_vm is a pretty simple function anyway.

I'm not at all against some new, convenient, exported calls for doing
this.  But they do not need to be part of the utrace core.  I think you
should just use whatever is convenient to use in your own code now.  When
there are a few different things satisfying their own very similar needs
like this, then something will gel for new shared interfaces to use.

> 3- In the patch I have also implemented the support for PTRACE_MULTI 
> and PTRACE_SYSVM.

Personally, I am not interested in additions to the ptrace interface at
all.  I don't intend to talk you out of doing it.  I completely
sympathize with your approach of adding what you most need to the
facility that is already working now, and we do not yet have any nice
post-ptrace user-level interface in production.  I just don't care to
comment on the details and merits of any new ptrace requests.  
What I am very interested in are two things about your work.

First, what issues come up with the utrace/tracehook calls in detail in
your work?  (This being things aside from the engine interaction question.)
In particular, tracehook_syscall_* come to mind.  Are those doing what
you need?

Second, what needs/desires/bottlenecks in a higher-level tracing
interface do you think are particular to the VM implementation case?
Doing more operations with fewer context switches is a universal and
well-known need, that anyone designing a new interface is well aware of.
If there are unusual needs or usage patterns peculiar to your kind of
use, I'd like to get more details about those into discussion.  Doing
tracehook_abort_syscall and fetching those details from the registers
portably is the only thing I think of off hand.  For some x86 stuff,
maybe there is segmentation magic to be done too?


In future, I think it would be most useful if you publish your changes as
a series of separate patches (a la quilt).  If you make a utrace core
change, put that in a separate patch.  If you add some utility function,
separate the patch for that.  Put your ptrace additions in a patch that's
just about that.  Then it's easier to review each piece of the puzzle on
its own merits.


Thanks,
Roland


From roland at redhat.com  Wed Aug 29 12:15:10 2007
From: roland at redhat.com (Roland McGrath)
Date: Wed, 29 Aug 2007 05:15:10 -0700 (PDT)
Subject: engine interaction, callback order
In-Reply-To: Renzo Davoli's message of  Monday,
	13 August 2007 23:02:50 +0200 <20070813210250.GA8985@cs.unibo.it>
Message-ID: <20070829121510.549364D04D3@magilla.localdomain>

Renzo Davoli brought up the issue of the order of callbacks to different
engines for the same utrace event.  I know someone working on uprobes
brought this up before (probably on the systemtap mailing list some time
ago).  

In both cases, the purpose behind the practical interest in callback order
is the same.  Both syscall interception and breakpoint assistance do
emulation or "warping" kinds of things.  In the breakpoint case, it's
useful to execute copied instructions, so the true PC (and other register
values) to be executed in user mode can be changed from the unmolested user
program.  In the syscall case, user register values are tweaked to affect
whether and what actual syscalls can execute.  

These engines temporarily change the user state that executes.  So, you
want them to do their work last, so that other observing engines see the
unmolested user state before they tweak it.  But, you want them to do their
work first, so they restore the user state to the proper values to be
observed before other engines look.

A related issue is an engine wanting the last word on a signal's
disposition.  You may want to be last in report_signal and force a
particular disposition (like terminate) no matter what other engines had
decided about the signal.  Conversely, an engine filling the core dump
function (or starting the bug-reporter robot or whatever) wants to be the
last resort, taking over only if any other engines (active debuggers, etc.)
want to let a fatal signal proceed.  

Whether you want your register values in place only at the last moment (to
be run but not seen), or want your signal disposition decision to go in
effect only at the last moment, the problem is the same.  No order of
callbacks alone helps you.  Another engine can leave QUIESCE set when
you've finished all your callbacks and cleared QUIESCE.  Then your warped
register values are there to be seen by any engine doing asynchronous
regset access while it has QUIESCE set.  Or, the signal disposition to take
effect can be changed by utrace_inject_signal (or perhaps can't, in the
current implementation).  You don't have any way to react to what other
engines are doing.

utrace_inject_signal can't do any queuing--it's injecting the particular
action (possibly fatal) to be taken the very next thing, not a signal to be
considered and dispatched later.  But then what are you supposed to do when
another engine injected first and you get -EBUSY?  There isn't an
opportunity to get your word in there, to examine or replace the competing
disposition scheduled.  

These issues lead to the idea of changing how report_quiesce and
report_signal work.  report_signal tells you "without intervention, user
mode will resume from right here and do this".  Currently report_quiesce
tells you that user mode might resume now if permitted, but also tells you
at some various places just that user mode is not running right now and
it's safe to look.  (At those latter, it's not about to get back to user
mode (or terminate) without passing through some more event points, though
it may be working and blocking nonquiescently before then.)  So perhaps
rename report_signal to report_resume, and call it when dequeuing a signal
and when dequeuing none and preparing to return to user mode after having
stopped for QUIESCE.  That is, it's called when there is the opportunity to
decide the disposition of resuming (fatal or signal handler or normal).
Then do away with utrace_inject_signal entirely.  The EVENT(SIGNAL_*) bits
say an engine wants report_resume for those dispositions, EVENT(QUIESCE)
says the engine wants it no matter what.  An engine uses QUIESCE to get the
thread to call report_resume.  Then every interested engine gets the
callback, and can see the disposition choice left by the last engine, and
whether it's been left in QUIESCE.  

With report_resume as the center of "final user status" cooperation, then
we can think about some sort of callback priority order covering the needs.

For the syscall emulation warping, I think it may be a happy special case
that can sidestep the whole complex engine interaction subject.  It may
make sense to have a first-class utrace feature via return values from
report_syscall callbacks, for skipping a syscall and for restoring register
values to their values before kernel entry.  Rather than an engine using
tracehook_abort_syscall et al, so that other engines can see its changes to
the user registers or pseudo-register, the utrace core would do it after
all callbacks and places engines could stop and look.  


I'd appreciate any feedback anyone has in any of these areas.


Thanks,
Roland


From roland at redhat.com  Thu Aug 30 06:36:49 2007
From: roland at redhat.com (Roland McGrath)
Date: Wed, 29 Aug 2007 23:36:49 -0700 (PDT)
Subject: [PATCH] late checking of permissions during PTRACE_ATTACH
In-Reply-To: Alexey Dobriyan's message of  Thursday,
	2 August 2007 17:08:41 +0400
	<20070802130841.GE6553@localhost.sw.ru>
Message-ID: <20070830063649.880FE4D04CC@magilla.localdomain>

Thanks very much for your report.  I'm sorry there's been such a delay
before I could follow it up.

> ptrace_attach() does permissions check _after_ actual attaching. Given
> that utrace_attach is quite non-trivial operation there is no way such
> ordering should be allowed -- the following program should crash the box
> in less than second.

utrace_attach is intended to be a reasonably cheap operation.  Anyway, we
are never too concerned about the performance of an error case.  

The reason ptrace_attach does utrace_attach first is to preserve the order
of error diagnoses.  That is, to avoid calling ptrace_may_attach at all if
ptrace_attach is going to fail because someone is already attached.  This
keeps it consistent with vanilla ptrace.  In particular, security_ptrace
should not be called when ptrace_attach fails for the "already attached" or
"already dead" errors.  Whether the call succeeds or fails, it may trigger
security logging or whatnot that should not be done for these cases.

The utrace_attach, utrace_detach sequence in a failing ptrace_attach is
slightly costly.  But it a) shouldn't be too bad and b) is just an error
case.  Moreover, it always ought to work without crashes whether it's a
good idea or not!

Your patch fixes what's not itself a problem, and thereby masks the actual
problem that needs fixing.  Fortunately, I can now reproduce this problem
quickly using your test case.  This is the utrace_detach bug you previously
identified in http://lkml.org/lkml/2007/5/8/244, which is already #1 on the
wiki utrace/bugs list.  I wasn't using a good test case for it before, but
this case hits it.  I think some reasonable fixes are straightforward, and
now I can try one and test it with some confidence using this test case.


Thanks,
Roland


From fche at redhat.com  Wed Sep  5 01:49:32 2007
From: fche at redhat.com (Frank Ch. Eigler)
Date: Tue, 04 Sep 2007 21:49:32 -0400
Subject: external systemtap meeting notes 20070816
In-Reply-To: <mailman.3771.1188945116.11517.external-perftools-list@redhat.com>
	(Jim Keniston's message of "Tue, 04 Sep 2007 15:31:52 -0700")
References: <mailman.8592.1187789486.29635.external-perftools-list@redhat.com>
	<y0mlkc3k23c.fsf@ton.toronto.redhat.com>
	<mailman.3771.1188945116.11517.external-perftools-list@redhat.com>
Message-ID: <y0mtzq9q3tf.fsf@ton.toronto.redhat.com>


jkenisto wrote [untrimmed for forwarding to utrace-devel]:

> [...]
> Uprobes maintains one utrace engine per probed task.  When
> instrumentation modules A and B register uprobes at the same address,
> and that probepoint gets hit, uprobes's signal (SIGTRAP) callback gets
> called (once).  It calls both A's and B's handlers (a la aggregate
> kprobes), then eats the SIGTRAP.
>  
> If A and B each got its own engine, then what would uprobes's callback
> do?  If the callback eats the SIGTRAP, then the 2nd handler never gets
> called.  If it doesn't eat the SIGTRAP, the probed process takes the
> SIGTRAP and dies.  We could work up some new way that A and B could
> coordinate this sort of thing, but then we're back to needing some sort
> of shared (among instrumentation modules) data structures.  And what we
> have now works just fine.

This is probably related to Roland's "engine interaction" note
(http://tinyurl.com/22jt2o).  It would be nice if concurrent engines
could nest transparently - or failing that, at least be aware of each
other enough to get this right.

- FChE


From jkenisto at us.ibm.com  Wed Sep  5 21:50:16 2007
From: jkenisto at us.ibm.com (Jim Keniston)
Date: Wed, 05 Sep 2007 14:50:16 -0700
Subject: engine interaction, callback order
In-Reply-To: <20070829121510.549364D04D3@magilla.localdomain>
References: <20070829121510.549364D04D3@magilla.localdomain>
Message-ID: <1189029016.4393.41.camel@dyn9047018096.beaverton.ibm.com>

On Wed, 2007-08-29 at 05:15 -0700, Roland McGrath wrote:
> Renzo Davoli brought up the issue of the order of callbacks to different
> engines for the same utrace event.  I know someone working on uprobes
> brought this up before (probably on the systemtap mailing list some time
> ago).

Yeah.  I think one concern was that when a breakpoint is hit, a utrace
client other than uprobes might see the SIGTRAP first and take some
unwarranted action because it thinks that the task is going to die.

...
> 
> These issues lead to the idea of changing how report_quiesce and
> report_signal work.

I'm having trouble connecting all the dots in the discussion below.
Maybe if you provide clarifications on the indicated points, I can give
you better feedback.

> report_signal tells you "without intervention, user
> mode will resume from right here and do this".  Currently report_quiesce
> tells you that user mode might resume now if permitted, but also tells you
> at some various places just that user mode is not running right now and
> it's safe to look.  (At those latter, it's not about to get back to user
> mode (or terminate) without passing through some more event points, though
> it may be working and blocking nonquiescently before then.)

I don't understand the above sentence.  In particular, your use of
"though" puzzles me, and I'm not sure what you mean by "before then."
When is "then?"

> So perhaps
> rename report_signal to report_resume, and call it when dequeuing a signal
> and when dequeuing none and preparing to return to user mode after having
> stopped for QUIESCE.

What do you mean by "and when dequeuing none?"

> That is, it's called when there is the opportunity to
> decide the disposition of resuming (fatal or signal handler or normal).
> Then do away with utrace_inject_signal entirely.  The EVENT(SIGNAL_*) bits
> say an engine wants report_resume for those dispositions, EVENT(QUIESCE)
> says the engine wants it no matter what.

Would there still be a report_quiesce callback?  Would utrace call
report_quiesce when entering quiescence and report_resume when leaving
quiescence?

report_signal passes several signal-related args.  Would those also be
passed to report_resume?

> An engine uses QUIESCE to get the
> thread to call report_resume.  Then every interested engine gets the
> callback, and can see the disposition choice left by the last engine,

How?  Encoded in the args to report_resume?

> and
> whether it's been left in QUIESCE.  
> 
> ...  
> 
> 
> I'd appreciate any feedback anyone has in any of these areas.
> 
> 
> Thanks,
> Roland
> 


From roland at redhat.com  Thu Sep  6 01:32:26 2007
From: roland at redhat.com (Roland McGrath)
Date: Wed,  5 Sep 2007 18:32:26 -0700 (PDT)
Subject: engine interaction, callback order
In-Reply-To: Jim Keniston's message of  Wednesday,
	5 September 2007 14:50:16 -0700
	<1189029016.4393.41.camel@dyn9047018096.beaverton.ibm.com>
References: <20070829121510.549364D04D3@magilla.localdomain>
	<1189029016.4393.41.camel@dyn9047018096.beaverton.ibm.com>
Message-ID: <20070906013226.82EA94D04D3@magilla.localdomain>

> I'm having trouble connecting all the dots in the discussion below.
> Maybe if you provide clarifications on the indicated points, I can give
> you better feedback.

I'm very glad to do it!  It was a bit of a brain dump, just hoping to get
everything on the table and have a real discussion begin.  (It would be
great if someone wrote up or editted expanded explanations on the wiki.)

> > report_signal tells you "without intervention, user
> > mode will resume from right here and do this".  Currently report_quiesce
> > tells you that user mode might resume now if permitted, but also tells you
> > at some various places just that user mode is not running right now and
> > it's safe to look.  (At those latter, it's not about to get back to user
> > mode (or terminate) without passing through some more event points, though
> > it may be working and blocking nonquiescently before then.)
> 
> I don't understand the above sentence.  In particular, your use of
> "though" puzzles me, and I'm not sure what you mean by "before then."
> When is "then?"

(I won't try to explain my grammar, just my meaning. :-) I was citing the
distinction between the two kinds of places report_quiesce is now called.
The latter category includes places like syscall tracing.  Here
report_quiesce does not mean that the very next thing the thread will do is
return to user mode.  In fact, if UTRACE_EVENT(QUIESCE) remains set, there
will always be another callback (of the former category) before it gets to
user mode.  However ("though"), said second callback ("then") may not occur
very quickly, since the thread (e.g. performing some syscall) may do active
work or uninterruptible kinds of blocking beforehand.

> > So perhaps
> > rename report_signal to report_resume, and call it when dequeuing a signal
> > and when dequeuing none and preparing to return to user mode after having
> > stopped for QUIESCE.
> 
> What do you mean by "and when dequeuing none?"

I mean when there is nothing left to do but return to user mode.  The place
this happens is utrace_get_signal, called from get_signal_to_deliver.
What's going on there is dequeuing all pending signals and acting on them,
and then returning to user mode.  i.e., it's a loop, and the final
iteration is the one that decides there are no pending signals to dequeue.
(If a signal is fatal, the thread dies in the middle of one of those
iterations and never hits the natural end condition of the loop.
If a signal causes job control stop, the thread stops in the middle of one
of those iterations and then picks up in the loop again after SIGCONT.)
Currently report_signal gets called for each iteration but the last.  The
idea above is that it (now called report_resume) would be called for each
iteration, including the last.

> Would there still be a report_quiesce callback?  Would utrace call
> report_quiesce when entering quiescence and report_resume when leaving
> quiescence?
> 
> report_signal passes several signal-related args.  Would those also be
> passed to report_resume?

Yes.  When I said "rename report_signal to report_resume", I meant it
literally and precisely--change the name, not the signature.  The
difference in its calling convention would be that the "action" argument
might initially be UTRACE_ACTION_RESUME, saying that there is no pending
signal at all.  As with report_signal now, a callback could return
UTRACE_SIGNAL_* to say what should happen instead.  In the
no-pending-signal case, it would have to fill in info->si_signo to be
nonzero and set return_ka if it didn't want SIG_DFL, since there will have
been no original signal number there to start with.  (That's the use that
would replace utrace_inject_signal.)

> > An engine uses QUIESCE to get the
> > thread to call report_resume.  Then every interested engine gets the
> > callback, and can see the disposition choice left by the last engine,
> 
> How?  Encoded in the args to report_resume?

Yes, as described above.


Thanks,
Roland


From roland at redhat.com  Thu Sep  6 02:29:59 2007
From: roland at redhat.com (Roland McGrath)
Date: Wed,  5 Sep 2007 19:29:59 -0700 (PDT)
Subject: engine interaction, callback order
In-Reply-To: Roland McGrath's message of  Wednesday,
	5 September 2007 18:32:26 -0700
	<20070906013226.82EA94D04D3@magilla.localdomain>
References: <20070829121510.549364D04D3@magilla.localdomain>
	<1189029016.4393.41.camel@dyn9047018096.beaverton.ibm.com>
	<20070906013226.82EA94D04D3@magilla.localdomain>
Message-ID: <20070906022959.08F924D04D3@magilla.localdomain>

> > Would there still be a report_quiesce callback?  Would utrace call
> > report_quiesce when entering quiescence and report_resume when leaving
> > quiescence?

Sorry, I didn't answer this part.  To be decided.  Something like that.
Perhaps report_quiesce should go away entirely.  One idea is leaving it
only for the other cases when not about to resume.  Then report_resume
would be called when first arriving at "almost back to user mode", with its
"action" giving UTRACE_ACTION_QUIESCE if the flag is set.  If the
UTRACE_ACTION_QUIESCE state flag is still set after those callbacks, then
we go into quiescence (TASK_TRACED).  When all engines have cleared
QUIESCE, we wake up, and run report_resume again with UTRACE_ACTION_RESUME
(or with any pending signal ready to be delivered).  Or perhaps it's
clearer to leave report_quiesce, have only that run before stopping while
quiescent, and then run report_resume after waking.  (A report_resume
callback can always decide to set QUIESCE again.)  The benefit of the first
idea is that then report_quiesce (if kept at all) means only the "you can
stop, but I'm not near user mode" cases.  If you want to e.g. look at the
user mode registers in their proper form, you need to wait for
report_resume.  So then if you get report_quiesce before report_resume, it
means there may be a delay before the opportunity to get at the registers
freely (e.g. it may be safe to read but not write them at report_quiesce).


Thanks,
Roland


From roland at redhat.com  Thu Sep 13 01:28:06 2007
From: roland at redhat.com (Roland McGrath)
Date: Wed, 12 Sep 2007 18:28:06 -0700 (PDT)
Subject: resetting MSR_SE in single_step_exception
Message-ID: <20070913012806.BF1754D0412@magilla.localdomain>

Hi.  Can you help me understand why arch/powerpc/kernel/traps.c's
single_step_exception and emulate_single_step functions clear the trace
flags (MSR_SE et al) on every trace trap?

The users of this bit that I know in any detail are ptrace and kprobes.
It looks to me like kprobes does not expect MSR_SE to be cleared
automatically, though I am not entirely certain.  

ptrace does not depend on MSR_SE being cleared to run correctly.  That
is, it always explicitly clears it before resuming for non-stepping
anyway.  It may be relevant that ptrace users do not see the bits set
when in msr they fetch the registers.  But, though a user ptrace can set
the bits while the target is stopped, at least MSR_SE is always cleared
on resuming it.  I'm not aware of what userland debuggers change msr and
what they are doing, if any do.

On x86, the processor single-step flag can also be set by unprivileged
instructions.  User programs do this, and catch their own SIGTRAPs, for
strange purposes.  Is it possible to set MSR_SE/MSR_BE with unprivileged
instructions on powerpc?

Dave Nomura discovered this issue when using my utrace patches on powerpc.
This is why I'm asking.  There is no problem with the current behavior for
ptrace as implemented under utrace, either.  Like vanilla ptrace, in the
utrace implementation, whenever a thread has been stopped for tracing and
then resumes, it will explicitly set or clear single-stepping every time.
However, utrace also allows uses where the single-step event is handled
(recorded, whatever) without stopping; Dave is experimenting with exactly
that.  In this case, the thread never goes through the "stop for tracing
and wake up again" logic.  On other machines, the flag just stays set in
the user registers until explicitly cleared by debugger action.  So that's
what the existing generic utrace logic expects.

I'm tentatively using this short patch (more text below):

diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index d8502e3..0000000 100644  
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -525,7 +525,8 @@ void RunModeException(struct pt_regs *re
 
 void __kprobes single_step_exception(struct pt_regs *regs)
 {
-	regs->msr &= ~(MSR_SE | MSR_BE);  /* Turn off 'trace' bits */
+	if (!user_mode(regs) || !test_thread_flag(TIF_SINGLESTEP))
+		regs->msr &= ~(MSR_SE | MSR_BE);  /* Turn off 'trace' bits */
 
 	if (notify_die(DIE_SSTEP, "single_step", regs, 5,
 					5, SIGTRAP) == NOTIFY_STOP)
@@ -545,7 +546,8 @@ void __kprobes single_step_exception(str
 static void emulate_single_step(struct pt_regs *regs)
 {
 	if (single_stepping(regs)) {
-		clear_single_step(regs);
+		if (!user_mode(regs) || !test_thread_flag(TIF_SINGLESTEP))
+			clear_single_step(regs);
 		_exception(SIGTRAP, regs, TRAP_TRACE, 0);
 	}
 }


That is a conservative change that won't affect things like kprobes if
they care.  It will affect the msr seen by ptrace after PTRACE_SINGLESTEP,
though not that seen when the PTRACE_SINGLESTEP'd instruction hits another
signal, for example.  

But, I wonder if there is really a need to clear the bits here at all.
Why not let whoever set them be responsible for clearing them?  That's how
we handle the analogous bit on x86 and some others.  It seems good for
overall comprehensibility to make the details of how we use similar
hardware features on different hardware similar when possible.  Unless
there is a reason for it I don't understand, maybe these functions can
change not to touch msr at all?

Another alternative is that I can change utrace so that it always
explicitly enables single-step anew after there has been a trap and
single-step mode was left set.  (In the utrace interface, it's a
persistent thread state, not a one-shot command like PTRACE_SINGLESTEP.)
But if we can (continue to) avoid this extra code path when not stopping,
that seems preferable.

Looking at this made me notice MSR_BE.  The powerpc manual I have says
that implementing branch tracing is optional.  Do common chips in fact
implement it?  How can you tell if it is supported (at compile time or via
runtime chip feature bits)?


Thanks,
Roland


From ananth at in.ibm.com  Thu Sep 13 06:12:20 2007
From: ananth at in.ibm.com (Ananth N Mavinakayanahalli)
Date: Thu, 13 Sep 2007 11:42:20 +0530
Subject: resetting MSR_SE in single_step_exception
In-Reply-To: <20070913012806.BF1754D0412@magilla.localdomain>
References: <20070913012806.BF1754D0412@magilla.localdomain>
Message-ID: <20070913061220.GA14776@in.ibm.com>

On Wed, Sep 12, 2007 at 06:28:06PM -0700, Roland McGrath wrote:
> Hi.  Can you help me understand why arch/powerpc/kernel/traps.c's
> single_step_exception and emulate_single_step functions clear the trace
> flags (MSR_SE et al) on every trace trap?
> 
> The users of this bit that I know in any detail are ptrace and kprobes.
> It looks to me like kprobes does not expect MSR_SE to be cleared
> automatically, though I am not entirely certain.  

On powerpc, since single_step_exception() explicitly resets MSR_SE,
kprobes doesn't reset it again. However, in the post_handler, if we see
the MSR_SE bit set, its deemed that some other user explicitly wants
to singlestep and kprobes returns NOTIFY_DONE so the user who set the
bit can be notified. This also plays nicely if someone is stepping
through the code using xmon.

Ananth


From dcnltc at us.ibm.com  Tue Sep 18 14:28:58 2007
From: dcnltc at us.ibm.com (Dave Nomura)
Date: Tue, 18 Sep 2007 07:28:58 -0700
Subject: utrace on ppc not working
In-Reply-To: <20070913002230.B0AE94D0412@magilla.localdomain>
References: <46E821A4.20206@us.ibm.com>
	<20070913002230.B0AE94D0412@magilla.localdomain>
Message-ID: <46EFE0AA.60808@us.ibm.com>

Roland,
Thanks for the quick response.  I am aware of some of the special instructions that need special care when single stepping on the PPC (but can't say I really understand the details).  

Don't you think this is the sort of thing that utrace single-stepping should handle so that every utrace client doesn't have to figure this out? gdb identifies certain non-steppable instr sequences and does some flow analysis to set a bpt at the first steppable instr.

Or, perhaps this should be done in the kernel, arch/<arch>/kernel/traps.c.  In the arch=powerpc cash it looks like there is code there to emulate single stepping in certain scenarios.  Paul?

Thanks,
Dave


Roland McGrath wrote:
> Yes, I see the problem too and have figured it out.  You were seeing only
> the hits after system calls, and the first hit.  It's fixed by the
> following patch.  I'm going to ask some ppc folks about this before
> deciding for sure this is the right change.
>
> With the fix, your test will go into an infinite loop.  That is "correct"
> behavior, not any bug (but a known issue for single-stepping in general).
> This is because some stdio functions use locking (even if the program is
> not multithreaded), so there is a lwarx, stwcx., bne sequence.  A
> single-step always breaks the lwarx+stwcx. sequence so it has to repeat.
>
> This is a known issue for stepping atomic sequences on several kinds of
> processor.  I have some thoughts relating to that, but it's not an issue
> with any quick fixes.  If you want to discuss that, please do so on the
> utrace-devel at redhat.com mailing list.
>
>
> Thanks,
> Roland
>
>
> diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
> index d8502e3..0000000 100644  
> --- a/arch/powerpc/kernel/traps.c
> +++ b/arch/powerpc/kernel/traps.c
> @@ -525,7 +525,8 @@ void RunModeException(struct pt_regs *re
>
>  void __kprobes single_step_exception(struct pt_regs *regs)
>  {
> -	regs->msr &= ~(MSR_SE | MSR_BE);  /* Turn off 'trace' bits */
> +	if (!test_thread_flag(TIF_SINGLESTEP))
> +		regs->msr &= ~(MSR_SE | MSR_BE);  /* Turn off 'trace' bits */
>
>  	if (notify_die(DIE_SSTEP, "single_step", regs, 5,
>  					5, SIGTRAP) == NOTIFY_STOP)
> @@ -545,7 +546,8 @@ void __kprobes single_step_exception(str
>  static void emulate_single_step(struct pt_regs *regs)
>  {
>  	if (single_stepping(regs)) {
> -		clear_single_step(regs);
> +		if (!test_thread_flag(TIF_SINGLESTEP))
> +			clear_single_step(regs);
>  		_exception(SIGTRAP, regs, TRAP_TRACE, 0);
>  	}
>  }
>   


-- 
Dave Nomura
LTC Linux Power Toolchain


From roland at redhat.com  Wed Sep 26 07:59:58 2007
From: roland at redhat.com (Roland McGrath)
Date: Wed, 26 Sep 2007 00:59:58 -0700 (PDT)
Subject: tracing-induced signals (engine interaction, callback order)
In-Reply-To: Jim Keniston's message of  Wednesday,
	5 September 2007 14:50:16 -0700
	<1189029016.4393.41.camel@dyn9047018096.beaverton.ibm.com>
References: <20070829121510.549364D04D3@magilla.localdomain>
	<1189029016.4393.41.camel@dyn9047018096.beaverton.ibm.com>
Message-ID: <20070926075958.DF4B94D04B7@magilla.localdomain>

> Yeah.  I think one concern was that when a breakpoint is hit, a utrace
> client other than uprobes might see the SIGTRAP first and take some
> unwarranted action because it thinks that the task is going to die.

There are two components of this issue.  The issue is that tracing
activities induce signals that would not ordinarily be raised in the
program.  Inserted breakpoints do this; so does single-step; so do hw
breakpoint features; etc.

First is the problem of interaction with signals generally.  This is
already a problem in vanilla ptrace.  Induced signals interact with the
normal functioning of the system, not just with competing tracers.  A
straightforward example is when the debugger suddenly dies.  If it had
been using single-step or suchlike, there may be a pending SIGTRAP that
was already posted before the debugger exited and detached, but had not
yet been processed by the debugger.  This is easy to achieve with a
wedged debugger, for example: it hasn't called wait4 to eat the signal
yet, and then it gets killed and never does.  Now the thread that was
formerly traced resumes running, and delivers the SIGTRAP and dumps
core.  Some kernel behavior tweaks to avoid that particular scenario are
easy to come up with.  But that is just the simplest example of the
issue.  More subtle bad effects range from perturbing the signal queue
resource accounting to interfering with the program's own use of blocked
and pending signals.  (Today it's effectively impossible to do some
normal debugging activities with a program that itself catches and
generates SIGTRAP.)

In summary, the first component is distinguishing tracing events from
real signals.  The second problem is engine interaction per se.

My view is that the crux of the former problem is the whole notion of
overloading the signals mechanism for reporting tracing-induced events.
Signals have one essential characteristic we need: they can be
queued/posted quickly and safely at interrupt level, to take effect only
at the safe, unentangled place just before returning to user mode.  My
thinking is to have a mechanism with this property separate from signals.
The "extension events" item on the TODO list is such a mechanism.  That
feature idea is intended to serve a variety of needs.  Among them is the
idea to use these for machine-level events that are being traced.  So
rather than catching signals, an engine inducing traps would register its
interest in the "single-stepped" event or the "breakpoint insn" event.
In cases like the trap for a breakpoint instruction, there is nothing at
the lowest level to distinguish a tracing-induced breakpoint from a
user's other use of those instructions (intended to generate a SIGTRAP).
So this event would be reported to interested tracing engines, and if no
engine consumes it, then it can become a signal.  I think the interface
can be devised such that a detaching engine always gets a callback for
any last events, so it's harder to write an engine that forgets to
swallow the last event it induced and leaves someone to die by SIGTRAP.

The mechanism to support that is a little way off.  But this is how I see
that going.  In that context, engine interaction seems a lot simpler.
Engines that induce low-level events have to sort those out among each
other (probably easy from PC values and such).  Engines interested in
fatal signals only ever see "natural" signals.  Those engines have to
work out among themselves who eats the signal last.  The report_resume
interface and simple priorities for callback order seem like they might
suffice.


Thanks,
Roland


From roland at redhat.com  Wed Sep 26 08:42:29 2007
From: roland at redhat.com (Roland McGrath)
Date: Wed, 26 Sep 2007 01:42:29 -0700 (PDT)
Subject: external systemtap meeting notes 20070816
In-Reply-To: Frank Ch. Eigler's message of  Tuesday,
	4 September 2007 21:49:32 -0400
	<y0mtzq9q3tf.fsf@ton.toronto.redhat.com>
References: <mailman.8592.1187789486.29635.external-perftools-list@redhat.com>
	<y0mlkc3k23c.fsf@ton.toronto.redhat.com>
	<mailman.3771.1188945116.11517.external-perftools-list@redhat.com>
	<y0mtzq9q3tf.fsf@ton.toronto.redhat.com>
Message-ID: <20070926084229.B17714D04B7@magilla.localdomain>

> > If A and B each got its own engine, then what would uprobes's callback
> > do?  If the callback eats the SIGTRAP, then the 2nd handler never gets
> > called.  If it doesn't eat the SIGTRAP, the probed process takes the
> > SIGTRAP and dies.  We could work up some new way that A and B could
> > coordinate this sort of thing, but then we're back to needing some sort
> > of shared (among instrumentation modules) data structures.  And what we
> > have now works just fine.
> 
> This is probably related to Roland's "engine interaction" note
> (http://tinyurl.com/22jt2o).  It would be nice if concurrent engines
> could nest transparently - or failing that, at least be aware of each
> other enough to get this right.

cf https://www.redhat.com/archives/utrace-devel/2007-September/msg00007.html

The future non-signal mechanism I described there can have a reporting
interface that lets an engine say "I induced this" while letting other
engines still see it, in case two have that same answer for a single
occurrence.  That is, it wouldn't cause any signal if any engines at all
claimed it, yet multiple engines could always see it.  This addresses the
part of the breakpoint interaction problem that Jim described here.

The other part of the problem is insertion/removal.  Naive non-cooperation
works if they literally nest, but not if removal order is not LIFO.  
I don't have any implicit-communication solution for that off hand.


Thanks,
Roland


From roland at redhat.com  Wed Sep 26 10:28:50 2007
From: roland at redhat.com (Roland McGrath)
Date: Wed, 26 Sep 2007 03:28:50 -0700 (PDT)
Subject: utrace on ppc not working
In-Reply-To: Dave Nomura's message of  Tuesday,
	18 September 2007 07:28:58 -0700 <46EFE0AA.60808@us.ibm.com>
References: <46E821A4.20206@us.ibm.com>
	<20070913002230.B0AE94D0412@magilla.localdomain>
	<46EFE0AA.60808@us.ibm.com>
Message-ID: <20070926102850.DCB0B4D04B7@magilla.localdomain>

> Thanks for the quick response.  I am aware of some of the special
> instructions that need special care when single stepping on the PPC (but
> can't say I really understand the details).

SMP locking is done by sequences of at least three instructions that must
all three execute uninterrupted to succeed.  If the sequence is interrupted
for any reason, it will loop and repeat to try again.  Normal single-step
always interrupts the sequence, so it can never succeed.

> Don't you think this is the sort of thing that utrace single-stepping
> should handle so that every utrace client doesn't have to figure this
> out? gdb identifies certain non-steppable instr sequences and does some
> flow analysis to set a bpt at the first steppable instr.

I wholeheartedly agree that this should be implemented in code shared by
all engines that need it.  It is already on my list.  However, that does
not mean that the lowest-level feature should do that.  The basic simple
semantics (so far always in hardware) is what UTRACE_ACTION_SINGLESTEP
enables. 

> Or, perhaps this should be done in the kernel,
> arch/<arch>/kernel/traps.c.  In the arch=powerpc cash it looks like there
> is code there to emulate single stepping in certain scenarios.  Paul?

When the lowest-level implementation actually works by pure emulation, then
I can see more of an argument for having that do the fancy business
directly.  This is not the case on powerpc.  The code you're referring to
has to do with emulating unsupported chip features and the like, and is far
from a complete emulation.  The basic implementation of single-step is just
the hardware feature.

I have discussed before decomposing the instruction-level work involved in
the SSOOL implementations (a la kprobes and uprobes).  I've talked about
this for the purpose of breakpoint assistance.  What we might call step
assistance will be another beneficiary of that modularization.  For
machines without hardware single-step and where pure emulation is not
desireable or easy enough to implement, step assistance can simulate it.
For machines without hardware block-step, step assistance can simulate
that.  The facility for step assistance can identify atomic-sequence
instructions so a stepper can decide to treat them specially.  Stepping
over an atomic sequence in the desired way is almost the same as block-step
simulation.  In this manner, I anticipate filling the needs for smarter
stepping in a shared component that all clients wanting it will use.

This approach has many complications and entanglements.  It may be
impractical for a very naive client to employ it in all situations
without bad interactions it doesn't understand.  It's not entirely
trivial to implement all the pieces of the puzzle for a particular
architecture.  Special treatment of any instructions (such as atomic
sequences) certainly breaks strict compatibility with what
PTRACE_SINGLESTEP has meant before (even if that seems desireable, it
may have unanticipated consequences for some existing application).
For these reasons, I do not see it being integrated into the lowest
level of the implementation as you suggest.

Pure emulation does not suffer from most of these issues.  Its
performance is probably fine (taking traps is expensive too).
But, on many machines it may be an excessive effort to implement
and maintain with robust handling of all corners of behavior of
all instructions.


Thanks,
Roland


From fche at redhat.com  Wed Sep 26 12:01:15 2007
From: fche at redhat.com (Frank Ch. Eigler)
Date: Wed, 26 Sep 2007 08:01:15 -0400
Subject: external systemtap meeting notes 20070816
In-Reply-To: <20070926084229.B17714D04B7@magilla.localdomain>
References: <mailman.8592.1187789486.29635.external-perftools-list@redhat.com>
	<y0mlkc3k23c.fsf@ton.toronto.redhat.com>
	<mailman.3771.1188945116.11517.external-perftools-list@redhat.com>
	<y0mtzq9q3tf.fsf@ton.toronto.redhat.com>
	<20070926084229.B17714D04B7@magilla.localdomain>
Message-ID: <20070926120115.GT8964@redhat.com>

Hi -

On Wed, Sep 26, 2007 at 01:42:29AM -0700, Roland McGrath wrote:

> [...]  The future non-signal mechanism I described there can have a
> reporting interface [...]  The other part of the problem is
> insertion/removal.  Naive non-cooperation works if they literally
> nest, but not if removal order is not LIFO.  I don't have any
> implicit-communication solution for that off hand.

Yeah, this is roughly why we pointed out some time back that the
utrace layer would be well situated to provide a high-level
breakpoint-related API.

What do you suggest in the interim?

Would this hack work: have the second utrace engine refuse to put a
breakpoint wherever it suspects another engine may have put one?  Or
even more pessimistically, can an engine know that another one is
already monitoring a given target process, and give up at attach time?
(That would defeat some of the promise of utrace, but so it goes.)

- FChE
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/utrace-devel/attachments/20070926/b3d2d654/attachment.sig>

From roland at redhat.com  Wed Sep 26 20:33:56 2007
From: roland at redhat.com (Roland McGrath)
Date: Wed, 26 Sep 2007 13:33:56 -0700 (PDT)
Subject: external systemtap meeting notes 20070816
In-Reply-To: Frank Ch. Eigler's message of  Wednesday,
	26 September 2007 08:01:15 -0400
	<20070926120115.GT8964@redhat.com>
References: <mailman.8592.1187789486.29635.external-perftools-list@redhat.com>
	<y0mlkc3k23c.fsf@ton.toronto.redhat.com>
	<mailman.3771.1188945116.11517.external-perftools-list@redhat.com>
	<y0mtzq9q3tf.fsf@ton.toronto.redhat.com>
	<20070926084229.B17714D04B7@magilla.localdomain>
	<20070926120115.GT8964@redhat.com>
Message-ID: <20070926203356.F26164D04B7@magilla.localdomain>

> Would this hack work: have the second utrace engine refuse to put a
> breakpoint wherever it suspects another engine may have put one?  

Sure, it's easy to notice the breakpoint instruction already there.

> Or even more pessimistically, can an engine know that another one is
> already monitoring a given target process, and give up at attach time?
> (That would defeat some of the promise of utrace, but so it goes.)

Yes, it can (UTRACE_ATTACH_EXCLUSIVE).


Thanks,
Roland


From basavarajjm at gmail.com  Thu Sep 27 13:39:35 2007
From: basavarajjm at gmail.com (Basavaraj Mathapati)
Date: Thu, 27 Sep 2007 19:09:35 +0530
Subject: Required utrace patch file for suse linux kernel
Message-ID: <c566ddd70709270639j47dd7583t471eeb264678af31@mail.gmail.com>

Hi,


Iam new to the list

Iam planning to work on uprobes, my system is running with suse linux
(kernel version ?linux-2.6.16.46-0.12, archi-i386  and x86-64) .
I tried with uprobe patch to linux kernel successfully, but not able to
execute the programs since it's stating
that utrace and utracehooks not found.
As my kernel is not with utrace patch applied.
Finally I tried with utrace patch downloaded form the given link
inux-2.6-utrace.patch<http://mail.google.com/mail/ldownloaded%20%20inux-2.6-utrace.patch>
 from http://people.redhat.com/roland/utrace/2.6-current/ ,
but some of hunks succeeded and some failed too.


Failures given below:

patching file fs/binfmt_elf_fdpic.c

Hunk #1 FAILED at 427.

1 out of 1 hunk FAILED -- saving rejects to file fs/binfmt_elf_fdpic.c.rej

patching file fs/binfmt_elf.c

Hunk #1 succeeded at 1003 (offset -33 lines).

patching file fs/binfmt_aout.c

Hunk #1 succeeded at 445 (offset -7 lines).

patching file fs/exec.c

Hunk #2 succeeded at 983 (offset -167 lines).

Hunk #3 succeeded at 1123 (offset -171 lines).

patching file fs/proc/base.c

Hunk #1 FAILED at 67.

Hunk #2 succeeded at 329 with fuzz 1 (offset 178 lines).

Hunk #3 succeeded at 447 with fuzz 2 (offset 214 lines).

Hunk #4 succeeded at 837 with fuzz 2 (offset 275 lines).

Hunk #5 succeeded at 864 (offset 275 lines).

Hunk #6 FAILED at 909.

2 out of 6 hunks FAILED -- saving rejects to file fs/proc/base.c.rej

patching file fs/proc/array.c

Hunk #1 succeeded at 74 (offset -1 lines).

Hunk #2 FAILED at 158.

Hunk #3 FAILED at 178.

Hunk #4 FAILED at 446.

3 out of 4 hunks FAILED -- saving rejects to file fs/proc/array.c.rej

patching file fs/binfmt_som.c

Hunk #1 succeeded at 272 (offset -13 lines).

patching file fs/binfmt_flat.c

Hunk #1 succeeded at 875 (offset -39 lines).

patching file security/selinux/include/objsec.h

Hunk #1 FAILED at 35.

1 out of 1 hunk FAILED -- saving rejects to file
security/selinux/include/objsec.h.rej

patching file security/selinux/hooks.c

Hunk #1 succeeded at 22 (offset -1 lines).

Hunk #2 succeeded at 154 (offset -8 lines).

Hunk #3 succeeded at 1278 (offset -93 lines).

Hunk #4 succeeded at 1704 (offset -143 lines).

Hunk #5 succeeded at 2578 with fuzz 2 (offset -187 lines).

Hunk #6 succeeded at 4094 (offset -453 lines).

Hunk #7 FAILED at 4183.

1 out of 7 hunks FAILED -- saving rejects to file
security/selinux/hooks.c.rej

patching file arch/s390/kernel/ptrace.c

Hunk #2 FAILED at 87.
                                       .........................continues

So can you plz suggest me which utarce and uprobe patch to be applied to the
above linux kernel? With some links to the docs on how to start

Regards
Basavaraj
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listman.redhat.com/archives/utrace-devel/attachments/20070927/a3e6a266/attachment.htm>

From mwielaard at redhat.com  Fri Sep 28 08:36:54 2007
From: mwielaard at redhat.com (Mark Wielaard)
Date: Fri, 28 Sep 2007 10:36:54 +0200
Subject: external systemtap meeting notes 20070816
In-Reply-To: <20070926203356.F26164D04B7@magilla.localdomain>
References: <mailman.8592.1187789486.29635.external-perftools-list@redhat.com>
	<y0mlkc3k23c.fsf@ton.toronto.redhat.com>
	<mailman.3771.1188945116.11517.external-perftools-list@redhat.com>
	<y0mtzq9q3tf.fsf@ton.toronto.redhat.com>
	<20070926084229.B17714D04B7@magilla.localdomain>
	<20070926120115.GT8964@redhat.com>
	<20070926203356.F26164D04B7@magilla.localdomain>
Message-ID: <1190968614.3866.10.camel@dijkstra.wildebeest.org>

On Wed, 2007-09-26 at 13:33 -0700, Roland McGrath wrote:
> > Would this hack work: have the second utrace engine refuse to put a
> > breakpoint wherever it suspects another engine may have put one?  
> 
> Sure, it's easy to notice the breakpoint instruction already there.

Would the second engine then also get an event notification when the
existing breakpoint gets hit? If so then it would be nice to have a
mechanism to detect a breakpoint being removed so an engine could say
"wait! I am still interested in monitoring when that address gets hit,
please keep that breakpoint there", and then just have the first engine
not be notified of any breakpoint events anymore, and only notify the
second engine from then on. (This is kind of what we do in Frysk, but
there of course we do all breakpoint handling in user space through one
ptrace wait monitoring thread that dispatches the events, so we are in
control of all breakpoint setting and breakpoint hit event detection.)

Cheers,

Mark