From gaspa at yattaweb.it Thu Jul 26 09:34:44 2007 From: gaspa at yattaweb.it (Andrea Gasparini) Date: Thu, 26 Jul 2007 11:34:44 +0200 Subject: First post? Message-ID: <200707261134.44922.gaspa@yattaweb.it> Hi, i'm glad to be the first post in this list.... :-D Thanks roland. I'll ask you a bunch of questions soon. Bye! -- -gaspa- ----------------------------------------------- --------- Powered by Debian GNU/Linux --------- ------ HomePage: iogaspa.altervista.org ------- -Il lunedi'dell'arrampicatore: www.lunedi.org - From renzo at cs.unibo.it Thu Jul 26 09:42:12 2007 From: renzo at cs.unibo.it (Renzo Davoli) Date: Thu, 26 Jul 2007 11:42:12 +0200 Subject: the first question (about the ML) Message-ID: <20070726094212.GB9546@cs.unibo.it> Hi Roland, it is possible to put a tag on the ML messages? (I mean [utrace-devel] in the subject)? I receive a huge amount of mail... I can miss messages without the tag (I can miss them even with the tag, but at a lower probability ;-) tnx renzo From roland at redhat.com Thu Jul 26 09:48:27 2007 From: roland at redhat.com (Roland McGrath) Date: Thu, 26 Jul 2007 02:48:27 -0700 (PDT) Subject: the first question (about the ML) In-Reply-To: Renzo Davoli's message of Thursday, 26 July 2007 11:42:12 +0200 <20070726094212.GB9546@cs.unibo.it> Message-ID: <20070726094827.29E2E4D058D@magilla.localdomain> I personally despise list tags and they are not the practice on mosts lists I'm on. There are the various List- headers and X-loop to make filtering easy. Thanks, Roland From roland at redhat.com Thu Jul 26 21:08:40 2007 From: roland at redhat.com (Roland McGrath) Date: Thu, 26 Jul 2007 14:08:40 -0700 (PDT) Subject: welcome to utrace-devel Message-ID: <20070726210840.B2B5C4D058D@magilla.localdomain> Now that several people have joined the list, I'd like to kick things off with some pointers. Some past discussions about utrace details have taken place on the systemtap at sourceware.org and frysk at sourceware.org mailing lists. If some posts here read like someone started in the middle of an ongoing discussion, it's because they did. If you are missing some context, please look for it in those lists' archives at http://sourceware.org/ml/systemtap/ and http://sourceware.org/ml/frysk/. Next, I call your attention to http://sourceware.org/systemtap/wiki/utrace This is an empty placeholder page now, but can be a place to hang informal stuff about utrace. I hope that folks on this list will want either to edit that wiki or to house a utrace wiki somewhere else if that's more convenient. I'm lousy at writing up web and wiki pages about stuff, but I'm always droning on with details in email. It would be great if people cull info from threads here for wiki pages about the state of things. Here are a few subjects I imagine people might want to start talking about. We have a whole list here, so please start new, narrow threads with appropriate Subject lines, don't just discuss everything in followups to this message. * future work http://people.redhat.com/roland/utrace/TODO is a list in terms so vague they probably only make sense to me. That can be elaborated, wikified, etc. * known bugs I have a few, I will put together a list. * arch porting status I'm not sure anyone has read utrace-arch-porting-howto.txt lately. Perhaps I should be posting to linux-arch more trying to drum up interest in doing ports. * utrace back into -mm? (upstream integration status generally) I don't really know anything you don't know, but maybe we could know more. Thanks, Roland From gaspa at yattaweb.it Sat Jul 28 14:51:50 2007 From: gaspa at yattaweb.it (Andrea Gasparini) Date: Sat, 28 Jul 2007 16:51:50 +0200 Subject: my support in utrace. Message-ID: <200707281651.50750.gaspa@yattaweb.it> Hi roland, hi all. I'm finishing another work that has nothing to do with linux, and in a few days i'll be ready to give some contribution. So, - now, i'm searching all discussions about utrace, and i'll add to a wiki page. If someone could give link via mailing list, i'll put it on the wiki. (disclaimer: i'm quite "untidy", so perhaps someone would sort the information... :-P ) - after that: i would take under my hands the uml port of utrace, do you know if some work has already done? In order to take a look i just imported roland git repository. I found that there isn't any master branch (which in all tutorial/howto is the default when you clone a repository). It don't give to me problem, just curious. Bye! -- -gaspa- ----------------------------------------------- --------- Powered by Debian GNU/Linux --------- ------ HomePage: iogaspa.altervista.org ------- -Il lunedi'dell'arrampicatore: www.lunedi.org - From gaspa at yattaweb.it Sat Jul 28 17:23:27 2007 From: gaspa at yattaweb.it (Andrea Gasparini) Date: Sat, 28 Jul 2007 19:23:27 +0200 Subject: my support in utrace. In-Reply-To: <200707281651.50750.gaspa@yattaweb.it> References: <200707281651.50750.gaspa@yattaweb.it> Message-ID: <200707281923.27489.gaspa@yattaweb.it> > - after that: i would take under my hands the uml port of utrace, do you > know if some work has already done? I saw that Dike already done some patch in order to make uml work with utrace. ( from http://lkml.org/lkml/2007/3/16/407 and the file http://people.redhat.com/roland/utrace/2.6-current/utrace-tracehook-um.patch ) But from what I read, it seems that is doens't work (or compile). So, what's the state of art? Thanks, bye! -- -gaspa- ----------------------------------------------- --------- Powered by Debian GNU/Linux --------- ------ HomePage: iogaspa.altervista.org ------- -Il lunedi'dell'arrampicatore: www.lunedi.org - From renzo at cs.unibo.it Sat Jul 28 20:06:04 2007 From: renzo at cs.unibo.it (Renzo Davoli) Date: Sat, 28 Jul 2007 22:06:04 +0200 Subject: my support in utrace. In-Reply-To: <200707281923.27489.gaspa@yattaweb.it> References: <200707281651.50750.gaspa@yattaweb.it> <200707281923.27489.gaspa@yattaweb.it> Message-ID: <20070728200604.GB24588@cs.unibo.it> > ( from http://lkml.org/lkml/2007/3/16/407 and the file > http://people.redhat.com/roland/utrace/2.6-current/utrace-tracehook-um.patch ) AFAIK Jeff's patch just eliminates ptrace code from um arch. I tried it some time ago, it worked but ptrace was missing. I don't know if something got broken from the time of the test. renzo From roland at redhat.com Mon Jul 30 03:22:00 2007 From: roland at redhat.com (Roland McGrath) Date: Sun, 29 Jul 2007 20:22:00 -0700 (PDT) Subject: my support in utrace. In-Reply-To: Andrea Gasparini's message of Saturday, 28 July 2007 16:51:50 +0200 <200707281651.50750.gaspa@yattaweb.it> Message-ID: <20070730032200.609934D058B@magilla.localdomain> > - after that: i would take under my hands the uml port of utrace, do you > know if some work has already done? Jeff Dike expressed interest in doing it, but I don't think he has done any of the regset work. What's there now (in utrace-tracehook-um.patch) is just the tracehook branch part of the work, which is enough simply to get UML to compile and work again with CONFIG_UTRACE=n (and hence, no ptrace). UML is different from a normal port, because it is not an architecture itself, but is a different variant port of each other architecture. The regset details are already chosen for each arch, and just need to be implemented compatibly in the UML context. > In order to take a look i just imported roland git repository. I found that > there isn't any master branch (which in all tutorial/howto is the default > when you clone a repository). > It don't give to me problem, just curious. I am no great expert on git, and even less when I started with it. The git.fedoraproject.org servers don't let me do much other than push heads, so I cannot easily clean up the repository there like I clean up my own local repositories. The right thing would be to have a refs/heads/master there is a symbolic ref for refs/heads/utrace-ptrace-compat. Thanks, Roland From jdike at linux.intel.com Mon Jul 30 13:48:26 2007 From: jdike at linux.intel.com (jdike) Date: Mon, 30 Jul 2007 09:48:26 -0400 Subject: my support in utrace. In-Reply-To: <20070730032200.609934D058B@magilla.localdomain> Message-ID: <000401c7d2b0$4e332440$be347f0a@amr.corp.intel.com> > Jeff Dike expressed interest in doing it, but I don't think he has done > any of the regset work. What's there now (in utrace-tracehook-um.patch) > is just the tracehook branch part of the work, which is enough simply to > get UML to compile and work again with CONFIG_UTRACE=n (and hence, no > ptrace). Correct. And I welcome anyone doing the UML utrace work :-) I'll be happy to provide whatever help is needed. > UML is different from a normal port, because it is not an architecture > itself, but is a different variant port of each other architecture. The > regset details are already chosen for each arch, and just need to be > implemented compatibly in the UML context. UML is its own architecture. It is definitely not a variant of the underlying arch. Probably 90% is the same on all arches, as they all have the same system call interface. The remaining part is the architecture stuff which shows through, like the register set (which ptrace/utrace has to deal with more than the other subsystems). Jeff From cmoller at redhat.com Wed Aug 1 04:15:13 2007 From: cmoller at redhat.com (Chris Moller) Date: Wed, 01 Aug 2007 00:15:13 -0400 Subject: A kernel-module interface to utrace. Message-ID: <46B008D1.60308@redhat.com> In case anyone's interested, I've written a fairly substantial beginning of a module that provides a user-space interface to utrace. The interface to the module consists of a collection of /proc/utrace entries that can be accessed via read(), write(), and ioctl(). The module supports, via a registration procedure, any number of concurrent clients such as frysk or any other debugger that needs the capabilities utrace offers, and for each registered client supports any number of attached processes. A present, the module provides a number of basic services, including: * Registering and unregistering clients. * Attaching and detaching processes for registered clients. * Running and quiescing attached processes. * Extracting the contents of the register sets supported by utrace under the existing architecture. * Reading process memory. * Extracting the environment strings in effect for attached tasks. * Extracting the process memory map (Some of these services are provided courtesy of utrace, some via other means available within the kernel.) For every client process, two /proc entries are created, /proc/utrace/$PID/cmd, and /proc/utrace/$PID/resp (where $PID is that of the client process). The first of these, via write() (or pwrite()) and ioctl() are used by clients to control the behavior of the module and to synchronously extract information. The "resp" entry is intended to be the file associated with a blocking read() and provides a means by which asynchronous utrace report_* callbacks can notify the client that various events (such as syscalls, execs, clones, exits, signals, etc.) have occurred. (Use of this capability requires two client threads, one for control operations, the other to wait for report_* events.) There are a lot of obvious capabilities that need implementing: writing regs, writing memory, possibly other capabilities usually provided by various /proc/$PID entries but more efficiently provided by the module, plus whatever else people want. A fairly brief look at uprobes suggests that those capabilities could be given a user-space interface through the module. Source code for the module is available via cvs at sourceware.org:/cvs/frysk, checkout frysk-utrace. The structure of the code is still in flux, but at the moment it's: /udb-*.[ch] source for a fake client to test the module /utracer/ source tree for the module and its userspace i/f /utracer/module source for the kernel module itself /utracer/utracer source for the userspace i/f to the module /utracer/include headers for use by by the client The whole lot, including the fake client, the module, and the module i/f, can be built by pounding in "./do_make"--I haven't gotten around to doing anything with automake and the like. You'll need to have the kernel headers installed. There are two ways to access the kernel: directly via read()s, write()s, and ioctl()s to the module (in the utracer/module directory), or through the interface library in utracer/utracer. The i/f lib source can be mined for examples on how to use the direct fileio i/f; the top-level udb-*.[ch] (mostly--there's still some direct fileio there, but I'm slowly migrating away from it) offers examples of the use of the lib i/f. Keep in mind that this is pre-alpha level code that, being kernel stuff, may slag your computer completely. (Though the worst it's ever done to my machines is lock the kernel up, and even that hasn't happened in a couple of weeks.) Another pending i/f I'm planning, and the major reason I implemented the userspace i/f hiding the fileio, is a Java wrapper around the userspace i/f--the fileio passes data in C structs which I'm fairly sure Java doesn't grok. (Caveat: I'm not a Java guy; see the .sig below.) At the moment, the fake client is i386-only, but all the arch-specific stuff is (or should be...) contained in the files udb-i386.[ch]. With a bit of porting and a brighter configure/build process, it shouldn't be hard to support other archs. So far as I know, the module and the interface lib are arch-independent (at least, any arch-dependent bits are taken care of by the kernel config of the kernel headers). Have fun; comments welcome, Chris Moller. -- Chris Moller Java: the blunt scissors of programming languages. -- Dave Thomas -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 252 bytes Desc: OpenPGP digital signature URL: From roland at redhat.com Thu Aug 2 05:55:39 2007 From: roland at redhat.com (Roland McGrath) Date: Wed, 1 Aug 2007 22:55:39 -0700 (PDT) Subject: utrace known bugs Message-ID: <20070802055540.037754D04B9@magilla.localdomain> Here are outstanding problems (aside from TODO items) I know of off hand. These range from bug reports I haven't looked into, to subtle old XXX comments for which I'll have to dredge up memories of the details. I hope someone likes to wikify or otherwise organize this list somewhere. * detach vs report race Alexey Dobriyan identified a race viz dead_engine_ops vs engine->flags Race explanation makes sense, but so far never been reproduced. Needs a little thought. * https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=243534 Unknown wedge, maybe stale report * https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=248532 "tkill(SIGCONT) is not reported by waitpid()" Unexamined * ia64 RBS scheme This is tortured IA64-specific issue that was known from the start but never properly handled. The original contributors of the IA64 port dropped the ball on this part of the implementation. It needs someone with coherent understanding of the ia64 RBS hardware and how the kernel uses it, to consult. * https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=207002 crash, needs investigation * utrace_inject_signal ENOSYS for non-utrace_get_signal case Needs implementation with careful synchronization. Probably interface details change for "engine interaction" TODO item, making implementation story here different. * ptrace race conditions See XXX in kernel/ptrace.c; need to rethink synchronization/life-cycle for ptrace_state. Thanks, Roland From adobriyan at sw.ru Thu Aug 2 12:31:12 2007 From: adobriyan at sw.ru (Alexey Dobriyan) Date: Thu, 2 Aug 2007 16:31:12 +0400 Subject: utrace known bugs In-Reply-To: <20070802055540.037754D04B9@magilla.localdomain> References: <20070802055540.037754D04B9@magilla.localdomain> Message-ID: <20070802123112.GD6553@localhost.sw.ru> On Wed, Aug 01, 2007 at 10:55:39PM -0700, Roland McGrath wrote: > Here are outstanding problems (aside from TODO items) I know of off hand. > These range from bug reports I haven't looked into, to subtle old XXX > comments for which I'll have to dredge up memories of the details. > I hope someone likes to wikify or otherwise organize this list somewhere. > > * detach vs report race > Alexey Dobriyan identified a race viz > dead_engine_ops vs engine->flags > Race explanation makes sense, but so far never been reproduced. > Needs a little thought. > > * https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=243534 > Unknown wedge, maybe stale report > > * https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=248532 > "tkill(SIGCONT) is not reported by waitpid()" > Unexamined > > * ia64 RBS scheme > This is tortured IA64-specific issue that was known from the start but > never properly handled. The original contributors of the IA64 port > dropped the ball on this part of the implementation. It needs someone > with coherent understanding of the ia64 RBS hardware and how the kernel > uses it, to consult. > > * https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=207002 > crash, needs investigation > > * utrace_inject_signal ENOSYS for non-utrace_get_signal case > Needs implementation with careful synchronization. > Probably interface details change for "engine interaction" TODO item, > making implementation story here different. > > * ptrace race conditions > See XXX in kernel/ptrace.c; need to rethink synchronization/life-cycle > for ptrace_state. * unbounded utrace_engine_cache growth started from 31a9ef5cfcdbae804e3e180c158bf2352728765a, nobody knows why testcase: at the end of http://marc.info/?l=linux-kernel&m=117128445312243&w=2 * _pointer_ to struct utrace, which I personally count as design bug. Rationale to fold struct utrace into task_struct is that lifetime rules of task_struct are well established, well tested and so on. As was demonstrated it also removes much complexity from attaching logic. There is one more quick crash in rh bugzilla, but I'll post patch here very soon. From adobriyan at sw.ru Thu Aug 2 13:08:41 2007 From: adobriyan at sw.ru (Alexey Dobriyan) Date: Thu, 2 Aug 2007 17:08:41 +0400 Subject: [PATCH] late checking of permissions during PTRACE_ATTACH Message-ID: <20070802130841.GE6553@localhost.sw.ru> ptrace_attach() does permissions check _after_ actual attaching. Given that utrace_attach is quite non-trivial operation there is no way such ordering should be allowed -- the following program should crash the box in less than second. Something like: ./a.out $(pidof syslogd) #include #include int main(int argc, char *argv[]) { int pid = atoi(argv[1]); while (1) ptrace(PTRACE_ATTACH, pid, NULL, NULL); return 0; } Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP: [<0000000000000000>] PGD 17f3ed067 PUD 17ec80067 PMD 0 Oops: 0010 [1] PREEMPT SMP CPU 1 Modules linked in: rtc Pid: 400, comm: udevd Not tainted 2.6.23-rc1-utrace #1 RIP: 0010:[<0000000000000000>] [<0000000000000000>] RSP: 0000:ffff81017ee4dcb0 EFLAGS: 00010202 RAX: ffffffff803cdbe0 RBX: ffff81017dfd6350 RCX: ffff81017f7a3080 RDX: 0000000000000000 RSI: ffff81017f7a3080 RDI: ffff81017dfd6350 RBP: 0000000000000021 R08: 0000000000000038 R09: ffffffff8025145e R10: 0000000000000007 R11: 0000000000000246 R12: 0000000000000020 R13: ffff81017f7a3080 R14: ffff81017ee4def8 R15: ffff81017ecca6e0 FS: 00002b98567836d0(0000) GS:ffff81017fc017c0(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000000000 CR3: 000000017f1e9000 CR4: 00000000000006a0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process udevd (pid: 400, threadinfo ffff81017ee4c000, task ffff81017f7a3080) Stack: ffffffff8025058d ffff81017ecca6f0 ffff81017f7a3080 ffff81017ecca6e0 0000000000000007 ffff81017ee4dd58 ffff81017ee4def8 ffff81017ee4de78 ffffffff802506b5 ffff81017ecca700 ffff81017f7a3080 0000000000000007 Call Trace: [] report_quiescent+0x36/0x13c [] utrace_quiescent+0x22/0x215 [] utrace_get_signal+0x513/0x576 [] get_signal_to_deliver+0x10c/0x3d3 [] do_notify_resume+0x9c/0x728 [] _spin_unlock_irqrestore+0x3d/0x69 [] trace_hardirqs_on+0x116/0x13a [] _spin_unlock_irqrestore+0x49/0x69 [] trace_hardirqs_on_thunk+0x35/0x37 [] trace_hardirqs_on+0x116/0x13a [] retint_signal+0x46/0x90 Code: Bad RIP value. RIP [<0000000000000000>] RSP CR2: 0000000000000000 Kernel panic - not syncing: Fatal exception NOTE, NOTE, NOTE: this is exactly same backtrace as in "dead_engine_ops vs engine->flags" race, so it's reproducable :) If by some miracle RHEL5 will also be patched, closes https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=245735 Signed-off-by: Alexey Dobriyan --- kernel/ptrace.c | 42 +++++++++++++++++++++--------------------- 1 file changed, 21 insertions(+), 21 deletions(-) --- a/kernel/ptrace.c +++ b/kernel/ptrace.c @@ -327,6 +327,8 @@ static int ptrace_attach(struct task_struct *task) goto bad; if (!task->mm) /* kernel threads */ goto bad; + if (!ptrace_may_attach(task)) + goto bad; pr_debug("%d ptrace_attach %d state %lu exit_code %x\n", current->pid, task->pid, task->state, task->exit_code); @@ -346,29 +348,27 @@ static int ptrace_attach(struct task_struct *task) current->pid, task->pid, task->state, task->exit_code); NO_LOCKS; - if (ptrace_may_attach(task)) { - state = ptrace_setup(task, engine, current, 0, - capable(CAP_SYS_PTRACE)); - if (IS_ERR(state)) - retval = PTR_ERR(state); - else { - retval = ptrace_setup_finish(task, state); + state = ptrace_setup(task, engine, current, 0, + capable(CAP_SYS_PTRACE)); + if (IS_ERR(state)) + retval = PTR_ERR(state); + else { + retval = ptrace_setup_finish(task, state); - pr_debug("%d ptrace_attach %d after ptrace_update (%d)" - " %lu exit_code %x\n", - current->pid, task->pid, retval, - task->state, task->exit_code); + pr_debug("%d ptrace_attach %d after ptrace_update (%d)" + " %lu exit_code %x\n", + current->pid, task->pid, retval, + task->state, task->exit_code); - if (retval) { - /* - * It died before we enabled any callbacks. - */ - if (retval == -EALREADY) - retval = -ESRCH; - BUG_ON(retval != -ESRCH); - ptrace_state_unlink(state); - ptrace_done(state); - } + if (retval) { + /* + * It died before we enabled any callbacks. + */ + if (retval == -EALREADY) + retval = -ESRCH; + BUG_ON(retval != -ESRCH); + ptrace_state_unlink(state); + ptrace_done(state); } } NO_LOCKS; From ananth at in.ibm.com Thu Aug 2 13:20:35 2007 From: ananth at in.ibm.com (Ananth N Mavinakayanahalli) Date: Thu, 2 Aug 2007 18:50:35 +0530 Subject: utrace known bugs In-Reply-To: <20070802123112.GD6553@localhost.sw.ru> References: <20070802055540.037754D04B9@magilla.localdomain> <20070802123112.GD6553@localhost.sw.ru> Message-ID: <20070802132035.GA13158@in.ibm.com> On Thu, Aug 02, 2007 at 04:31:12PM +0400, Alexey Dobriyan wrote: > On Wed, Aug 01, 2007 at 10:55:39PM -0700, Roland McGrath wrote: > > Here are outstanding problems (aside from TODO items) I know of off hand. > > These range from bug reports I haven't looked into, to subtle old XXX > > comments for which I'll have to dredge up memories of the details. > > I hope someone likes to wikify or otherwise organize this list somewhere. > > > > * detach vs report race > > Alexey Dobriyan identified a race viz > > dead_engine_ops vs engine->flags > > Race explanation makes sense, but so far never been reproduced. > > Needs a little thought. > > > > * https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=243534 > > Unknown wedge, maybe stale report > > > > * https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=248532 > > "tkill(SIGCONT) is not reported by waitpid()" > > Unexamined > > > > * ia64 RBS scheme > > This is tortured IA64-specific issue that was known from the start but > > never properly handled. The original contributors of the IA64 port > > dropped the ball on this part of the implementation. It needs someone > > with coherent understanding of the ia64 RBS hardware and how the kernel > > uses it, to consult. > > > > * https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=207002 > > crash, needs investigation > > > > * utrace_inject_signal ENOSYS for non-utrace_get_signal case > > Needs implementation with careful synchronization. > > Probably interface details change for "engine interaction" TODO item, > > making implementation story here different. > > > > * ptrace race conditions > > See XXX in kernel/ptrace.c; need to rethink synchronization/life-cycle > > for ptrace_state. > > * unbounded utrace_engine_cache growth > started from 31a9ef5cfcdbae804e3e180c158bf2352728765a, > nobody knows why > testcase: at the end of http://marc.info/?l=linux-kernel&m=117128445312243&w=2 > > * _pointer_ to struct utrace, which I personally count as design bug. > > Rationale to fold struct utrace into task_struct is that lifetime > rules of task_struct are well established, well tested and so on. As > was demonstrated it also removes much complexity from attaching logic. > > There is one more quick crash in rh bugzilla, but I'll post patch here very soon. Have you tested this with the latest utrace bits? AFAICS this specific issue was fixed in June, with this one-liner: --- a/include/linux/tracehook.h +++ b/include/linux/tracehook.h @@ -584,7 +584,7 @@ static inline void tracehook_report_deat { smp_mb(); if (tsk_utrace_flags(tsk) & (UTRACE_EVENT(DEATH) - | UTRACE_ACTION_QUIESCE)) + | UTRACE_EVENT(QUIESCE))) utrace_report_death(tsk, death_cookie); } Ananth From adobriyan at sw.ru Thu Aug 2 14:31:43 2007 From: adobriyan at sw.ru (Alexey Dobriyan) Date: Thu, 2 Aug 2007 18:31:43 +0400 Subject: utrace known bugs In-Reply-To: <20070802132035.GA13158@in.ibm.com> References: <20070802055540.037754D04B9@magilla.localdomain> <20070802123112.GD6553@localhost.sw.ru> <20070802132035.GA13158@in.ibm.com> Message-ID: <20070802143143.GF6553@localhost.sw.ru> On Thu, Aug 02, 2007 at 06:50:35PM +0530, Ananth N Mavinakayanahalli wrote: > On Thu, Aug 02, 2007 at 04:31:12PM +0400, Alexey Dobriyan wrote: > > On Wed, Aug 01, 2007 at 10:55:39PM -0700, Roland McGrath wrote: > > > Here are outstanding problems (aside from TODO items) I know of off hand. > > > These range from bug reports I haven't looked into, to subtle old XXX > > > comments for which I'll have to dredge up memories of the details. > > > I hope someone likes to wikify or otherwise organize this list somewhere. > > > > > > * detach vs report race > > > Alexey Dobriyan identified a race viz > > > dead_engine_ops vs engine->flags > > > Race explanation makes sense, but so far never been reproduced. > > > Needs a little thought. > > > > > > * https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=243534 > > > Unknown wedge, maybe stale report > > > > > > * https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=248532 > > > "tkill(SIGCONT) is not reported by waitpid()" > > > Unexamined > > > > > > * ia64 RBS scheme > > > This is tortured IA64-specific issue that was known from the start but > > > never properly handled. The original contributors of the IA64 port > > > dropped the ball on this part of the implementation. It needs someone > > > with coherent understanding of the ia64 RBS hardware and how the kernel > > > uses it, to consult. > > > > > > * https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=207002 > > > crash, needs investigation > > > > > > * utrace_inject_signal ENOSYS for non-utrace_get_signal case > > > Needs implementation with careful synchronization. > > > Probably interface details change for "engine interaction" TODO item, > > > making implementation story here different. > > > > > > * ptrace race conditions > > > See XXX in kernel/ptrace.c; need to rethink synchronization/life-cycle > > > for ptrace_state. > > > > * unbounded utrace_engine_cache growth > > started from 31a9ef5cfcdbae804e3e180c158bf2352728765a, > > nobody knows why > > testcase: at the end of http://marc.info/?l=linux-kernel&m=117128445312243&w=2 > > > > * _pointer_ to struct utrace, which I personally count as design bug. > > > > Rationale to fold struct utrace into task_struct is that lifetime > > rules of task_struct are well established, well tested and so on. As > > was demonstrated it also removes much complexity from attaching logic. > > > > There is one more quick crash in rh bugzilla, but I'll post patch here very soon. > > Have you tested this with the latest utrace bits? struct utrace double free? I can't because of "unbounded utrace_engine_cache growth". "while (1) ptrace(PTRACE_ATTACH," crash -- yes it was tested with latest utrace. > AFAICS this specific > issue was fixed in June, with this one-liner: > > --- a/include/linux/tracehook.h > +++ b/include/linux/tracehook.h > @@ -584,7 +584,7 @@ static inline void tracehook_report_deat > { > smp_mb(); > if (tsk_utrace_flags(tsk) & (UTRACE_EVENT(DEATH) > - | UTRACE_ACTION_QUIESCE)) > + | UTRACE_EVENT(QUIESCE))) > utrace_report_death(tsk, death_cookie); > } From roland at redhat.com Fri Aug 3 04:16:52 2007 From: roland at redhat.com (Roland McGrath) Date: Thu, 2 Aug 2007 21:16:52 -0700 (PDT) Subject: uprobes: keep thread blocked after the handler returns In-Reply-To: Jim Keniston's message of Thursday, 2 August 2007 13:57:05 -0700 <1186088225.4521.52.camel@dyn9047018076.beaverton.ibm.com> Message-ID: <20070803041652.ECF254D04B5@magilla.localdomain> [This discussion started on the systemtap at sourceware.org mailing list.] > On Thu, 2007-08-02 at 14:19 -0700, Roland McGrath wrote: > > > Why not just prevent the handler from returning until you want the > > > thread to unblock? The handler could sleep on something trigger from > > > user space. > > > > That is a no-no for a utrace callback. It prevents other tracing > > facilities from doing anything useful. > > Yeah, you're pretty adamant about that in Documentation/utrace.txt, but > I always wondered why. Using gdb, I can stop a traced app and keep it > stopped all weekend. If nobody else is using the app, why should > anybody care? You can certainly keep the application stopped all year. You just have to do it the right way, by setting QUIESCE. What you can't do is hijack a user thread's scheduling context for your own purposes. That is not friendly. It prevents any other tracing engine (that is following the rules) from examining the thread, because it is not properly quiescent, just blocked somewhere in the kernel. If other tracing engines detach while you block, it prevents the data structures being cleaned up, for the same reason. It prevents anyone poking around the system with ps or whatnot from seeing a proper "T (tracing stop)" status to indicate why that process has not progressed all weekend. It breaks SIGKILL. If you do much work, other than block, it charges all those resources to the user thread instead of explicitly to instrumentation code, which may or may not matter, but is always relevant to the general issue. One of the key purposes of utrace is to make it easier not to screw up in all these ways. Look, bottom line, you are writing kernel module code, in an enforcement sense you already have carte blanche. But if you want me to change my definition of "well-behaved", it ain't gonna happen. > Sure, utrace allows multiple facilities to trace the same app, and we > may see circumstances where multiple users skew each others' results by > tracing the same app with blocking handlers. That doesn't seem like a > very good reason for preventing utrace handlers from doing useful stuff > that can't be accomplished without blocking. This misses most of the reasons well-behaved callbacks are important. (For the executive summary, "breaks SIGKILL" is all anyone really needs to know. Any debugging facility on Linux is broken by definition if it can ever prevent the timely completion of death by SIGKILL.) It also gives the impression that you think this API discipline for kernel modules in some way constrains what control you can exercise over user threads. That is not so (except you can't break SIGKILL). The utrace callbacks are the moral equivalent of interrupt handlers, ones that run with all other interrupts disabled. The handler utterly monopolizes the resource in question until it returns, in this case the thread rather than the CPU. They are not literally interrupt handlers, and by design run in a context that is as "safe" as it gets in kernel mode. But by analogy, considering the potential to break the system's behavior overall, they are in that vein. Think of writing a tracing engine as like writing a device driver. The devices you manipulate are user threads. You can put the device into a "frozen" state so it doesn't run off changing its state, but you don't do it by blocking in the interrupt handler. If instead of utrace callbacks, I had given you only a queued-event interface via something like an in-kernel socket, with "keep the thread quiescent until I dequeue this event and reply" being the only "immediate action" option, I suspect you would not be complaining. In either case, you have to structure your code in the same way. The main control flow of *your* logic takes place in a context *you* provide, either the calling thread from the userland debugger doing some control operation, or a kernel service thread you create. In this case, you also have to write the utrace callbacks that queue events or post wakeups for your main control code, as in a device interrupt handler. utrace is the *lowest layer*, it's not *the* interface. Perhaps most uses want something somewhere on the spectrum closer to the queued-event model, i.e. an interface giving a small set of things to do immediately on an event (that can be implemented in well-behaved ways) and making higher-level synchronization easy with your main control interface and with simple things like quiescing multiple threads. (I intend to produce a higher layer doing those things, but that is not the utrace layer.) My analogy to interrupt handlers is slightly dramatic. In fact, you can freely do all sorts of things in utrace callbacks. It's fine to do small blocks for memory allocation, even paging to access user memory, maybe even some i/o if you're sure it's quick. Anything that the thread could do itself from user mode, or in kernel code that is "instantaneous" (or at least uninterruptible) from the user-mode perspective. You can wait for a mutex/sem for some subsystem data or some lock of your own. If you have a locking bug and block forever there, it's not the end of the world. (It will break SIGKILL, and make the system take forever trying to do a graceful shutdown when you try to reboot.) Just no blocking until Tuesday, nor anything that is not close-by in your direct control with straightforward reasons to believe it won't ever block arbitrarily. Any kind of blocking as an explicit synchronization mechanism for the thread itself does not qualify. These are called rules for well-behaved tracing engines. They are there to help you mitigate the impact of bugs you can write, be kind to the environment, play well with others, etc. If you break the rules in a prototype, noone will send you to the stockade. Please help me with suggestions on how Documentation/utrace.txt can present this issue better. It seems it's managing to communicate "Roland doesn't want you to do it", but not why it's good for you, and for motherhood and apple pie, national security, and global child welfare, to refrain. Thanks, Roland From shaohua.li at intel.com Tue Aug 7 05:23:49 2007 From: shaohua.li at intel.com (Shaohua Li) Date: Tue, 07 Aug 2007 13:23:49 +0800 Subject: [PATCH]utrace: IA64 RSE bug Message-ID: <1186464229.784.9.camel@sli10-conroe.sh.intel.com> In ptrace case, user space RSE might be newer than kernel RSE. To avoid stale RSE is used when return to userspace, this patch synchronize user space RSE to kernel RSE. Also, as TIF_ALLWORK_MASK bits are limited, TIF_NOTIFY_RESUME is overrided. patch is against latest utrace source (2.6.22 base + utrace patches) Signed-off-by: Bibo Mao Signed-off-by: Shaohua Li ============================================================== arch/ia64/kernel/perfmon.c | 21 ++-------------- arch/ia64/kernel/process.c | 14 +++++++++++ arch/ia64/kernel/ptrace.c | 52 +++++++++++++++++++++++++++++++++++++++++ include/asm-ia64/ptrace.h | 1 include/asm-ia64/thread_info.h | 4 +++ 5 files changed, 74 insertions(+), 18 deletions(-) Index: 2.6.22/arch/ia64/kernel/process.c =================================================================== --- 2.6.22.orig/arch/ia64/kernel/process.c 2007-08-07 10:48:04.000000000 +0800 +++ 2.6.22/arch/ia64/kernel/process.c 2007-08-07 11:40:42.000000000 +0800 @@ -154,6 +154,17 @@ show_regs (struct pt_regs *regs) show_stack(NULL, NULL); } +void tsk_clear_notify_resume(struct task_struct *tsk) +{ +#ifdef CONFIG_PERFMON + if (tsk->thread.pfm_needs_checking) + return; +#endif + if (test_ti_thread_flag(task_thread_info(tsk), TIF_RESTORE_RSE)) + return; + clear_ti_thread_flag(task_thread_info(tsk), TIF_NOTIFY_RESUME); +} + void do_notify_resume_user (sigset_t *unused, struct sigscratch *scr, long in_syscall) { @@ -172,6 +183,9 @@ do_notify_resume_user (sigset_t *unused, /* deal with pending signal delivery */ if (test_thread_flag(TIF_SIGPENDING)||test_thread_flag(TIF_RESTORE_SIGMASK)) ia64_do_signal(scr, in_syscall); + /* copy user rbs to kernel rbs */ + if (unlikely(test_thread_flag(TIF_RESTORE_RSE))) + ia64_sync_krbs(current); } static int pal_halt = 1; Index: 2.6.22/arch/ia64/kernel/ptrace.c =================================================================== --- 2.6.22.orig/arch/ia64/kernel/ptrace.c 2007-08-07 10:48:04.000000000 +0800 +++ 2.6.22/arch/ia64/kernel/ptrace.c 2007-08-07 12:54:26.000000000 +0800 @@ -554,6 +554,25 @@ ia64_sync_user_rbs (struct task_struct * return 0; } +long +ia64_sync_kernel_rbs (struct task_struct *child, struct switch_stack *sw, + unsigned long user_rbs_start, unsigned long user_rbs_end) +{ + unsigned long addr, val; + long ret; + + /* now copy word for word from user rbs to kernel rbs: */ + for (addr = user_rbs_start; addr < user_rbs_end; addr += 8) { + if (access_process_vm(child, addr, &val, sizeof(val), 0) + != sizeof(val)) + return -EIO; + ret = ia64_poke(child, sw, user_rbs_end, addr, val); + if (ret < 0) + return ret; + } + return 0; +} + /* * Write f32-f127 back to task->thread.fph if it has been modified. */ @@ -728,6 +747,10 @@ syscall_trace_enter (long arg0, long arg if (test_thread_flag(TIF_SYSCALL_TRACE)) tracehook_report_syscall(®s, 0); + /* copy user rbs to kernel rbs */ + if (test_thread_flag(TIF_RESTORE_RSE)) + ia64_sync_krbs(current); + if (unlikely(current->audit_context)) { long syscall; int arch; @@ -764,6 +787,10 @@ syscall_trace_leave (long arg0, long arg if (test_thread_flag(TIF_SYSCALL_TRACE)) tracehook_report_syscall(®s, 1); + /* copy user rbs to kernel rbs */ + if (test_thread_flag(TIF_RESTORE_RSE)) + ia64_sync_krbs(current); + if (test_thread_flag(TIF_SINGLESTEP)) { force_sig(SIGTRAP, current); /* XXX */ tracehook_report_syscall_step(®s); @@ -1416,9 +1443,34 @@ gpregs_writeback(struct task_struct *tar const struct utrace_regset *regset, int now) { + if (test_and_set_tsk_thread_flag(target, TIF_RESTORE_RSE)) + return 0; + tsk_set_notify_resume(target); return do_regset_call(do_gpregs_writeback, target, regset, 0, 0, NULL, NULL); } +static void do_gpregs_readback(struct unw_frame_info *info, void *arg) +{ + struct pt_regs *pt; + utrace_getset_t *dst = arg; + unsigned long urbs_end; + + if (unw_unwind_to_user(info) < 0) + return; + pt = task_pt_regs(dst->target); + urbs_end = ia64_get_user_rbs_end(dst->target, pt, NULL); + dst->ret = ia64_sync_kernel_rbs(dst->target, info->sw, pt->ar_bspstore, urbs_end); +} +/* + * This is called to read back the register backing store. + */ +long ia64_sync_krbs(struct task_struct *target) +{ + clear_tsk_thread_flag(target, TIF_RESTORE_RSE); + tsk_clear_notify_resume(target); + return do_regset_call(do_gpregs_readback, target, NULL, 0, 0, NULL, NULL); +} + static int fpregs_active(struct task_struct *target, const struct utrace_regset *regset) { Index: 2.6.22/include/asm-ia64/ptrace.h =================================================================== --- 2.6.22.orig/include/asm-ia64/ptrace.h 2007-08-07 10:48:04.000000000 +0800 +++ 2.6.22/include/asm-ia64/ptrace.h 2007-08-07 10:48:21.000000000 +0800 @@ -292,6 +292,7 @@ struct switch_stack { unsigned long, long); extern void ia64_flush_fph (struct task_struct *); extern void ia64_sync_fph (struct task_struct *); + extern long ia64_sync_krbs(struct task_struct *); extern long ia64_sync_user_rbs (struct task_struct *, struct switch_stack *, unsigned long, unsigned long); Index: 2.6.22/include/asm-ia64/thread_info.h =================================================================== --- 2.6.22.orig/include/asm-ia64/thread_info.h 2007-08-07 10:48:04.000000000 +0800 +++ 2.6.22/include/asm-ia64/thread_info.h 2007-08-07 11:08:45.000000000 +0800 @@ -71,6 +71,9 @@ struct thread_info { #define alloc_task_struct() ((struct task_struct *)__get_free_pages(GFP_KERNEL | __GFP_COMP, KERNEL_STACK_SIZE_ORDER)) #define free_task_struct(tsk) free_pages((unsigned long) (tsk), KERNEL_STACK_SIZE_ORDER) +#define tsk_set_notify_resume(tsk) \ + set_ti_thread_flag(task_thread_info(tsk), TIF_NOTIFY_RESUME) +extern void tsk_clear_notify_resume(struct task_struct *tsk); #endif /* !__ASSEMBLY */ /* @@ -86,6 +89,7 @@ struct thread_info { #define TIF_SYSCALL_AUDIT 4 /* syscall auditing active */ #define TIF_SINGLESTEP 5 /* restore singlestep on return to user mode */ #define TIF_RESTORE_SIGMASK 6 /* restore signal mask in do_signal() */ +#define TIF_RESTORE_RSE 9 /* task need RSE synchronization */ #define TIF_POLLING_NRFLAG 16 /* true if poll_idle() is polling TIF_NEED_RESCHED */ #define TIF_MEMDIE 17 #define TIF_MCA_INIT 18 /* this task is processing MCA or INIT */ Index: 2.6.22/arch/ia64/kernel/perfmon.c =================================================================== --- 2.6.22.orig/arch/ia64/kernel/perfmon.c 2007-08-07 10:48:04.000000000 +0800 +++ 2.6.22/arch/ia64/kernel/perfmon.c 2007-08-07 11:01:07.000000000 +0800 @@ -586,21 +586,6 @@ pfm_put_task(struct task_struct *task) } static inline void -pfm_set_task_notify(struct task_struct *task) -{ - struct thread_info *info; - - info = (struct thread_info *) ((char *) task + IA64_TASK_SIZE); - set_bit(TIF_NOTIFY_RESUME, &info->flags); -} - -static inline void -pfm_clear_task_notify(void) -{ - clear_thread_flag(TIF_NOTIFY_RESUME); -} - -static inline void pfm_reserve_page(unsigned long a) { SetPageReserved(vmalloc_to_page((void *)a)); @@ -3730,7 +3715,7 @@ pfm_restart(pfm_context_t *ctx, void *ar PFM_SET_WORK_PENDING(task, 1); - pfm_set_task_notify(task); + tsk_set_notify_resume(task); /* * XXX: send reschedule if task runs on another CPU @@ -5087,7 +5072,7 @@ pfm_handle_work(void) PFM_SET_WORK_PENDING(current, 0); - pfm_clear_task_notify(); + tsk_clear_notify_resume(current); regs = task_pt_regs(current); @@ -5455,7 +5440,7 @@ pfm_overflow_handler(struct task_struct * when coming from ctxsw, current still points to the * previous task, therefore we must work with task and not current. */ - pfm_set_task_notify(task); + tsk_set_notify_resume(task); } /* * defer until state is changed (shorten spin window). the context is locked From roland at redhat.com Tue Aug 7 07:13:16 2007 From: roland at redhat.com (Roland McGrath) Date: Tue, 7 Aug 2007 00:13:16 -0700 (PDT) Subject: [PATCH]utrace: IA64 RSE bug In-Reply-To: Shaohua Li's message of Tuesday, 7 August 2007 13:23:49 +0800 <1186464229.784.9.camel@sli10-conroe.sh.intel.com> Message-ID: <20070807071316.801494D04C4@magilla.localdomain> That looks reasonable to me, not being expert on the ia64 details. I rolled that into utrace-regset-ia64.patch in the 2.6.22 backport version. > patch is against latest utrace source (2.6.22 base + utrace patches) The latest utrace source is based on 2.6.23-rc2, and your patch does not apply there. TIF_NOTIFY_RESUME has been renamed to TIF_PERFMON_WORK. So you need to rename it back (or to something else) to overload it. If the gist of your patch here seems right to ia64 folks, then I'd suggest you do a patch adding ia64_sync_krbs to the upstream kernel. That would be the necessary prelude to doing the arch_ptrace_stop plan, which I can help ia64 folks with later. Thanks, Roland From eranian at hpl.hp.com Tue Aug 7 21:55:31 2007 From: eranian at hpl.hp.com (Stephane Eranian) Date: Tue, 7 Aug 2007 14:55:31 -0700 Subject: [PATCH]utrace: IA64 RSE bug In-Reply-To: <20070807071316.801494D04C4@magilla.localdomain> References: <1186464229.784.9.camel@sli10-conroe.sh.intel.com> <20070807071316.801494D04C4@magilla.localdomain> Message-ID: <20070807215531.GA23261@frankl.hpl.hp.com> Hello, On Tue, Aug 07, 2007 at 12:13:16AM -0700, Roland McGrath wrote: > That looks reasonable to me, not being expert on the ia64 details. > I rolled that into utrace-regset-ia64.patch in the 2.6.22 backport version. > > > patch is against latest utrace source (2.6.22 base + utrace patches) > > The latest utrace source is based on 2.6.23-rc2, and your patch does not > apply there. TIF_NOTIFY_RESUME has been renamed to TIF_PERFMON_WORK. > So you need to rename it back (or to something else) to overload it. > Yes, in 2.6.23, TIF_NOTIFY_RESUME has been removed because it was not used except for perfmon, thus I renamed it. The problem is that you cannot just override it because you would need a way to further demultiplex the reason for call do_notify_resume(). Perfmon not simply relies on the TIF_PERFMON_WORK bit being set. As you realize now this flag would mean you have some RSE work or some perfmon work. -- -Stephane From shaohua.li at intel.com Wed Aug 8 02:30:37 2007 From: shaohua.li at intel.com (Shaohua Li) Date: Wed, 08 Aug 2007 10:30:37 +0800 Subject: [PATCH]utrace: IA64 RSE bug In-Reply-To: <20070807071316.801494D04C4@magilla.localdomain> References: <20070807071316.801494D04C4@magilla.localdomain> Message-ID: <1186540237.16482.2.camel@sli10-conroe.sh.intel.com> On Tue, 2007-08-07 at 15:13 +0800, Roland McGrath wrote: > That looks reasonable to me, not being expert on the ia64 details. > I rolled that into utrace-regset-ia64.patch in the 2.6.22 backport > version. > > > patch is against latest utrace source (2.6.22 base + utrace patches) > > The latest utrace source is based on 2.6.23-rc2, and your patch does > not > apply there. TIF_NOTIFY_RESUME has been renamed to TIF_PERFMON_WORK. > So you need to rename it back (or to something else) to overload it. > > If the gist of your patch here seems right to ia64 folks, then I'd > suggest > you do a patch adding ia64_sync_krbs to the upstream kernel. That > would be > the necessary prelude to doing the arch_ptrace_stop plan, which I can > help > ia64 folks with later. Ok, this is the patch against 2.6.23-rc2 In ptrace case, user space RSE might be newer than kernel RSE. To avoid stale RSE is used when return to userspace, this patch synchronize user space RSE to kernel RSE. Also, as TIF_ALLWORK_MASK bits are limited, TIF_NOTIFY_RESUME is overrided. Signed-off-by: Bibo Mao Signed-off-by: Shaohua Li ============================================================== arch/ia64/kernel/perfmon.c | 21 ++-------------- arch/ia64/kernel/process.c | 14 +++++++++++ arch/ia64/kernel/ptrace.c | 52 +++++++++++++++++++++++++++++++++++++++++ include/asm-ia64/ptrace.h | 1 include/asm-ia64/thread_info.h | 11 ++++++-- 5 files changed, 78 insertions(+), 21 deletions(-) Index: 2.6.23-rc2/arch/ia64/kernel/process.c =================================================================== --- 2.6.23-rc2.orig/arch/ia64/kernel/process.c 2007-08-04 10:49:55.000000000 +0800 +++ 2.6.23-rc2/arch/ia64/kernel/process.c 2007-08-08 09:27:18.000000000 +0800 @@ -154,6 +154,17 @@ show_regs (struct pt_regs *regs) show_stack(NULL, NULL); } +void tsk_clear_notify_resume(struct task_struct *tsk) +{ +#ifdef CONFIG_PERFMON + if (tsk->thread.pfm_needs_checking) + return; +#endif + if (test_ti_thread_flag(task_thread_info(tsk), TIF_RESTORE_RSE)) + return; + clear_ti_thread_flag(task_thread_info(tsk), TIF_NOTIFY_RESUME); +} + void do_notify_resume_user (sigset_t *unused, struct sigscratch *scr, long in_syscall) { @@ -172,6 +183,9 @@ do_notify_resume_user (sigset_t *unused, /* deal with pending signal delivery */ if (test_thread_flag(TIF_SIGPENDING)||test_thread_flag(TIF_RESTORE_SIGMASK)) ia64_do_signal(scr, in_syscall); + /* copy user rbs to kernel rbs */ + if (unlikely(test_thread_flag(TIF_RESTORE_RSE))) + ia64_sync_krbs(current); } static int pal_halt = 1; Index: 2.6.23-rc2/arch/ia64/kernel/ptrace.c =================================================================== --- 2.6.23-rc2.orig/arch/ia64/kernel/ptrace.c 2007-08-08 09:00:28.000000000 +0800 +++ 2.6.23-rc2/arch/ia64/kernel/ptrace.c 2007-08-08 09:27:18.000000000 +0800 @@ -554,6 +554,25 @@ ia64_sync_user_rbs (struct task_struct * return 0; } +long +ia64_sync_kernel_rbs (struct task_struct *child, struct switch_stack *sw, + unsigned long user_rbs_start, unsigned long user_rbs_end) +{ + unsigned long addr, val; + long ret; + + /* now copy word for word from user rbs to kernel rbs: */ + for (addr = user_rbs_start; addr < user_rbs_end; addr += 8) { + if (access_process_vm(child, addr, &val, sizeof(val), 0) + != sizeof(val)) + return -EIO; + ret = ia64_poke(child, sw, user_rbs_end, addr, val); + if (ret < 0) + return ret; + } + return 0; +} + /* * Write f32-f127 back to task->thread.fph if it has been modified. */ @@ -728,6 +747,10 @@ syscall_trace_enter (long arg0, long arg if (test_thread_flag(TIF_SYSCALL_TRACE)) tracehook_report_syscall(®s, 0); + /* copy user rbs to kernel rbs */ + if (test_thread_flag(TIF_RESTORE_RSE)) + ia64_sync_krbs(current); + if (unlikely(current->audit_context)) { long syscall; int arch; @@ -764,6 +787,10 @@ syscall_trace_leave (long arg0, long arg if (test_thread_flag(TIF_SYSCALL_TRACE)) tracehook_report_syscall(®s, 1); + /* copy user rbs to kernel rbs */ + if (test_thread_flag(TIF_RESTORE_RSE)) + ia64_sync_krbs(current); + if (test_thread_flag(TIF_SINGLESTEP)) { force_sig(SIGTRAP, current); /* XXX */ tracehook_report_syscall_step(®s); @@ -1416,9 +1443,34 @@ gpregs_writeback(struct task_struct *tar const struct utrace_regset *regset, int now) { + if (test_and_set_tsk_thread_flag(target, TIF_RESTORE_RSE)) + return 0; + tsk_set_notify_resume(target); return do_regset_call(do_gpregs_writeback, target, regset, 0, 0, NULL, NULL); } +static void do_gpregs_readback(struct unw_frame_info *info, void *arg) +{ + struct pt_regs *pt; + utrace_getset_t *dst = arg; + unsigned long urbs_end; + + if (unw_unwind_to_user(info) < 0) + return; + pt = task_pt_regs(dst->target); + urbs_end = ia64_get_user_rbs_end(dst->target, pt, NULL); + dst->ret = ia64_sync_kernel_rbs(dst->target, info->sw, pt->ar_bspstore, urbs_end); +} +/* + * This is called to read back the register backing store. + */ +long ia64_sync_krbs(struct task_struct *target) +{ + clear_tsk_thread_flag(target, TIF_RESTORE_RSE); + tsk_clear_notify_resume(target); + return do_regset_call(do_gpregs_readback, target, NULL, 0, 0, NULL, NULL); +} + static int fpregs_active(struct task_struct *target, const struct utrace_regset *regset) { Index: 2.6.23-rc2/include/asm-ia64/ptrace.h =================================================================== --- 2.6.23-rc2.orig/include/asm-ia64/ptrace.h 2007-08-04 10:49:55.000000000 +0800 +++ 2.6.23-rc2/include/asm-ia64/ptrace.h 2007-08-08 09:27:18.000000000 +0800 @@ -292,6 +292,7 @@ struct switch_stack { unsigned long, long); extern void ia64_flush_fph (struct task_struct *); extern void ia64_sync_fph (struct task_struct *); + extern long ia64_sync_krbs(struct task_struct *); extern long ia64_sync_user_rbs (struct task_struct *, struct switch_stack *, unsigned long, unsigned long); Index: 2.6.23-rc2/include/asm-ia64/thread_info.h =================================================================== --- 2.6.23-rc2.orig/include/asm-ia64/thread_info.h 2007-08-04 10:49:55.000000000 +0800 +++ 2.6.23-rc2/include/asm-ia64/thread_info.h 2007-08-08 09:35:08.000000000 +0800 @@ -71,6 +71,9 @@ struct thread_info { #define alloc_task_struct() ((struct task_struct *)__get_free_pages(GFP_KERNEL | __GFP_COMP, KERNEL_STACK_SIZE_ORDER)) #define free_task_struct(tsk) free_pages((unsigned long) (tsk), KERNEL_STACK_SIZE_ORDER) +#define tsk_set_notify_resume(tsk) \ + set_ti_thread_flag(task_thread_info(tsk), TIF_NOTIFY_RESUME) +extern void tsk_clear_notify_resume(struct task_struct *tsk); #endif /* !__ASSEMBLY */ /* @@ -85,28 +88,30 @@ struct thread_info { #define TIF_SYSCALL_AUDIT 3 /* syscall auditing active */ #define TIF_SINGLESTEP 4 /* restore singlestep on return to user mode */ #define TIF_RESTORE_SIGMASK 5 /* restore signal mask in do_signal() */ -#define TIF_PERFMON_WORK 6 /* work for pfm_handle_work() */ +#define TIF_NOTIFY_RESUME 6 /* resumption notification requested */ #define TIF_POLLING_NRFLAG 16 /* true if poll_idle() is polling TIF_NEED_RESCHED */ #define TIF_MEMDIE 17 #define TIF_MCA_INIT 18 /* this task is processing MCA or INIT */ #define TIF_DB_DISABLED 19 /* debug trap disabled for fsyscall */ #define TIF_FREEZE 20 /* is freezing for suspend */ +#define TIF_RESTORE_RSE 21 /* task need RSE synchronization */ #define _TIF_SYSCALL_TRACE (1 << TIF_SYSCALL_TRACE) #define _TIF_SYSCALL_AUDIT (1 << TIF_SYSCALL_AUDIT) #define _TIF_SINGLESTEP (1 << TIF_SINGLESTEP) #define _TIF_SYSCALL_TRACEAUDIT (_TIF_SYSCALL_TRACE|_TIF_SYSCALL_AUDIT|_TIF_SINGLESTEP) #define _TIF_RESTORE_SIGMASK (1 << TIF_RESTORE_SIGMASK) -#define _TIF_PERFMON_WORK (1 << TIF_PERFMON_WORK) +#define _TIF_NOTIFY_RESUME (1 << TIF_NOTIFY_RESUME) #define _TIF_SIGPENDING (1 << TIF_SIGPENDING) #define _TIF_NEED_RESCHED (1 << TIF_NEED_RESCHED) #define _TIF_POLLING_NRFLAG (1 << TIF_POLLING_NRFLAG) #define _TIF_MCA_INIT (1 << TIF_MCA_INIT) #define _TIF_DB_DISABLED (1 << TIF_DB_DISABLED) #define _TIF_FREEZE (1 << TIF_FREEZE) +#define _TIF_RESTORE_RSE (1 << TIF_RESTORE_RSE) /* "work to do on user-return" bits */ -#define TIF_ALLWORK_MASK (_TIF_SIGPENDING|_TIF_PERFMON_WORK|_TIF_SYSCALL_AUDIT|\ +#define TIF_ALLWORK_MASK (_TIF_SIGPENDING|_TIF_NOTIFY_RESUME|_TIF_SYSCALL_AUDIT|\ _TIF_NEED_RESCHED| _TIF_SYSCALL_TRACE|\ _TIF_RESTORE_SIGMASK) /* like TIF_ALLWORK_BITS but sans TIF_SYSCALL_TRACE or TIF_SYSCALL_AUDIT */ Index: 2.6.23-rc2/arch/ia64/kernel/perfmon.c =================================================================== --- 2.6.23-rc2.orig/arch/ia64/kernel/perfmon.c 2007-08-04 10:49:55.000000000 +0800 +++ 2.6.23-rc2/arch/ia64/kernel/perfmon.c 2007-08-08 09:32:41.000000000 +0800 @@ -586,21 +586,6 @@ pfm_put_task(struct task_struct *task) } static inline void -pfm_set_task_notify(struct task_struct *task) -{ - struct thread_info *info; - - info = (struct thread_info *) ((char *) task + IA64_TASK_SIZE); - set_bit(TIF_PERFMON_WORK, &info->flags); -} - -static inline void -pfm_clear_task_notify(void) -{ - clear_thread_flag(TIF_PERFMON_WORK); -} - -static inline void pfm_reserve_page(unsigned long a) { SetPageReserved(vmalloc_to_page((void *)a)); @@ -3730,7 +3715,7 @@ pfm_restart(pfm_context_t *ctx, void *ar PFM_SET_WORK_PENDING(task, 1); - pfm_set_task_notify(task); + tsk_set_notify_resume(task); /* * XXX: send reschedule if task runs on another CPU @@ -5087,7 +5072,7 @@ pfm_handle_work(void) PFM_SET_WORK_PENDING(current, 0); - pfm_clear_task_notify(); + tsk_clear_notify_resume(current); regs = task_pt_regs(current); @@ -5455,7 +5440,7 @@ pfm_overflow_handler(struct task_struct * when coming from ctxsw, current still points to the * previous task, therefore we must work with task and not current. */ - pfm_set_task_notify(task); + tsk_set_notify_resume(task); } /* * defer until state is changed (shorten spin window). the context is locked From eranian at hpl.hp.com Wed Aug 8 23:10:59 2007 From: eranian at hpl.hp.com (Stephane Eranian) Date: Wed, 8 Aug 2007 16:10:59 -0700 Subject: [PATCH]utrace: IA64 RSE bug In-Reply-To: <1186540237.16482.2.camel@sli10-conroe.sh.intel.com> References: <20070807071316.801494D04C4@magilla.localdomain> <1186540237.16482.2.camel@sli10-conroe.sh.intel.com> Message-ID: <20070808231059.GC25934@frankl.hpl.hp.com> Hi, Well, looks like we need to resurect TIF_NOTIFY_RESUME now that we do have some actual need for it. This implies that I need to rework my perfmon new code base to take this into account. I don't want to use pfm_needs_checking. So maybe it is okay to use a higher order TIF bit for perfmon on IA-64 combined with TIF_NOTIFY_RESUME. For the RSE part, I am not sure I understand the logic behind this change. I think more explanantions maybe needed to calrify this a bit more. On Wed, Aug 08, 2007 at 10:30:37AM +0800, Shaohua Li wrote: > On Tue, 2007-08-07 at 15:13 +0800, Roland McGrath wrote: > > That looks reasonable to me, not being expert on the ia64 details. > > I rolled that into utrace-regset-ia64.patch in the 2.6.22 backport > > version. > > > > > patch is against latest utrace source (2.6.22 base + utrace patches) > > > > The latest utrace source is based on 2.6.23-rc2, and your patch does > > not > > apply there. TIF_NOTIFY_RESUME has been renamed to TIF_PERFMON_WORK. > > So you need to rename it back (or to something else) to overload it. > > > > If the gist of your patch here seems right to ia64 folks, then I'd > > suggest > > you do a patch adding ia64_sync_krbs to the upstream kernel. That > > would be > > the necessary prelude to doing the arch_ptrace_stop plan, which I can > > help > > ia64 folks with later. > Ok, this is the patch against 2.6.23-rc2 > > > In ptrace case, user space RSE might be newer than kernel RSE. To avoid > stale RSE is used when return to userspace, this patch synchronize user > space RSE to kernel RSE. > Also, as TIF_ALLWORK_MASK bits are limited, TIF_NOTIFY_RESUME is > overrided. > > Signed-off-by: Bibo Mao > Signed-off-by: Shaohua Li > > ============================================================== > arch/ia64/kernel/perfmon.c | 21 ++-------------- > arch/ia64/kernel/process.c | 14 +++++++++++ > arch/ia64/kernel/ptrace.c | 52 +++++++++++++++++++++++++++++++++++++++++ > include/asm-ia64/ptrace.h | 1 > include/asm-ia64/thread_info.h | 11 ++++++-- > 5 files changed, 78 insertions(+), 21 deletions(-) > > Index: 2.6.23-rc2/arch/ia64/kernel/process.c > =================================================================== > --- 2.6.23-rc2.orig/arch/ia64/kernel/process.c 2007-08-04 10:49:55.000000000 +0800 > +++ 2.6.23-rc2/arch/ia64/kernel/process.c 2007-08-08 09:27:18.000000000 +0800 > @@ -154,6 +154,17 @@ show_regs (struct pt_regs *regs) > show_stack(NULL, NULL); > } > > +void tsk_clear_notify_resume(struct task_struct *tsk) > +{ > +#ifdef CONFIG_PERFMON > + if (tsk->thread.pfm_needs_checking) > + return; > +#endif > + if (test_ti_thread_flag(task_thread_info(tsk), TIF_RESTORE_RSE)) > + return; > + clear_ti_thread_flag(task_thread_info(tsk), TIF_NOTIFY_RESUME); > +} > + > void > do_notify_resume_user (sigset_t *unused, struct sigscratch *scr, long in_syscall) > { > @@ -172,6 +183,9 @@ do_notify_resume_user (sigset_t *unused, > /* deal with pending signal delivery */ > if (test_thread_flag(TIF_SIGPENDING)||test_thread_flag(TIF_RESTORE_SIGMASK)) > ia64_do_signal(scr, in_syscall); > + /* copy user rbs to kernel rbs */ > + if (unlikely(test_thread_flag(TIF_RESTORE_RSE))) > + ia64_sync_krbs(current); > } > > static int pal_halt = 1; > Index: 2.6.23-rc2/arch/ia64/kernel/ptrace.c > =================================================================== > --- 2.6.23-rc2.orig/arch/ia64/kernel/ptrace.c 2007-08-08 09:00:28.000000000 +0800 > +++ 2.6.23-rc2/arch/ia64/kernel/ptrace.c 2007-08-08 09:27:18.000000000 +0800 > @@ -554,6 +554,25 @@ ia64_sync_user_rbs (struct task_struct * > return 0; > } > > +long > +ia64_sync_kernel_rbs (struct task_struct *child, struct switch_stack *sw, > + unsigned long user_rbs_start, unsigned long user_rbs_end) > +{ > + unsigned long addr, val; > + long ret; > + > + /* now copy word for word from user rbs to kernel rbs: */ > + for (addr = user_rbs_start; addr < user_rbs_end; addr += 8) { > + if (access_process_vm(child, addr, &val, sizeof(val), 0) > + != sizeof(val)) > + return -EIO; > + ret = ia64_poke(child, sw, user_rbs_end, addr, val); > + if (ret < 0) > + return ret; > + } > + return 0; > +} > + > /* > * Write f32-f127 back to task->thread.fph if it has been modified. > */ > @@ -728,6 +747,10 @@ syscall_trace_enter (long arg0, long arg > if (test_thread_flag(TIF_SYSCALL_TRACE)) > tracehook_report_syscall(®s, 0); > > + /* copy user rbs to kernel rbs */ > + if (test_thread_flag(TIF_RESTORE_RSE)) > + ia64_sync_krbs(current); > + > if (unlikely(current->audit_context)) { > long syscall; > int arch; > @@ -764,6 +787,10 @@ syscall_trace_leave (long arg0, long arg > if (test_thread_flag(TIF_SYSCALL_TRACE)) > tracehook_report_syscall(®s, 1); > > + /* copy user rbs to kernel rbs */ > + if (test_thread_flag(TIF_RESTORE_RSE)) > + ia64_sync_krbs(current); > + > if (test_thread_flag(TIF_SINGLESTEP)) { > force_sig(SIGTRAP, current); /* XXX */ > tracehook_report_syscall_step(®s); > @@ -1416,9 +1443,34 @@ gpregs_writeback(struct task_struct *tar > const struct utrace_regset *regset, > int now) > { > + if (test_and_set_tsk_thread_flag(target, TIF_RESTORE_RSE)) > + return 0; > + tsk_set_notify_resume(target); > return do_regset_call(do_gpregs_writeback, target, regset, 0, 0, NULL, NULL); > } > > +static void do_gpregs_readback(struct unw_frame_info *info, void *arg) > +{ > + struct pt_regs *pt; > + utrace_getset_t *dst = arg; > + unsigned long urbs_end; > + > + if (unw_unwind_to_user(info) < 0) > + return; > + pt = task_pt_regs(dst->target); > + urbs_end = ia64_get_user_rbs_end(dst->target, pt, NULL); > + dst->ret = ia64_sync_kernel_rbs(dst->target, info->sw, pt->ar_bspstore, urbs_end); > +} > +/* > + * This is called to read back the register backing store. > + */ > +long ia64_sync_krbs(struct task_struct *target) > +{ > + clear_tsk_thread_flag(target, TIF_RESTORE_RSE); > + tsk_clear_notify_resume(target); > + return do_regset_call(do_gpregs_readback, target, NULL, 0, 0, NULL, NULL); > +} > + > static int > fpregs_active(struct task_struct *target, const struct utrace_regset *regset) > { > Index: 2.6.23-rc2/include/asm-ia64/ptrace.h > =================================================================== > --- 2.6.23-rc2.orig/include/asm-ia64/ptrace.h 2007-08-04 10:49:55.000000000 +0800 > +++ 2.6.23-rc2/include/asm-ia64/ptrace.h 2007-08-08 09:27:18.000000000 +0800 > @@ -292,6 +292,7 @@ struct switch_stack { > unsigned long, long); > extern void ia64_flush_fph (struct task_struct *); > extern void ia64_sync_fph (struct task_struct *); > + extern long ia64_sync_krbs(struct task_struct *); > extern long ia64_sync_user_rbs (struct task_struct *, struct switch_stack *, > unsigned long, unsigned long); > > Index: 2.6.23-rc2/include/asm-ia64/thread_info.h > =================================================================== > --- 2.6.23-rc2.orig/include/asm-ia64/thread_info.h 2007-08-04 10:49:55.000000000 +0800 > +++ 2.6.23-rc2/include/asm-ia64/thread_info.h 2007-08-08 09:35:08.000000000 +0800 > @@ -71,6 +71,9 @@ struct thread_info { > #define alloc_task_struct() ((struct task_struct *)__get_free_pages(GFP_KERNEL | __GFP_COMP, KERNEL_STACK_SIZE_ORDER)) > #define free_task_struct(tsk) free_pages((unsigned long) (tsk), KERNEL_STACK_SIZE_ORDER) > > +#define tsk_set_notify_resume(tsk) \ > + set_ti_thread_flag(task_thread_info(tsk), TIF_NOTIFY_RESUME) > +extern void tsk_clear_notify_resume(struct task_struct *tsk); > #endif /* !__ASSEMBLY */ > > /* > @@ -85,28 +88,30 @@ struct thread_info { > #define TIF_SYSCALL_AUDIT 3 /* syscall auditing active */ > #define TIF_SINGLESTEP 4 /* restore singlestep on return to user mode */ > #define TIF_RESTORE_SIGMASK 5 /* restore signal mask in do_signal() */ > -#define TIF_PERFMON_WORK 6 /* work for pfm_handle_work() */ > +#define TIF_NOTIFY_RESUME 6 /* resumption notification requested */ > #define TIF_POLLING_NRFLAG 16 /* true if poll_idle() is polling TIF_NEED_RESCHED */ > #define TIF_MEMDIE 17 > #define TIF_MCA_INIT 18 /* this task is processing MCA or INIT */ > #define TIF_DB_DISABLED 19 /* debug trap disabled for fsyscall */ > #define TIF_FREEZE 20 /* is freezing for suspend */ > +#define TIF_RESTORE_RSE 21 /* task need RSE synchronization */ > > #define _TIF_SYSCALL_TRACE (1 << TIF_SYSCALL_TRACE) > #define _TIF_SYSCALL_AUDIT (1 << TIF_SYSCALL_AUDIT) > #define _TIF_SINGLESTEP (1 << TIF_SINGLESTEP) > #define _TIF_SYSCALL_TRACEAUDIT (_TIF_SYSCALL_TRACE|_TIF_SYSCALL_AUDIT|_TIF_SINGLESTEP) > #define _TIF_RESTORE_SIGMASK (1 << TIF_RESTORE_SIGMASK) > -#define _TIF_PERFMON_WORK (1 << TIF_PERFMON_WORK) > +#define _TIF_NOTIFY_RESUME (1 << TIF_NOTIFY_RESUME) > #define _TIF_SIGPENDING (1 << TIF_SIGPENDING) > #define _TIF_NEED_RESCHED (1 << TIF_NEED_RESCHED) > #define _TIF_POLLING_NRFLAG (1 << TIF_POLLING_NRFLAG) > #define _TIF_MCA_INIT (1 << TIF_MCA_INIT) > #define _TIF_DB_DISABLED (1 << TIF_DB_DISABLED) > #define _TIF_FREEZE (1 << TIF_FREEZE) > +#define _TIF_RESTORE_RSE (1 << TIF_RESTORE_RSE) > > /* "work to do on user-return" bits */ > -#define TIF_ALLWORK_MASK (_TIF_SIGPENDING|_TIF_PERFMON_WORK|_TIF_SYSCALL_AUDIT|\ > +#define TIF_ALLWORK_MASK (_TIF_SIGPENDING|_TIF_NOTIFY_RESUME|_TIF_SYSCALL_AUDIT|\ > _TIF_NEED_RESCHED| _TIF_SYSCALL_TRACE|\ > _TIF_RESTORE_SIGMASK) > /* like TIF_ALLWORK_BITS but sans TIF_SYSCALL_TRACE or TIF_SYSCALL_AUDIT */ > Index: 2.6.23-rc2/arch/ia64/kernel/perfmon.c > =================================================================== > --- 2.6.23-rc2.orig/arch/ia64/kernel/perfmon.c 2007-08-04 10:49:55.000000000 +0800 > +++ 2.6.23-rc2/arch/ia64/kernel/perfmon.c 2007-08-08 09:32:41.000000000 +0800 > @@ -586,21 +586,6 @@ pfm_put_task(struct task_struct *task) > } > > static inline void > -pfm_set_task_notify(struct task_struct *task) > -{ > - struct thread_info *info; > - > - info = (struct thread_info *) ((char *) task + IA64_TASK_SIZE); > - set_bit(TIF_PERFMON_WORK, &info->flags); > -} > - > -static inline void > -pfm_clear_task_notify(void) > -{ > - clear_thread_flag(TIF_PERFMON_WORK); > -} > - > -static inline void > pfm_reserve_page(unsigned long a) > { > SetPageReserved(vmalloc_to_page((void *)a)); > @@ -3730,7 +3715,7 @@ pfm_restart(pfm_context_t *ctx, void *ar > > PFM_SET_WORK_PENDING(task, 1); > > - pfm_set_task_notify(task); > + tsk_set_notify_resume(task); > > /* > * XXX: send reschedule if task runs on another CPU > @@ -5087,7 +5072,7 @@ pfm_handle_work(void) > > PFM_SET_WORK_PENDING(current, 0); > > - pfm_clear_task_notify(); > + tsk_clear_notify_resume(current); > > regs = task_pt_regs(current); > > @@ -5455,7 +5440,7 @@ pfm_overflow_handler(struct task_struct > * when coming from ctxsw, current still points to the > * previous task, therefore we must work with task and not current. > */ > - pfm_set_task_notify(task); > + tsk_set_notify_resume(task); > } > /* > * defer until state is changed (shorten spin window). the context is locked > - > To unsubscribe from this list: send the line "unsubscribe linux-ia64" in > the body of a message to majordomo at vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- -Stephane From shaohua.li at intel.com Thu Aug 9 02:35:44 2007 From: shaohua.li at intel.com (Shaohua Li) Date: Thu, 09 Aug 2007 10:35:44 +0800 Subject: [PATCH]utrace: IA64 RSE bug In-Reply-To: <20070808231059.GC25934@frankl.hpl.hp.com> References: <20070807071316.801494D04C4@magilla.localdomain> <1186540237.16482.2.camel@sli10-conroe.sh.intel.com> <20070808231059.GC25934@frankl.hpl.hp.com> Message-ID: <1186626944.29125.8.camel@sli10-conroe.sh.intel.com> On Thu, 2007-08-09 at 07:10 +0800, Stephane Eranian wrote: > Well, looks like we need to resurect TIF_NOTIFY_RESUME now > that we do have some actual need for it. This implies that > I need to rework my perfmon new code base to take this into > account. I don't want to use pfm_needs_checking. So maybe > it is okay to use a higher order TIF bit for perfmon on IA-64 > combined with TIF_NOTIFY_RESUME. Fine. > For the RSE part, I am not sure I understand the logic behind this > change. I think more explanantions maybe needed to calrify this > a bit more. Basically for a ptraced thread, the patch will copy user RSE to kernel RSE every time the task will be switched to user space. The reason is when the thread is stopped (ptraced), debugger might change thread's user stack, and we must avoid the RSE stored in kernel to override user stack (user space's RSE is newer in the case), so we copy user RSE to kernel first and kernel will user the newer RSE to return to user. While I said we do the synchronization when return to user space, there is an exception, which is syscall trace enter. debugger might change some syscall parameters in syscall trace enter, and kernel will use the parameters, so we do an extra synchronization for the case. Thanks, Shaohua > > On Wed, Aug 08, 2007 at 10:30:37AM +0800, Shaohua Li wrote: > > On Tue, 2007-08-07 at 15:13 +0800, Roland McGrath wrote: > > > That looks reasonable to me, not being expert on the ia64 details. > > > I rolled that into utrace-regset-ia64.patch in the 2.6.22 backport > > > version. > > > > > > > patch is against latest utrace source (2.6.22 base + utrace > patches) > > > > > > The latest utrace source is based on 2.6.23-rc2, and your patch > does > > > not > > > apply there. TIF_NOTIFY_RESUME has been renamed to > TIF_PERFMON_WORK. > > > So you need to rename it back (or to something else) to overload > it. > > > > > > If the gist of your patch here seems right to ia64 folks, then I'd > > > suggest > > > you do a patch adding ia64_sync_krbs to the upstream kernel. That > > > would be > > > the necessary prelude to doing the arch_ptrace_stop plan, which I > can > > > help > > > ia64 folks with later. > > Ok, this is the patch against 2.6.23-rc2 > > > > > > In ptrace case, user space RSE might be newer than kernel RSE. To > avoid > > stale RSE is used when return to userspace, this patch synchronize > user > > space RSE to kernel RSE. > > Also, as TIF_ALLWORK_MASK bits are limited, TIF_NOTIFY_RESUME is > > overrided. > > > > Signed-off-by: Bibo Mao > > Signed-off-by: Shaohua Li > > > > ============================================================== > > arch/ia64/kernel/perfmon.c | 21 ++-------------- > > arch/ia64/kernel/process.c | 14 +++++++++++ > > arch/ia64/kernel/ptrace.c | 52 > +++++++++++++++++++++++++++++++++++++++++ > > include/asm-ia64/ptrace.h | 1 > > include/asm-ia64/thread_info.h | 11 ++++++-- > > 5 files changed, 78 insertions(+), 21 deletions(-) > > > > Index: 2.6.23-rc2/arch/ia64/kernel/process.c > > =================================================================== > > --- 2.6.23-rc2.orig/arch/ia64/kernel/process.c 2007-08-04 > 10:49:55.000000000 +0800 > > +++ 2.6.23-rc2/arch/ia64/kernel/process.c 2007-08-08 > 09:27:18.000000000 +0800 > > @@ -154,6 +154,17 @@ show_regs (struct pt_regs *regs) > > show_stack(NULL, NULL); > > } > > > > +void tsk_clear_notify_resume(struct task_struct *tsk) > > +{ > > +#ifdef CONFIG_PERFMON > > + if (tsk->thread.pfm_needs_checking) > > + return; > > +#endif > > + if (test_ti_thread_flag(task_thread_info(tsk), > TIF_RESTORE_RSE)) > > + return; > > + clear_ti_thread_flag(task_thread_info(tsk), > TIF_NOTIFY_RESUME); > > +} > > + > > void > > do_notify_resume_user (sigset_t *unused, struct sigscratch *scr, > long in_syscall) > > { > > @@ -172,6 +183,9 @@ do_notify_resume_user (sigset_t *unused, > > /* deal with pending signal delivery */ > > if (test_thread_flag(TIF_SIGPENDING)|| > test_thread_flag(TIF_RESTORE_SIGMASK)) > > ia64_do_signal(scr, in_syscall); > > + /* copy user rbs to kernel rbs */ > > + if (unlikely(test_thread_flag(TIF_RESTORE_RSE))) > > + ia64_sync_krbs(current); > > } > > > > static int pal_halt = 1; > > Index: 2.6.23-rc2/arch/ia64/kernel/ptrace.c > > =================================================================== > > --- 2.6.23-rc2.orig/arch/ia64/kernel/ptrace.c 2007-08-08 > 09:00:28.000000000 +0800 > > +++ 2.6.23-rc2/arch/ia64/kernel/ptrace.c 2007-08-08 > 09:27:18.000000000 +0800 > > @@ -554,6 +554,25 @@ ia64_sync_user_rbs (struct task_struct * > > return 0; > > } > > > > +long > > +ia64_sync_kernel_rbs (struct task_struct *child, struct > switch_stack *sw, > > + unsigned long user_rbs_start, unsigned long > user_rbs_end) > > +{ > > + unsigned long addr, val; > > + long ret; > > + > > + /* now copy word for word from user rbs to kernel rbs: */ > > + for (addr = user_rbs_start; addr < user_rbs_end; addr += 8) > { > > + if (access_process_vm(child, addr, &val, > sizeof(val), 0) > > + != sizeof(val)) > > + return -EIO; > > + ret = ia64_poke(child, sw, user_rbs_end, addr, > val); > > + if (ret < 0) > > + return ret; > > + } > > + return 0; > > +} > > + > > /* > > * Write f32-f127 back to task->thread.fph if it has been modified. > > */ > > @@ -728,6 +747,10 @@ syscall_trace_enter (long arg0, long arg > > if (test_thread_flag(TIF_SYSCALL_TRACE)) > > tracehook_report_syscall(®s, 0); > > > > + /* copy user rbs to kernel rbs */ > > + if (test_thread_flag(TIF_RESTORE_RSE)) > > + ia64_sync_krbs(current); > > + > > if (unlikely(current->audit_context)) { > > long syscall; > > int arch; > > @@ -764,6 +787,10 @@ syscall_trace_leave (long arg0, long arg > > if (test_thread_flag(TIF_SYSCALL_TRACE)) > > tracehook_report_syscall(®s, 1); > > > > + /* copy user rbs to kernel rbs */ > > + if (test_thread_flag(TIF_RESTORE_RSE)) > > + ia64_sync_krbs(current); > > + > > if (test_thread_flag(TIF_SINGLESTEP)) { > > force_sig(SIGTRAP, current); /* XXX */ > > tracehook_report_syscall_step(®s); > > @@ -1416,9 +1443,34 @@ gpregs_writeback(struct task_struct *tar > > const struct utrace_regset *regset, > > int now) > > { > > + if (test_and_set_tsk_thread_flag(target, TIF_RESTORE_RSE)) > > + return 0; > > + tsk_set_notify_resume(target); > > return do_regset_call(do_gpregs_writeback, target, regset, 0, > 0, NULL, NULL); > > } > > > > +static void do_gpregs_readback(struct unw_frame_info *info, void > *arg) > > +{ > > + struct pt_regs *pt; > > + utrace_getset_t *dst = arg; > > + unsigned long urbs_end; > > + > > + if (unw_unwind_to_user(info) < 0) > > + return; > > + pt = task_pt_regs(dst->target); > > + urbs_end = ia64_get_user_rbs_end(dst->target, pt, NULL); > > + dst->ret = ia64_sync_kernel_rbs(dst->target, info->sw, > pt->ar_bspstore, urbs_end); > > +} > > +/* > > + * This is called to read back the register backing store. > > + */ > > +long ia64_sync_krbs(struct task_struct *target) > > +{ > > + clear_tsk_thread_flag(target, TIF_RESTORE_RSE); > > + tsk_clear_notify_resume(target); > > + return do_regset_call(do_gpregs_readback, target, NULL, 0, 0, > NULL, NULL); > > +} > > + > > static int > > fpregs_active(struct task_struct *target, const struct > utrace_regset *regset) > > { > > Index: 2.6.23-rc2/include/asm-ia64/ptrace.h > > =================================================================== > > --- 2.6.23-rc2.orig/include/asm-ia64/ptrace.h 2007-08-04 > 10:49:55.000000000 +0800 > > +++ 2.6.23-rc2/include/asm-ia64/ptrace.h 2007-08-08 > 09:27:18.000000000 +0800 > > @@ -292,6 +292,7 @@ struct switch_stack { > > unsigned long, long); > > extern void ia64_flush_fph (struct task_struct *); > > extern void ia64_sync_fph (struct task_struct *); > > + extern long ia64_sync_krbs(struct task_struct *); > > extern long ia64_sync_user_rbs (struct task_struct *, struct > switch_stack *, > > unsigned long, unsigned long); > > > > Index: 2.6.23-rc2/include/asm-ia64/thread_info.h > > =================================================================== > > --- 2.6.23-rc2.orig/include/asm-ia64/thread_info.h 2007-08-04 > 10:49:55.000000000 +0800 > > +++ 2.6.23-rc2/include/asm-ia64/thread_info.h 2007-08-08 > 09:35:08.000000000 +0800 > > @@ -71,6 +71,9 @@ struct thread_info { > > #define alloc_task_struct() ((struct task_struct > *)__get_free_pages(GFP_KERNEL | __GFP_COMP, KERNEL_STACK_SIZE_ORDER)) > > #define free_task_struct(tsk) free_pages((unsigned long) > (tsk), KERNEL_STACK_SIZE_ORDER) > > > > +#define tsk_set_notify_resume(tsk) \ > > + set_ti_thread_flag(task_thread_info(tsk), TIF_NOTIFY_RESUME) > > +extern void tsk_clear_notify_resume(struct task_struct *tsk); > > #endif /* !__ASSEMBLY */ > > > > /* > > @@ -85,28 +88,30 @@ struct thread_info { > > #define TIF_SYSCALL_AUDIT 3 /* syscall auditing active */ > > #define TIF_SINGLESTEP 4 /* restore singlestep > on return to user mode */ > > #define TIF_RESTORE_SIGMASK 5 /* restore signal mask in > do_signal() */ > > -#define TIF_PERFMON_WORK 6 /* work for pfm_handle_work() > */ > > +#define TIF_NOTIFY_RESUME 6 /* resumption notification > requested */ > > #define TIF_POLLING_NRFLAG 16 /* true if poll_idle() is > polling TIF_NEED_RESCHED */ > > #define TIF_MEMDIE 17 > > #define TIF_MCA_INIT 18 /* this task is processing MCA > or INIT */ > > #define TIF_DB_DISABLED 19 /* debug trap disabled > for fsyscall */ > > #define TIF_FREEZE 20 /* is freezing for suspend */ > > +#define TIF_RESTORE_RSE 21 /* task need RSE > synchronization */ > > > > #define _TIF_SYSCALL_TRACE (1 << TIF_SYSCALL_TRACE) > > #define _TIF_SYSCALL_AUDIT (1 << TIF_SYSCALL_AUDIT) > > #define _TIF_SINGLESTEP (1 << TIF_SINGLESTEP) > > #define _TIF_SYSCALL_TRACEAUDIT (_TIF_SYSCALL_TRACE| > _TIF_SYSCALL_AUDIT|_TIF_SINGLESTEP) > > #define _TIF_RESTORE_SIGMASK (1 << TIF_RESTORE_SIGMASK) > > -#define _TIF_PERFMON_WORK (1 << TIF_PERFMON_WORK) > > +#define _TIF_NOTIFY_RESUME (1 << TIF_NOTIFY_RESUME) > > #define _TIF_SIGPENDING (1 << TIF_SIGPENDING) > > #define _TIF_NEED_RESCHED (1 << TIF_NEED_RESCHED) > > #define _TIF_POLLING_NRFLAG (1 << TIF_POLLING_NRFLAG) > > #define _TIF_MCA_INIT (1 << TIF_MCA_INIT) > > #define _TIF_DB_DISABLED (1 << TIF_DB_DISABLED) > > #define _TIF_FREEZE (1 << TIF_FREEZE) > > +#define _TIF_RESTORE_RSE (1 << TIF_RESTORE_RSE) > > > > /* "work to do on user-return" bits */ > > -#define TIF_ALLWORK_MASK (_TIF_SIGPENDING|_TIF_PERFMON_WORK| > _TIF_SYSCALL_AUDIT|\ > > +#define TIF_ALLWORK_MASK (_TIF_SIGPENDING|_TIF_NOTIFY_RESUME| > _TIF_SYSCALL_AUDIT|\ > > _TIF_NEED_RESCHED| > _TIF_SYSCALL_TRACE|\ > > _TIF_RESTORE_SIGMASK) > > /* like TIF_ALLWORK_BITS but sans TIF_SYSCALL_TRACE or > TIF_SYSCALL_AUDIT */ > > Index: 2.6.23-rc2/arch/ia64/kernel/perfmon.c > > =================================================================== > > --- 2.6.23-rc2.orig/arch/ia64/kernel/perfmon.c 2007-08-04 > 10:49:55.000000000 +0800 > > +++ 2.6.23-rc2/arch/ia64/kernel/perfmon.c 2007-08-08 > 09:32:41.000000000 +0800 > > @@ -586,21 +586,6 @@ pfm_put_task(struct task_struct *task) > > } > > > > static inline void > > -pfm_set_task_notify(struct task_struct *task) > > -{ > > - struct thread_info *info; > > - > > - info = (struct thread_info *) ((char *) task + > IA64_TASK_SIZE); > > - set_bit(TIF_PERFMON_WORK, &info->flags); > > -} > > - > > -static inline void > > -pfm_clear_task_notify(void) > > -{ > > - clear_thread_flag(TIF_PERFMON_WORK); > > -} > > - > > -static inline void > > pfm_reserve_page(unsigned long a) > > { > > SetPageReserved(vmalloc_to_page((void *)a)); > > @@ -3730,7 +3715,7 @@ pfm_restart(pfm_context_t *ctx, void *ar > > > > PFM_SET_WORK_PENDING(task, 1); > > > > - pfm_set_task_notify(task); > > + tsk_set_notify_resume(task); > > > > /* > > * XXX: send reschedule if task runs on another CPU > > @@ -5087,7 +5072,7 @@ pfm_handle_work(void) > > > > PFM_SET_WORK_PENDING(current, 0); > > > > - pfm_clear_task_notify(); > > + tsk_clear_notify_resume(current); > > > > regs = task_pt_regs(current); > > > > @@ -5455,7 +5440,7 @@ pfm_overflow_handler(struct task_struct > > * when coming from ctxsw, current still > points to the > > * previous task, therefore we must work with > task and not current. > > */ > > - pfm_set_task_notify(task); > > + tsk_set_notify_resume(task); > > } > > /* > > * defer until state is changed (shorten spin window). > the context is locked > > - > > To unsubscribe from this list: send the line "unsubscribe > linux-ia64" in > > the body of a message to majordomo at vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > -- > > -Stephane > > From renzo at cs.unibo.it Mon Aug 13 21:02:50 2007 From: renzo at cs.unibo.it (Renzo Davoli) Date: Mon, 13 Aug 2007 23:02:50 +0200 Subject: [PATCH] some optimizations for Virtual Machines Message-ID: <20070813210250.GA8985@cs.unibo.it> Roland (and utrace-devel community), I have just completed, together with Andrea Gasparini, a first implementation of a kernel module based on utrace as a fast support for our virtualization environment (view-os/umview). The name of the module is "kmview" kernel-mode-view-os, and the user level tool will have the same name. We will (GPL) release both module and user level program as soon as possible. utrace is a wondeful and well designed tool. However, IMHO, during the implementation of kmview we have found that there are some improvements that can be done (and that we have already implemented) for a better support of virtual machines. Here are some comments, I hope you'll share our ideas and you'll insert our improvements soon in utrace's mainstream code. 1- Order of callbacks You say: Engines are called in the order they attached. It is meaningful for kernel generated events but unfortunately it does not provide a significant semantics for engine nesting when applied to report_syscall_entry. When dealing with several tracing/virtual machine tools the report_syscall_entry callbacks must be evaluated in the reverse way. As an example I tried to use strace on a view-os like virtual machine (some syscalls get virtualized). strace works but being the last engine it shows the modified calls, but the return values of the original calls. Wrong order: syscall enter: call -> VM (modification) -> strace -> kernel syscall exit: call -> VM (restore) -> strace -> kernel Right order: syscall enter: call -> strace -> VM (modification) -> kernel syscall exit: call -> VM (restore) -> strace -> kernel Reversing the attached engine list traversal for syscall_entry solves the problem. 2- Access to traced process vm. Your interface provides the call utrace_access_process_vm: it allows tracer processes to use /dev/*/mem. Unfortunately write access is denied (as stated in fs/proc/base.c): > #define mem_write NULL > #ifndef mem_write > /* This is a security hazard */ .... The /dev/*/mem way to access process vm's would be useless anyway. When I write a virtual machine support for hundreds of processes I cannot keep hundreds of open files. On the other hand I cannot open and close file for each memory access: we need fast access! I propose a new call: int utrace_access_process_vm(struct task_struct *tsk, unsigned long addr, char __user *ubuf, int len, int write, int string); which give I-O access to the memory of the process. It has about the same interface of access_process_vm (mm/memory.c) with the extra "string" option (significative only when write==0). Sometimes a read buffer can be significantly larger than the actual field used for a string. If string==1 the transfer terminates at '\0' avoiding the memory error that could arise for unallocated memory after the string (and a slight increase in performance). Prior to give access to the process vm, utrace_access_process_vm check the rights to do so using utrace_allow_access_process_vm (it has the same degree of protection of your access to /dev/*/mem). 3- In the patch I have also implemented the support for PTRACE_MULTI and PTRACE_SYSVM. These two extra features provide: -- PTRACE_MULTI: multiple PTRACE operation using one call, including data transfer of chunks of memory and registers. (it would speed up many commands, have a look of "strace strace ls", to see how many bursts of prace could collapse!). I designed this call for virtual machine support. -- PTRACE_SYSVM: can be used instead of PTRACE_SYSCALL or SYSEMU. At the end of the pre-syscall protocol it is possible to choose among three different behavior: i- call againg after the syscall (maybe some parameters gets modified by the virtualization. (like PTRACE_SYSCALL) ii- skip the upcall after the syscall but do perform the syscall (for a non virtualized call) iii- skip both the system call and the second upcall event (for a completely virtualized call). PTRACE_SYSVM almost half the number of context switches for Virtual Machines. (SYSEMU works just for total Virtual Machines, while SYSVM works also for partial Virtual Machines) There is a extensive description of SYSVM in some messages I sent some time ago on KDML. We already implemented these features on vanilla kernel, this verstion based on utrace is architecture independent. ------- THe complete patch is here: http://www.cs.unibo.it/~renzo/utrace/ Unfortunately it is against 2.4.22. I have a very slow connection to the Internet here, I'll try to update the patch to the latest kernel as soon as I return home. ciao renzo -- ============================================================================ Renzo Davoli | Dept. of Computer Science (NIC rd235, HAM IZ4DJE) | University of Bologna Tel. +39 051 2094501 | Mura Anteo Zamboni, 7 Fax. +39 051 2094510 | I-40127 Bologna ITALY Key fingerprint = A019 17E2 5562 06F6 77BB 2E93 1A01 F646 30EA B487 ============================================================================ From renzo at cs.unibo.it Mon Aug 13 21:09:29 2007 From: renzo at cs.unibo.it (Renzo Davoli) Date: Mon, 13 Aug 2007 23:09:29 +0200 Subject: [PATCH] some optimizations for Virtual Machines In-Reply-To: <20070813210250.GA8985@cs.unibo.it> References: <20070813210250.GA8985@cs.unibo.it> Message-ID: <20070813210929.GB8985@cs.unibo.it> A bugfix for my ASCII ART ;-) Wrong order (utrace behavior now): syscall enter: process -VM (modification) -> strace -> kernel syscall exit: kernel -VM (restore) -> strace -> process Right order (proposed): syscall enter: process -> strace -> VM (modification) -> kernel syscall exit: kernel -> VM (restore) -> strace -> process Sorry for this trailing errata message. renzo From roland at redhat.com Tue Aug 14 09:13:15 2007 From: roland at redhat.com (Roland McGrath) Date: Tue, 14 Aug 2007 02:13:15 -0700 (PDT) Subject: [PATCH]utrace: IA64 RSE bug In-Reply-To: Shaohua Li's message of Wednesday, 8 August 2007 10:30:37 +0800 <1186540237.16482.2.camel@sli10-conroe.sh.intel.com> Message-ID: <20070814091315.E199E4D057D@magilla.localdomain> > > If the gist of your patch here seems right to ia64 folks, then I'd > > suggest you do a patch adding ia64_sync_krbs to the upstream kernel. > > That would be the necessary prelude to doing the arch_ptrace_stop plan, > > which I can help ia64 folks with later. > Ok, this is the patch against 2.6.23-rc2 That patch is for a kernel that has "tracehook" calls, which is not the upstream kernel. If it's for 2.6.23-rc2+linux-2.6.23-rc2-utrace.patch, that is not what was requested in the bit you just quoted. To reiterate, I think ia64 folks should consider a version of your patch for vanilla 2.6.23-rc3. It sounded like there was some agreement on re-integrating the flag bit conflicts. The flag bit integration and the meat of the patch is not particular to utrace. If you do the upstream version of this patch and ia64 folks consider it reasonable, then it can go into the vanilla ia64 kernel soon. Can you do that? As I said before, that patch would be the precursor to the arch_ptrace_stop support that makes more cleanup of the vanilla kernel's ia64 ptrace code possible. (Since this would now be a patch for the vanilla ia64 kernel, it would be misleading to keep it in a thread with "utrace" in the Subject: line.) If the low-level parts of the RSE-vs-debugger scheme will be going into the vanilla kernel, then I would much prefer to have that happen and then have the utrace ia64 patch be relative to that. Thanks, Roland From renzo at cs.unibo.it Tue Aug 14 17:41:19 2007 From: renzo at cs.unibo.it (Renzo Davoli) Date: Tue, 14 Aug 2007 19:41:19 +0200 Subject: [PATCH] update: some optimizations for Virtual Machines In-Reply-To: <20070813210929.GB8985@cs.unibo.it> References: <20070813210250.GA8985@cs.unibo.it> <20070813210929.GB8985@cs.unibo.it> Message-ID: <20070814174119.GA8806@cs.unibo.it> Just a quick note to say that I have updated my patch. More precisely I have refined the virtualized syscall nesting. When there are more engines for a task and report-callbacks can change the status the quiescent state must be managed for each engine. syscall enter: process -> VM0 -> VM1 (modify) -> VM2 (second modification) -> kernel syscall exit: kernel -> VM2 (restore) -> VM1 (restore) -> VM0 -> process Both for syscall enter and exit the modification of VM2 must take place after VM1 has completed its job. If VM1 requires the quiescent state to compute its modification of the state VM2 report_syscall_entry has to wait for VM1 to finish its job. The same for restore in the opposite way. One more idea: The entry.S code provides the feature to skip the system call (by setting the syscall number to -1). This feature must be provided to VM nesting. The new patch provide the correct nesting and if VM1 (in the example) sets the syscall number to -1 it skips also VM2 report syscall report. In this case during syscall exit skips VM2, restore the syscall number and calls the report_syscall_exit from VM1 and then VM0. Maybe it is useless to call VM1 too (starting from VM0 given that the call has been skipped) but for now I do in this way for the sake of simmetry. The new patch implements this policy. Maybe the same idea (wait for quiescent state at each engine) must be applied to all the other report* that can change the status (e.g. report_signal?). renzo From renzo at cs.unibo.it Mon Aug 20 13:12:00 2007 From: renzo at cs.unibo.it (Renzo Davoli) Date: Mon, 20 Aug 2007 15:12:00 +0200 Subject: [PATCH] some optimizations for Virtual Machines In-Reply-To: <20070813210250.GA8985@cs.unibo.it> References: <20070813210250.GA8985@cs.unibo.it> Message-ID: <20070820131200.GC25747@cs.unibo.it> We have updated the patch to the latest kernel. It is here: http://www.cs.unibo.it/~renzo/utrace. > 1- Order of callbacks This patch is crucial. Without this change no virtual machines can be nested if based on utrace. > > 2- Access to traced process vm. > This patch is very important: with this change VM hypervisors can access their process memory efficiently. > > I propose a new call: > int utrace_access_process_vm(struct task_struct *tsk, unsigned long addr, char __user *ubuf, int len, int write, int string); > which give I-O access to the memory of the process. I have seen that access_process_vm has been exported to modules: it is a change included in mainstream 2.6.23 from the first rc. so I have exported access_process_vm_user too. access_process_vm_user has two main differences with access_process_vm: - it copies a memory area directly from a process vm to the user space of current and viceversa (it uses an internal one page buffer, it does not need extra buffers or extra code loops). - it supports the "string" flag for reading: no useless copies of data after the end of string, no memory errors due to short string read into large buffers. utrace_access_process_vm can be kept or not: modules can call it instead of directly accessing access_process_vm_user when they need to check that the requesting process has the right to access the other process vm. > > 3- In the patch I have also implemented the support for PTRACE_MULTI > and PTRACE_SYSVM. This patch is just useful. We'll use it to compare the performance between umview and kmview. Several ptrace based application could benefit from these features (e.g. when they need to load chunks of memory or chunks of registers, burst of ptrace calls could be sent as a single call reducing the number of mode switches). That's all for now. renzo From adobriyan at sw.ru Tue Aug 21 09:00:44 2007 From: adobriyan at sw.ru (Alexey Dobriyan) Date: Tue, 21 Aug 2007 13:00:44 +0400 Subject: [PATCH] some optimizations for Virtual Machines In-Reply-To: <20070820131200.GC25747@cs.unibo.it> References: <20070813210250.GA8985@cs.unibo.it> <20070820131200.GC25747@cs.unibo.it> Message-ID: <20070821090044.GB17761@localhost.sw.ru> On Mon, Aug 20, 2007 at 03:12:00PM +0200, Renzo Davoli wrote: > http://www.cs.unibo.it/~renzo/utrace. > > 2- Access to traced process vm. > > > This patch is very important: with this change VM hypervisors can access > their process memory efficiently. > > > > I propose a new call: > > int utrace_access_process_vm(struct task_struct *tsk, unsigned long addr, char __user *ubuf, int len, int write, int string); "string" smells like a hack, someone will come up with his favourite structure and ask for flag for copying, say, single-linked lists. :( > > which give I-O access to the memory of the process. > I have seen that access_process_vm has been exported to modules: it is a > change included in mainstream 2.6.23 from the first rc. > so I have exported access_process_vm_user too. > access_process_vm_user has two main differences with > access_process_vm: > - it copies a memory area directly from a process vm to the user space > of current and viceversa (it uses an internal one page buffer, check for allocation failure > it does > not need extra buffers or extra code loops). > - it supports the "string" flag for reading: no useless copies of data > after the end of string, no memory errors due to short string read > into large buffers. > > utrace_access_process_vm can be kept or not: modules can call it instead > of directly accessing access_process_vm_user when they need to check > that the requesting process has the right to access the other process > vm. > > > > 3- In the patch I have also implemented the support for PTRACE_MULTI > > and PTRACE_SYSVM. PTRACE_MULTI is horrible, it is asking for pain with compat version. > Several ptrace based application could > benefit from these features (e.g. when they need to load chunks of > memory or chunks of registers, burst of ptrace calls could be sent as a > single call reducing the number of mode switches). Mode switches are fast. This is a reason why read(2) doesn't have batched version, and read is called waaay more often than ptrace. I also wonder if this was tested with list debugging on: iterating over RCU protected list backwards when prev pointers are poisoned shouldn't work. From renzo at cs.unibo.it Tue Aug 21 10:52:04 2007 From: renzo at cs.unibo.it (Renzo Davoli) Date: Tue, 21 Aug 2007 12:52:04 +0200 Subject: [PATCH] some optimizations for Virtual Machines In-Reply-To: <20070821090044.GB17761@localhost.sw.ru> References: <20070813210250.GA8985@cs.unibo.it> <20070820131200.GC25747@cs.unibo.it> <20070821090044.GB17761@localhost.sw.ru> Message-ID: <20070821105204.GC28503@cs.unibo.it> > > > int utrace_access_process_vm(struct task_struct *tsk, unsigned long addr, char __user *ubuf, int len, int write, int string); > > "string" smells like a hack, someone will come up with his favourite > structure and ask for flag for copying, say, single-linked lists. :( There are no system calls asking for linked lists as parameters. Instead there are many having string! All those having a pathname. Look at this: char *s=strdup(x); fd=open (s,O_RDONLY); THis is a quiet, safe chunk of user code. What do you do to grab the value of s (from a virtual machine monitor)? With ptrace you can do a loop of PEEK_DATA one for each word of memory, when there is a NULL byte you leave the loop. IF you are designing a Virtual Machine this is just a performance suicide. You could open /proc//mem, in this case either you keep one descriptor opened for each controlled process or you open /read/close the proc file per each access. Either a scalability or a performance nightmare. if you need to go fast something like: access_process_vm(....,address_of_s, PATH_MAX ...) can fail because the s could be in the unluky position at the end of an allocated partition. > > - it copies a memory area directly from a process vm to the user space > > of current and viceversa (it uses an internal one page buffer, > > check for allocation failure You mean: actual> mm = get_task_mm(tsk); actual> if (!mm) actual> return 0; actual> actual> buf=kmalloc(PAGE_SIZE, GFP_KERNEL); must be changed as: updated> mm = get_task_mm(tsk); updated> buf=kmalloc(PAGE_SIZE, GFP_KERNEL); updated> if (!mm || !buf) updated> return 0; Okay, you're right, I'll update the patch. > > > > 3- In the patch I have also implemented the support for PTRACE_MULTI > > > and PTRACE_SYSVM. > > PTRACE_MULTI is horrible, it is asking for pain with compat version. I cannot understand. If you do not use PTRACE_MULTI, ptrace works as usual. Old ptrace do not use PTRACE_MULTI so you do not need to support PTRACE_MULTI for backward compatibility with old versions. > > Mode switches are fast. This is a reason why read(2) doesn't have > batched version, and read is called waaay more often than ptrace. readv do exist. Mode switches are fast, but having less mode switches is even faster. > > I also wonder if this was tested with list debugging on: iterating over > RCU protected list backwards when prev pointers are poisoned shouldn't > work. I do not know if the reverse scan of the list can be done better or maybe my implementation is buggy. I say that we do need that reverse traversal for SYSCALL_ENTRY otherwise it is not possible to implement nested services based on utrace. Regarding the 3 points of my original message: 1- order of calls: the patch (or a different patch implementing the same idea) is needed, otherwise the support for nested engines become meaningless when dealing with system calls virtualization. 2- access to process vm: some solution is needed for a fast access to a utraced process vm. 3- I need this patch for my project, I feel that it could speed up some other programs, but this is not so crucial as 1 and 2. renzo From roland at redhat.com Tue Aug 28 07:07:20 2007 From: roland at redhat.com (Roland McGrath) Date: Tue, 28 Aug 2007 00:07:20 -0700 (PDT) Subject: utrace known bugs In-Reply-To: Alexey Dobriyan's message of Thursday, 2 August 2007 16:31:12 +0400 <20070802123112.GD6553@localhost.sw.ru> Message-ID: <20070828070720.CCF414D04CC@magilla.localdomain> > * unbounded utrace_engine_cache growth > started from 31a9ef5cfcdbae804e3e180c158bf2352728765a, > nobody knows why > testcase: at the end of http://marc.info/?l=linux-kernel&m=117128445312243&w=2 Is it correct that this only happens with CONFIG_PREEMPT=y? > * _pointer_ to struct utrace, which I personally count as design bug. > > Rationale to fold struct utrace into task_struct is that lifetime > rules of task_struct are well established, well tested and so on. As > was demonstrated it also removes much complexity from attaching logic. Ok, I'd be happy to discuss that in a separate thread. By "known bugs" I mean the symptoms and specific holes in the current implementation. An opinion about organizing the code is a fine thing, but not itself an item for that list. Thanks, Roland From adobriyan at sw.ru Tue Aug 28 08:48:15 2007 From: adobriyan at sw.ru (Alexey Dobriyan) Date: Tue, 28 Aug 2007 12:48:15 +0400 Subject: utrace known bugs In-Reply-To: <20070828070720.CCF414D04CC@magilla.localdomain> References: <20070802123112.GD6553@localhost.sw.ru> <20070828070720.CCF414D04CC@magilla.localdomain> Message-ID: <20070828084815.GB22417@localhost.sw.ru> On Tue, Aug 28, 2007 at 12:07:20AM -0700, Roland McGrath wrote: > > * unbounded utrace_engine_cache growth > > started from 31a9ef5cfcdbae804e3e180c158bf2352728765a, > > nobody knows why > > testcase: at the end of http://marc.info/?l=linux-kernel&m=117128445312243&w=2 > > Is it correct that this only happens with CONFIG_PREEMPT=y? I rechecked, and it appears that one needs 2 cpus for this, preemption doesn't matter. In UP case there is no bug regardless of PREEMPT and in SMP case there is bug, again, regardless of PREEMPT. From roland at redhat.com Tue Aug 28 10:47:40 2007 From: roland at redhat.com (Roland McGrath) Date: Tue, 28 Aug 2007 03:47:40 -0700 (PDT) Subject: utrace known bugs In-Reply-To: Alexey Dobriyan's message of Tuesday, 28 August 2007 12:48:15 +0400 <20070828084815.GB22417@localhost.sw.ru> Message-ID: <20070828104740.847E84D04CC@magilla.localdomain> Please edit http://sourceware.org/systemtap/wiki/utrace/bugs with any updates or additions to the list of specific misbehaviors of the implementation. (Feel free to add other pages under wiki/utrace/ to keep track of any other items of interest, such as implementation design ideas.) Thanks, Roland From roland at redhat.com Wed Aug 29 10:03:23 2007 From: roland at redhat.com (Roland McGrath) Date: Wed, 29 Aug 2007 03:03:23 -0700 (PDT) Subject: [PATCH] some optimizations for Virtual Machines In-Reply-To: Renzo Davoli's message of Monday, 13 August 2007 23:02:50 +0200 <20070813210250.GA8985@cs.unibo.it> Message-ID: <20070829100323.0D51E4D04D3@magilla.localdomain> Thanks very much for your interest in utrace! > 1- Order of callbacks This subject is on the TODO list, part of "Engine interaction issues". You are not the first to bring it up. I'd like to discuss the question more broadly before considering a particular change to the behavior. I'll do this in a separate thread. I'd also encourage you to write up thoughts about things like this on the wiki (make new pages under http://sourceware.org/systemtap/wiki/utrace/), if you are so inclined. > 2- Access to traced process vm. I didn't include any interfaces for this in the utrace core, because there isn't really any need. That is, there is no direct interaction with utrace internals required. Some utrace-using kernel code can just do the memory access itself, perhaps after using utrace to get threads quiescent (and call writeback, for ia64). You can use access_process_vm, or do something in more specific code that does what it does, i.e. use get_user_pages. In recent versions, access_process_vm is no longer exported to modules (though maybe it will be again). But get_user_pages is, and access_process_vm is a pretty simple function anyway. I'm not at all against some new, convenient, exported calls for doing this. But they do not need to be part of the utrace core. I think you should just use whatever is convenient to use in your own code now. When there are a few different things satisfying their own very similar needs like this, then something will gel for new shared interfaces to use. > 3- In the patch I have also implemented the support for PTRACE_MULTI > and PTRACE_SYSVM. Personally, I am not interested in additions to the ptrace interface at all. I don't intend to talk you out of doing it. I completely sympathize with your approach of adding what you most need to the facility that is already working now, and we do not yet have any nice post-ptrace user-level interface in production. I just don't care to comment on the details and merits of any new ptrace requests. What I am very interested in are two things about your work. First, what issues come up with the utrace/tracehook calls in detail in your work? (This being things aside from the engine interaction question.) In particular, tracehook_syscall_* come to mind. Are those doing what you need? Second, what needs/desires/bottlenecks in a higher-level tracing interface do you think are particular to the VM implementation case? Doing more operations with fewer context switches is a universal and well-known need, that anyone designing a new interface is well aware of. If there are unusual needs or usage patterns peculiar to your kind of use, I'd like to get more details about those into discussion. Doing tracehook_abort_syscall and fetching those details from the registers portably is the only thing I think of off hand. For some x86 stuff, maybe there is segmentation magic to be done too? In future, I think it would be most useful if you publish your changes as a series of separate patches (a la quilt). If you make a utrace core change, put that in a separate patch. If you add some utility function, separate the patch for that. Put your ptrace additions in a patch that's just about that. Then it's easier to review each piece of the puzzle on its own merits. Thanks, Roland From roland at redhat.com Wed Aug 29 12:15:10 2007 From: roland at redhat.com (Roland McGrath) Date: Wed, 29 Aug 2007 05:15:10 -0700 (PDT) Subject: engine interaction, callback order In-Reply-To: Renzo Davoli's message of Monday, 13 August 2007 23:02:50 +0200 <20070813210250.GA8985@cs.unibo.it> Message-ID: <20070829121510.549364D04D3@magilla.localdomain> Renzo Davoli brought up the issue of the order of callbacks to different engines for the same utrace event. I know someone working on uprobes brought this up before (probably on the systemtap mailing list some time ago). In both cases, the purpose behind the practical interest in callback order is the same. Both syscall interception and breakpoint assistance do emulation or "warping" kinds of things. In the breakpoint case, it's useful to execute copied instructions, so the true PC (and other register values) to be executed in user mode can be changed from the unmolested user program. In the syscall case, user register values are tweaked to affect whether and what actual syscalls can execute. These engines temporarily change the user state that executes. So, you want them to do their work last, so that other observing engines see the unmolested user state before they tweak it. But, you want them to do their work first, so they restore the user state to the proper values to be observed before other engines look. A related issue is an engine wanting the last word on a signal's disposition. You may want to be last in report_signal and force a particular disposition (like terminate) no matter what other engines had decided about the signal. Conversely, an engine filling the core dump function (or starting the bug-reporter robot or whatever) wants to be the last resort, taking over only if any other engines (active debuggers, etc.) want to let a fatal signal proceed. Whether you want your register values in place only at the last moment (to be run but not seen), or want your signal disposition decision to go in effect only at the last moment, the problem is the same. No order of callbacks alone helps you. Another engine can leave QUIESCE set when you've finished all your callbacks and cleared QUIESCE. Then your warped register values are there to be seen by any engine doing asynchronous regset access while it has QUIESCE set. Or, the signal disposition to take effect can be changed by utrace_inject_signal (or perhaps can't, in the current implementation). You don't have any way to react to what other engines are doing. utrace_inject_signal can't do any queuing--it's injecting the particular action (possibly fatal) to be taken the very next thing, not a signal to be considered and dispatched later. But then what are you supposed to do when another engine injected first and you get -EBUSY? There isn't an opportunity to get your word in there, to examine or replace the competing disposition scheduled. These issues lead to the idea of changing how report_quiesce and report_signal work. report_signal tells you "without intervention, user mode will resume from right here and do this". Currently report_quiesce tells you that user mode might resume now if permitted, but also tells you at some various places just that user mode is not running right now and it's safe to look. (At those latter, it's not about to get back to user mode (or terminate) without passing through some more event points, though it may be working and blocking nonquiescently before then.) So perhaps rename report_signal to report_resume, and call it when dequeuing a signal and when dequeuing none and preparing to return to user mode after having stopped for QUIESCE. That is, it's called when there is the opportunity to decide the disposition of resuming (fatal or signal handler or normal). Then do away with utrace_inject_signal entirely. The EVENT(SIGNAL_*) bits say an engine wants report_resume for those dispositions, EVENT(QUIESCE) says the engine wants it no matter what. An engine uses QUIESCE to get the thread to call report_resume. Then every interested engine gets the callback, and can see the disposition choice left by the last engine, and whether it's been left in QUIESCE. With report_resume as the center of "final user status" cooperation, then we can think about some sort of callback priority order covering the needs. For the syscall emulation warping, I think it may be a happy special case that can sidestep the whole complex engine interaction subject. It may make sense to have a first-class utrace feature via return values from report_syscall callbacks, for skipping a syscall and for restoring register values to their values before kernel entry. Rather than an engine using tracehook_abort_syscall et al, so that other engines can see its changes to the user registers or pseudo-register, the utrace core would do it after all callbacks and places engines could stop and look. I'd appreciate any feedback anyone has in any of these areas. Thanks, Roland From roland at redhat.com Thu Aug 30 06:36:49 2007 From: roland at redhat.com (Roland McGrath) Date: Wed, 29 Aug 2007 23:36:49 -0700 (PDT) Subject: [PATCH] late checking of permissions during PTRACE_ATTACH In-Reply-To: Alexey Dobriyan's message of Thursday, 2 August 2007 17:08:41 +0400 <20070802130841.GE6553@localhost.sw.ru> Message-ID: <20070830063649.880FE4D04CC@magilla.localdomain> Thanks very much for your report. I'm sorry there's been such a delay before I could follow it up. > ptrace_attach() does permissions check _after_ actual attaching. Given > that utrace_attach is quite non-trivial operation there is no way such > ordering should be allowed -- the following program should crash the box > in less than second. utrace_attach is intended to be a reasonably cheap operation. Anyway, we are never too concerned about the performance of an error case. The reason ptrace_attach does utrace_attach first is to preserve the order of error diagnoses. That is, to avoid calling ptrace_may_attach at all if ptrace_attach is going to fail because someone is already attached. This keeps it consistent with vanilla ptrace. In particular, security_ptrace should not be called when ptrace_attach fails for the "already attached" or "already dead" errors. Whether the call succeeds or fails, it may trigger security logging or whatnot that should not be done for these cases. The utrace_attach, utrace_detach sequence in a failing ptrace_attach is slightly costly. But it a) shouldn't be too bad and b) is just an error case. Moreover, it always ought to work without crashes whether it's a good idea or not! Your patch fixes what's not itself a problem, and thereby masks the actual problem that needs fixing. Fortunately, I can now reproduce this problem quickly using your test case. This is the utrace_detach bug you previously identified in http://lkml.org/lkml/2007/5/8/244, which is already #1 on the wiki utrace/bugs list. I wasn't using a good test case for it before, but this case hits it. I think some reasonable fixes are straightforward, and now I can try one and test it with some confidence using this test case. Thanks, Roland From fche at redhat.com Wed Sep 5 01:49:32 2007 From: fche at redhat.com (Frank Ch. Eigler) Date: Tue, 04 Sep 2007 21:49:32 -0400 Subject: external systemtap meeting notes 20070816 In-Reply-To: (Jim Keniston's message of "Tue, 04 Sep 2007 15:31:52 -0700") References: Message-ID: jkenisto wrote [untrimmed for forwarding to utrace-devel]: > [...] > Uprobes maintains one utrace engine per probed task. When > instrumentation modules A and B register uprobes at the same address, > and that probepoint gets hit, uprobes's signal (SIGTRAP) callback gets > called (once). It calls both A's and B's handlers (a la aggregate > kprobes), then eats the SIGTRAP. > > If A and B each got its own engine, then what would uprobes's callback > do? If the callback eats the SIGTRAP, then the 2nd handler never gets > called. If it doesn't eat the SIGTRAP, the probed process takes the > SIGTRAP and dies. We could work up some new way that A and B could > coordinate this sort of thing, but then we're back to needing some sort > of shared (among instrumentation modules) data structures. And what we > have now works just fine. This is probably related to Roland's "engine interaction" note (http://tinyurl.com/22jt2o). It would be nice if concurrent engines could nest transparently - or failing that, at least be aware of each other enough to get this right. - FChE From jkenisto at us.ibm.com Wed Sep 5 21:50:16 2007 From: jkenisto at us.ibm.com (Jim Keniston) Date: Wed, 05 Sep 2007 14:50:16 -0700 Subject: engine interaction, callback order In-Reply-To: <20070829121510.549364D04D3@magilla.localdomain> References: <20070829121510.549364D04D3@magilla.localdomain> Message-ID: <1189029016.4393.41.camel@dyn9047018096.beaverton.ibm.com> On Wed, 2007-08-29 at 05:15 -0700, Roland McGrath wrote: > Renzo Davoli brought up the issue of the order of callbacks to different > engines for the same utrace event. I know someone working on uprobes > brought this up before (probably on the systemtap mailing list some time > ago). Yeah. I think one concern was that when a breakpoint is hit, a utrace client other than uprobes might see the SIGTRAP first and take some unwarranted action because it thinks that the task is going to die. ... > > These issues lead to the idea of changing how report_quiesce and > report_signal work. I'm having trouble connecting all the dots in the discussion below. Maybe if you provide clarifications on the indicated points, I can give you better feedback. > report_signal tells you "without intervention, user > mode will resume from right here and do this". Currently report_quiesce > tells you that user mode might resume now if permitted, but also tells you > at some various places just that user mode is not running right now and > it's safe to look. (At those latter, it's not about to get back to user > mode (or terminate) without passing through some more event points, though > it may be working and blocking nonquiescently before then.) I don't understand the above sentence. In particular, your use of "though" puzzles me, and I'm not sure what you mean by "before then." When is "then?" > So perhaps > rename report_signal to report_resume, and call it when dequeuing a signal > and when dequeuing none and preparing to return to user mode after having > stopped for QUIESCE. What do you mean by "and when dequeuing none?" > That is, it's called when there is the opportunity to > decide the disposition of resuming (fatal or signal handler or normal). > Then do away with utrace_inject_signal entirely. The EVENT(SIGNAL_*) bits > say an engine wants report_resume for those dispositions, EVENT(QUIESCE) > says the engine wants it no matter what. Would there still be a report_quiesce callback? Would utrace call report_quiesce when entering quiescence and report_resume when leaving quiescence? report_signal passes several signal-related args. Would those also be passed to report_resume? > An engine uses QUIESCE to get the > thread to call report_resume. Then every interested engine gets the > callback, and can see the disposition choice left by the last engine, How? Encoded in the args to report_resume? > and > whether it's been left in QUIESCE. > > ... > > > I'd appreciate any feedback anyone has in any of these areas. > > > Thanks, > Roland > From roland at redhat.com Thu Sep 6 01:32:26 2007 From: roland at redhat.com (Roland McGrath) Date: Wed, 5 Sep 2007 18:32:26 -0700 (PDT) Subject: engine interaction, callback order In-Reply-To: Jim Keniston's message of Wednesday, 5 September 2007 14:50:16 -0700 <1189029016.4393.41.camel@dyn9047018096.beaverton.ibm.com> References: <20070829121510.549364D04D3@magilla.localdomain> <1189029016.4393.41.camel@dyn9047018096.beaverton.ibm.com> Message-ID: <20070906013226.82EA94D04D3@magilla.localdomain> > I'm having trouble connecting all the dots in the discussion below. > Maybe if you provide clarifications on the indicated points, I can give > you better feedback. I'm very glad to do it! It was a bit of a brain dump, just hoping to get everything on the table and have a real discussion begin. (It would be great if someone wrote up or editted expanded explanations on the wiki.) > > report_signal tells you "without intervention, user > > mode will resume from right here and do this". Currently report_quiesce > > tells you that user mode might resume now if permitted, but also tells you > > at some various places just that user mode is not running right now and > > it's safe to look. (At those latter, it's not about to get back to user > > mode (or terminate) without passing through some more event points, though > > it may be working and blocking nonquiescently before then.) > > I don't understand the above sentence. In particular, your use of > "though" puzzles me, and I'm not sure what you mean by "before then." > When is "then?" (I won't try to explain my grammar, just my meaning. :-) I was citing the distinction between the two kinds of places report_quiesce is now called. The latter category includes places like syscall tracing. Here report_quiesce does not mean that the very next thing the thread will do is return to user mode. In fact, if UTRACE_EVENT(QUIESCE) remains set, there will always be another callback (of the former category) before it gets to user mode. However ("though"), said second callback ("then") may not occur very quickly, since the thread (e.g. performing some syscall) may do active work or uninterruptible kinds of blocking beforehand. > > So perhaps > > rename report_signal to report_resume, and call it when dequeuing a signal > > and when dequeuing none and preparing to return to user mode after having > > stopped for QUIESCE. > > What do you mean by "and when dequeuing none?" I mean when there is nothing left to do but return to user mode. The place this happens is utrace_get_signal, called from get_signal_to_deliver. What's going on there is dequeuing all pending signals and acting on them, and then returning to user mode. i.e., it's a loop, and the final iteration is the one that decides there are no pending signals to dequeue. (If a signal is fatal, the thread dies in the middle of one of those iterations and never hits the natural end condition of the loop. If a signal causes job control stop, the thread stops in the middle of one of those iterations and then picks up in the loop again after SIGCONT.) Currently report_signal gets called for each iteration but the last. The idea above is that it (now called report_resume) would be called for each iteration, including the last. > Would there still be a report_quiesce callback? Would utrace call > report_quiesce when entering quiescence and report_resume when leaving > quiescence? > > report_signal passes several signal-related args. Would those also be > passed to report_resume? Yes. When I said "rename report_signal to report_resume", I meant it literally and precisely--change the name, not the signature. The difference in its calling convention would be that the "action" argument might initially be UTRACE_ACTION_RESUME, saying that there is no pending signal at all. As with report_signal now, a callback could return UTRACE_SIGNAL_* to say what should happen instead. In the no-pending-signal case, it would have to fill in info->si_signo to be nonzero and set return_ka if it didn't want SIG_DFL, since there will have been no original signal number there to start with. (That's the use that would replace utrace_inject_signal.) > > An engine uses QUIESCE to get the > > thread to call report_resume. Then every interested engine gets the > > callback, and can see the disposition choice left by the last engine, > > How? Encoded in the args to report_resume? Yes, as described above. Thanks, Roland From roland at redhat.com Thu Sep 6 02:29:59 2007 From: roland at redhat.com (Roland McGrath) Date: Wed, 5 Sep 2007 19:29:59 -0700 (PDT) Subject: engine interaction, callback order In-Reply-To: Roland McGrath's message of Wednesday, 5 September 2007 18:32:26 -0700 <20070906013226.82EA94D04D3@magilla.localdomain> References: <20070829121510.549364D04D3@magilla.localdomain> <1189029016.4393.41.camel@dyn9047018096.beaverton.ibm.com> <20070906013226.82EA94D04D3@magilla.localdomain> Message-ID: <20070906022959.08F924D04D3@magilla.localdomain> > > Would there still be a report_quiesce callback? Would utrace call > > report_quiesce when entering quiescence and report_resume when leaving > > quiescence? Sorry, I didn't answer this part. To be decided. Something like that. Perhaps report_quiesce should go away entirely. One idea is leaving it only for the other cases when not about to resume. Then report_resume would be called when first arriving at "almost back to user mode", with its "action" giving UTRACE_ACTION_QUIESCE if the flag is set. If the UTRACE_ACTION_QUIESCE state flag is still set after those callbacks, then we go into quiescence (TASK_TRACED). When all engines have cleared QUIESCE, we wake up, and run report_resume again with UTRACE_ACTION_RESUME (or with any pending signal ready to be delivered). Or perhaps it's clearer to leave report_quiesce, have only that run before stopping while quiescent, and then run report_resume after waking. (A report_resume callback can always decide to set QUIESCE again.) The benefit of the first idea is that then report_quiesce (if kept at all) means only the "you can stop, but I'm not near user mode" cases. If you want to e.g. look at the user mode registers in their proper form, you need to wait for report_resume. So then if you get report_quiesce before report_resume, it means there may be a delay before the opportunity to get at the registers freely (e.g. it may be safe to read but not write them at report_quiesce). Thanks, Roland From roland at redhat.com Thu Sep 13 01:28:06 2007 From: roland at redhat.com (Roland McGrath) Date: Wed, 12 Sep 2007 18:28:06 -0700 (PDT) Subject: resetting MSR_SE in single_step_exception Message-ID: <20070913012806.BF1754D0412@magilla.localdomain> Hi. Can you help me understand why arch/powerpc/kernel/traps.c's single_step_exception and emulate_single_step functions clear the trace flags (MSR_SE et al) on every trace trap? The users of this bit that I know in any detail are ptrace and kprobes. It looks to me like kprobes does not expect MSR_SE to be cleared automatically, though I am not entirely certain. ptrace does not depend on MSR_SE being cleared to run correctly. That is, it always explicitly clears it before resuming for non-stepping anyway. It may be relevant that ptrace users do not see the bits set when in msr they fetch the registers. But, though a user ptrace can set the bits while the target is stopped, at least MSR_SE is always cleared on resuming it. I'm not aware of what userland debuggers change msr and what they are doing, if any do. On x86, the processor single-step flag can also be set by unprivileged instructions. User programs do this, and catch their own SIGTRAPs, for strange purposes. Is it possible to set MSR_SE/MSR_BE with unprivileged instructions on powerpc? Dave Nomura discovered this issue when using my utrace patches on powerpc. This is why I'm asking. There is no problem with the current behavior for ptrace as implemented under utrace, either. Like vanilla ptrace, in the utrace implementation, whenever a thread has been stopped for tracing and then resumes, it will explicitly set or clear single-stepping every time. However, utrace also allows uses where the single-step event is handled (recorded, whatever) without stopping; Dave is experimenting with exactly that. In this case, the thread never goes through the "stop for tracing and wake up again" logic. On other machines, the flag just stays set in the user registers until explicitly cleared by debugger action. So that's what the existing generic utrace logic expects. I'm tentatively using this short patch (more text below): diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c index d8502e3..0000000 100644 --- a/arch/powerpc/kernel/traps.c +++ b/arch/powerpc/kernel/traps.c @@ -525,7 +525,8 @@ void RunModeException(struct pt_regs *re void __kprobes single_step_exception(struct pt_regs *regs) { - regs->msr &= ~(MSR_SE | MSR_BE); /* Turn off 'trace' bits */ + if (!user_mode(regs) || !test_thread_flag(TIF_SINGLESTEP)) + regs->msr &= ~(MSR_SE | MSR_BE); /* Turn off 'trace' bits */ if (notify_die(DIE_SSTEP, "single_step", regs, 5, 5, SIGTRAP) == NOTIFY_STOP) @@ -545,7 +546,8 @@ void __kprobes single_step_exception(str static void emulate_single_step(struct pt_regs *regs) { if (single_stepping(regs)) { - clear_single_step(regs); + if (!user_mode(regs) || !test_thread_flag(TIF_SINGLESTEP)) + clear_single_step(regs); _exception(SIGTRAP, regs, TRAP_TRACE, 0); } } That is a conservative change that won't affect things like kprobes if they care. It will affect the msr seen by ptrace after PTRACE_SINGLESTEP, though not that seen when the PTRACE_SINGLESTEP'd instruction hits another signal, for example. But, I wonder if there is really a need to clear the bits here at all. Why not let whoever set them be responsible for clearing them? That's how we handle the analogous bit on x86 and some others. It seems good for overall comprehensibility to make the details of how we use similar hardware features on different hardware similar when possible. Unless there is a reason for it I don't understand, maybe these functions can change not to touch msr at all? Another alternative is that I can change utrace so that it always explicitly enables single-step anew after there has been a trap and single-step mode was left set. (In the utrace interface, it's a persistent thread state, not a one-shot command like PTRACE_SINGLESTEP.) But if we can (continue to) avoid this extra code path when not stopping, that seems preferable. Looking at this made me notice MSR_BE. The powerpc manual I have says that implementing branch tracing is optional. Do common chips in fact implement it? How can you tell if it is supported (at compile time or via runtime chip feature bits)? Thanks, Roland From ananth at in.ibm.com Thu Sep 13 06:12:20 2007 From: ananth at in.ibm.com (Ananth N Mavinakayanahalli) Date: Thu, 13 Sep 2007 11:42:20 +0530 Subject: resetting MSR_SE in single_step_exception In-Reply-To: <20070913012806.BF1754D0412@magilla.localdomain> References: <20070913012806.BF1754D0412@magilla.localdomain> Message-ID: <20070913061220.GA14776@in.ibm.com> On Wed, Sep 12, 2007 at 06:28:06PM -0700, Roland McGrath wrote: > Hi. Can you help me understand why arch/powerpc/kernel/traps.c's > single_step_exception and emulate_single_step functions clear the trace > flags (MSR_SE et al) on every trace trap? > > The users of this bit that I know in any detail are ptrace and kprobes. > It looks to me like kprobes does not expect MSR_SE to be cleared > automatically, though I am not entirely certain. On powerpc, since single_step_exception() explicitly resets MSR_SE, kprobes doesn't reset it again. However, in the post_handler, if we see the MSR_SE bit set, its deemed that some other user explicitly wants to singlestep and kprobes returns NOTIFY_DONE so the user who set the bit can be notified. This also plays nicely if someone is stepping through the code using xmon. Ananth From dcnltc at us.ibm.com Tue Sep 18 14:28:58 2007 From: dcnltc at us.ibm.com (Dave Nomura) Date: Tue, 18 Sep 2007 07:28:58 -0700 Subject: utrace on ppc not working In-Reply-To: <20070913002230.B0AE94D0412@magilla.localdomain> References: <46E821A4.20206@us.ibm.com> <20070913002230.B0AE94D0412@magilla.localdomain> Message-ID: <46EFE0AA.60808@us.ibm.com> Roland, Thanks for the quick response. I am aware of some of the special instructions that need special care when single stepping on the PPC (but can't say I really understand the details). Don't you think this is the sort of thing that utrace single-stepping should handle so that every utrace client doesn't have to figure this out? gdb identifies certain non-steppable instr sequences and does some flow analysis to set a bpt at the first steppable instr. Or, perhaps this should be done in the kernel, arch//kernel/traps.c. In the arch=powerpc cash it looks like there is code there to emulate single stepping in certain scenarios. Paul? Thanks, Dave Roland McGrath wrote: > Yes, I see the problem too and have figured it out. You were seeing only > the hits after system calls, and the first hit. It's fixed by the > following patch. I'm going to ask some ppc folks about this before > deciding for sure this is the right change. > > With the fix, your test will go into an infinite loop. That is "correct" > behavior, not any bug (but a known issue for single-stepping in general). > This is because some stdio functions use locking (even if the program is > not multithreaded), so there is a lwarx, stwcx., bne sequence. A > single-step always breaks the lwarx+stwcx. sequence so it has to repeat. > > This is a known issue for stepping atomic sequences on several kinds of > processor. I have some thoughts relating to that, but it's not an issue > with any quick fixes. If you want to discuss that, please do so on the > utrace-devel at redhat.com mailing list. > > > Thanks, > Roland > > > diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c > index d8502e3..0000000 100644 > --- a/arch/powerpc/kernel/traps.c > +++ b/arch/powerpc/kernel/traps.c > @@ -525,7 +525,8 @@ void RunModeException(struct pt_regs *re > > void __kprobes single_step_exception(struct pt_regs *regs) > { > - regs->msr &= ~(MSR_SE | MSR_BE); /* Turn off 'trace' bits */ > + if (!test_thread_flag(TIF_SINGLESTEP)) > + regs->msr &= ~(MSR_SE | MSR_BE); /* Turn off 'trace' bits */ > > if (notify_die(DIE_SSTEP, "single_step", regs, 5, > 5, SIGTRAP) == NOTIFY_STOP) > @@ -545,7 +546,8 @@ void __kprobes single_step_exception(str > static void emulate_single_step(struct pt_regs *regs) > { > if (single_stepping(regs)) { > - clear_single_step(regs); > + if (!test_thread_flag(TIF_SINGLESTEP)) > + clear_single_step(regs); > _exception(SIGTRAP, regs, TRAP_TRACE, 0); > } > } > -- Dave Nomura LTC Linux Power Toolchain From roland at redhat.com Wed Sep 26 07:59:58 2007 From: roland at redhat.com (Roland McGrath) Date: Wed, 26 Sep 2007 00:59:58 -0700 (PDT) Subject: tracing-induced signals (engine interaction, callback order) In-Reply-To: Jim Keniston's message of Wednesday, 5 September 2007 14:50:16 -0700 <1189029016.4393.41.camel@dyn9047018096.beaverton.ibm.com> References: <20070829121510.549364D04D3@magilla.localdomain> <1189029016.4393.41.camel@dyn9047018096.beaverton.ibm.com> Message-ID: <20070926075958.DF4B94D04B7@magilla.localdomain> > Yeah. I think one concern was that when a breakpoint is hit, a utrace > client other than uprobes might see the SIGTRAP first and take some > unwarranted action because it thinks that the task is going to die. There are two components of this issue. The issue is that tracing activities induce signals that would not ordinarily be raised in the program. Inserted breakpoints do this; so does single-step; so do hw breakpoint features; etc. First is the problem of interaction with signals generally. This is already a problem in vanilla ptrace. Induced signals interact with the normal functioning of the system, not just with competing tracers. A straightforward example is when the debugger suddenly dies. If it had been using single-step or suchlike, there may be a pending SIGTRAP that was already posted before the debugger exited and detached, but had not yet been processed by the debugger. This is easy to achieve with a wedged debugger, for example: it hasn't called wait4 to eat the signal yet, and then it gets killed and never does. Now the thread that was formerly traced resumes running, and delivers the SIGTRAP and dumps core. Some kernel behavior tweaks to avoid that particular scenario are easy to come up with. But that is just the simplest example of the issue. More subtle bad effects range from perturbing the signal queue resource accounting to interfering with the program's own use of blocked and pending signals. (Today it's effectively impossible to do some normal debugging activities with a program that itself catches and generates SIGTRAP.) In summary, the first component is distinguishing tracing events from real signals. The second problem is engine interaction per se. My view is that the crux of the former problem is the whole notion of overloading the signals mechanism for reporting tracing-induced events. Signals have one essential characteristic we need: they can be queued/posted quickly and safely at interrupt level, to take effect only at the safe, unentangled place just before returning to user mode. My thinking is to have a mechanism with this property separate from signals. The "extension events" item on the TODO list is such a mechanism. That feature idea is intended to serve a variety of needs. Among them is the idea to use these for machine-level events that are being traced. So rather than catching signals, an engine inducing traps would register its interest in the "single-stepped" event or the "breakpoint insn" event. In cases like the trap for a breakpoint instruction, there is nothing at the lowest level to distinguish a tracing-induced breakpoint from a user's other use of those instructions (intended to generate a SIGTRAP). So this event would be reported to interested tracing engines, and if no engine consumes it, then it can become a signal. I think the interface can be devised such that a detaching engine always gets a callback for any last events, so it's harder to write an engine that forgets to swallow the last event it induced and leaves someone to die by SIGTRAP. The mechanism to support that is a little way off. But this is how I see that going. In that context, engine interaction seems a lot simpler. Engines that induce low-level events have to sort those out among each other (probably easy from PC values and such). Engines interested in fatal signals only ever see "natural" signals. Those engines have to work out among themselves who eats the signal last. The report_resume interface and simple priorities for callback order seem like they might suffice. Thanks, Roland From roland at redhat.com Wed Sep 26 08:42:29 2007 From: roland at redhat.com (Roland McGrath) Date: Wed, 26 Sep 2007 01:42:29 -0700 (PDT) Subject: external systemtap meeting notes 20070816 In-Reply-To: Frank Ch. Eigler's message of Tuesday, 4 September 2007 21:49:32 -0400 References: Message-ID: <20070926084229.B17714D04B7@magilla.localdomain> > > If A and B each got its own engine, then what would uprobes's callback > > do? If the callback eats the SIGTRAP, then the 2nd handler never gets > > called. If it doesn't eat the SIGTRAP, the probed process takes the > > SIGTRAP and dies. We could work up some new way that A and B could > > coordinate this sort of thing, but then we're back to needing some sort > > of shared (among instrumentation modules) data structures. And what we > > have now works just fine. > > This is probably related to Roland's "engine interaction" note > (http://tinyurl.com/22jt2o). It would be nice if concurrent engines > could nest transparently - or failing that, at least be aware of each > other enough to get this right. cf https://www.redhat.com/archives/utrace-devel/2007-September/msg00007.html The future non-signal mechanism I described there can have a reporting interface that lets an engine say "I induced this" while letting other engines still see it, in case two have that same answer for a single occurrence. That is, it wouldn't cause any signal if any engines at all claimed it, yet multiple engines could always see it. This addresses the part of the breakpoint interaction problem that Jim described here. The other part of the problem is insertion/removal. Naive non-cooperation works if they literally nest, but not if removal order is not LIFO. I don't have any implicit-communication solution for that off hand. Thanks, Roland From roland at redhat.com Wed Sep 26 10:28:50 2007 From: roland at redhat.com (Roland McGrath) Date: Wed, 26 Sep 2007 03:28:50 -0700 (PDT) Subject: utrace on ppc not working In-Reply-To: Dave Nomura's message of Tuesday, 18 September 2007 07:28:58 -0700 <46EFE0AA.60808@us.ibm.com> References: <46E821A4.20206@us.ibm.com> <20070913002230.B0AE94D0412@magilla.localdomain> <46EFE0AA.60808@us.ibm.com> Message-ID: <20070926102850.DCB0B4D04B7@magilla.localdomain> > Thanks for the quick response. I am aware of some of the special > instructions that need special care when single stepping on the PPC (but > can't say I really understand the details). SMP locking is done by sequences of at least three instructions that must all three execute uninterrupted to succeed. If the sequence is interrupted for any reason, it will loop and repeat to try again. Normal single-step always interrupts the sequence, so it can never succeed. > Don't you think this is the sort of thing that utrace single-stepping > should handle so that every utrace client doesn't have to figure this > out? gdb identifies certain non-steppable instr sequences and does some > flow analysis to set a bpt at the first steppable instr. I wholeheartedly agree that this should be implemented in code shared by all engines that need it. It is already on my list. However, that does not mean that the lowest-level feature should do that. The basic simple semantics (so far always in hardware) is what UTRACE_ACTION_SINGLESTEP enables. > Or, perhaps this should be done in the kernel, > arch//kernel/traps.c. In the arch=powerpc cash it looks like there > is code there to emulate single stepping in certain scenarios. Paul? When the lowest-level implementation actually works by pure emulation, then I can see more of an argument for having that do the fancy business directly. This is not the case on powerpc. The code you're referring to has to do with emulating unsupported chip features and the like, and is far from a complete emulation. The basic implementation of single-step is just the hardware feature. I have discussed before decomposing the instruction-level work involved in the SSOOL implementations (a la kprobes and uprobes). I've talked about this for the purpose of breakpoint assistance. What we might call step assistance will be another beneficiary of that modularization. For machines without hardware single-step and where pure emulation is not desireable or easy enough to implement, step assistance can simulate it. For machines without hardware block-step, step assistance can simulate that. The facility for step assistance can identify atomic-sequence instructions so a stepper can decide to treat them specially. Stepping over an atomic sequence in the desired way is almost the same as block-step simulation. In this manner, I anticipate filling the needs for smarter stepping in a shared component that all clients wanting it will use. This approach has many complications and entanglements. It may be impractical for a very naive client to employ it in all situations without bad interactions it doesn't understand. It's not entirely trivial to implement all the pieces of the puzzle for a particular architecture. Special treatment of any instructions (such as atomic sequences) certainly breaks strict compatibility with what PTRACE_SINGLESTEP has meant before (even if that seems desireable, it may have unanticipated consequences for some existing application). For these reasons, I do not see it being integrated into the lowest level of the implementation as you suggest. Pure emulation does not suffer from most of these issues. Its performance is probably fine (taking traps is expensive too). But, on many machines it may be an excessive effort to implement and maintain with robust handling of all corners of behavior of all instructions. Thanks, Roland From fche at redhat.com Wed Sep 26 12:01:15 2007 From: fche at redhat.com (Frank Ch. Eigler) Date: Wed, 26 Sep 2007 08:01:15 -0400 Subject: external systemtap meeting notes 20070816 In-Reply-To: <20070926084229.B17714D04B7@magilla.localdomain> References: <20070926084229.B17714D04B7@magilla.localdomain> Message-ID: <20070926120115.GT8964@redhat.com> Hi - On Wed, Sep 26, 2007 at 01:42:29AM -0700, Roland McGrath wrote: > [...] The future non-signal mechanism I described there can have a > reporting interface [...] The other part of the problem is > insertion/removal. Naive non-cooperation works if they literally > nest, but not if removal order is not LIFO. I don't have any > implicit-communication solution for that off hand. Yeah, this is roughly why we pointed out some time back that the utrace layer would be well situated to provide a high-level breakpoint-related API. What do you suggest in the interim? Would this hack work: have the second utrace engine refuse to put a breakpoint wherever it suspects another engine may have put one? Or even more pessimistically, can an engine know that another one is already monitoring a given target process, and give up at attach time? (That would defeat some of the promise of utrace, but so it goes.) - FChE -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From roland at redhat.com Wed Sep 26 20:33:56 2007 From: roland at redhat.com (Roland McGrath) Date: Wed, 26 Sep 2007 13:33:56 -0700 (PDT) Subject: external systemtap meeting notes 20070816 In-Reply-To: Frank Ch. Eigler's message of Wednesday, 26 September 2007 08:01:15 -0400 <20070926120115.GT8964@redhat.com> References: <20070926084229.B17714D04B7@magilla.localdomain> <20070926120115.GT8964@redhat.com> Message-ID: <20070926203356.F26164D04B7@magilla.localdomain> > Would this hack work: have the second utrace engine refuse to put a > breakpoint wherever it suspects another engine may have put one? Sure, it's easy to notice the breakpoint instruction already there. > Or even more pessimistically, can an engine know that another one is > already monitoring a given target process, and give up at attach time? > (That would defeat some of the promise of utrace, but so it goes.) Yes, it can (UTRACE_ATTACH_EXCLUSIVE). Thanks, Roland From basavarajjm at gmail.com Thu Sep 27 13:39:35 2007 From: basavarajjm at gmail.com (Basavaraj Mathapati) Date: Thu, 27 Sep 2007 19:09:35 +0530 Subject: Required utrace patch file for suse linux kernel Message-ID: Hi, Iam new to the list Iam planning to work on uprobes, my system is running with suse linux (kernel version ?linux-2.6.16.46-0.12, archi-i386 and x86-64) . I tried with uprobe patch to linux kernel successfully, but not able to execute the programs since it's stating that utrace and utracehooks not found. As my kernel is not with utrace patch applied. Finally I tried with utrace patch downloaded form the given link inux-2.6-utrace.patch from http://people.redhat.com/roland/utrace/2.6-current/ , but some of hunks succeeded and some failed too. Failures given below: patching file fs/binfmt_elf_fdpic.c Hunk #1 FAILED at 427. 1 out of 1 hunk FAILED -- saving rejects to file fs/binfmt_elf_fdpic.c.rej patching file fs/binfmt_elf.c Hunk #1 succeeded at 1003 (offset -33 lines). patching file fs/binfmt_aout.c Hunk #1 succeeded at 445 (offset -7 lines). patching file fs/exec.c Hunk #2 succeeded at 983 (offset -167 lines). Hunk #3 succeeded at 1123 (offset -171 lines). patching file fs/proc/base.c Hunk #1 FAILED at 67. Hunk #2 succeeded at 329 with fuzz 1 (offset 178 lines). Hunk #3 succeeded at 447 with fuzz 2 (offset 214 lines). Hunk #4 succeeded at 837 with fuzz 2 (offset 275 lines). Hunk #5 succeeded at 864 (offset 275 lines). Hunk #6 FAILED at 909. 2 out of 6 hunks FAILED -- saving rejects to file fs/proc/base.c.rej patching file fs/proc/array.c Hunk #1 succeeded at 74 (offset -1 lines). Hunk #2 FAILED at 158. Hunk #3 FAILED at 178. Hunk #4 FAILED at 446. 3 out of 4 hunks FAILED -- saving rejects to file fs/proc/array.c.rej patching file fs/binfmt_som.c Hunk #1 succeeded at 272 (offset -13 lines). patching file fs/binfmt_flat.c Hunk #1 succeeded at 875 (offset -39 lines). patching file security/selinux/include/objsec.h Hunk #1 FAILED at 35. 1 out of 1 hunk FAILED -- saving rejects to file security/selinux/include/objsec.h.rej patching file security/selinux/hooks.c Hunk #1 succeeded at 22 (offset -1 lines). Hunk #2 succeeded at 154 (offset -8 lines). Hunk #3 succeeded at 1278 (offset -93 lines). Hunk #4 succeeded at 1704 (offset -143 lines). Hunk #5 succeeded at 2578 with fuzz 2 (offset -187 lines). Hunk #6 succeeded at 4094 (offset -453 lines). Hunk #7 FAILED at 4183. 1 out of 7 hunks FAILED -- saving rejects to file security/selinux/hooks.c.rej patching file arch/s390/kernel/ptrace.c Hunk #2 FAILED at 87. .........................continues So can you plz suggest me which utarce and uprobe patch to be applied to the above linux kernel? With some links to the docs on how to start Regards Basavaraj -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwielaard at redhat.com Fri Sep 28 08:36:54 2007 From: mwielaard at redhat.com (Mark Wielaard) Date: Fri, 28 Sep 2007 10:36:54 +0200 Subject: external systemtap meeting notes 20070816 In-Reply-To: <20070926203356.F26164D04B7@magilla.localdomain> References: <20070926084229.B17714D04B7@magilla.localdomain> <20070926120115.GT8964@redhat.com> <20070926203356.F26164D04B7@magilla.localdomain> Message-ID: <1190968614.3866.10.camel@dijkstra.wildebeest.org> On Wed, 2007-09-26 at 13:33 -0700, Roland McGrath wrote: > > Would this hack work: have the second utrace engine refuse to put a > > breakpoint wherever it suspects another engine may have put one? > > Sure, it's easy to notice the breakpoint instruction already there. Would the second engine then also get an event notification when the existing breakpoint gets hit? If so then it would be nice to have a mechanism to detect a breakpoint being removed so an engine could say "wait! I am still interested in monitoring when that address gets hit, please keep that breakpoint there", and then just have the first engine not be notified of any breakpoint events anymore, and only notify the second engine from then on. (This is kind of what we do in Frysk, but there of course we do all breakpoint handling in user space through one ptrace wait monitoring thread that dispatches the events, so we are in control of all breakpoint setting and breakpoint hit event detection.) Cheers, Mark