From andrew at walrond.org Fri Apr 2 06:53:31 2004 From: andrew at walrond.org (Andrew Walrond) Date: Fri, 2 Apr 2004 07:53:31 +0100 Subject: nptl compatibility with gcc, binutils and glibc.. In-Reply-To: <4.3.2.7.2.20040324092237.01ead630@mira-sjc5-4.cisco.com> References: <4.3.2.7.2.20040324092237.01ead630@mira-sjc5-4.cisco.com> Message-ID: <200404020753.31892.andrew@walrond.org>

On Wednesday 24 Mar 2004 17:24, anish sheth wrote: > hello, > > I am trying to use nptl with glibc 2.3.2 ( downloaded from gnu.org ). I am > also using gcc-3.2.3 and binutils 2.13.2 ( both from gnu.org). >

Use glibc HEAD from cvs, which includes the latest nptl
Use latest gcc (currently 3.3.3)
Use latest binutils (currently 2.15.90.0.1.1)

From Tony.Reix at bull.net Fri Apr 9 14:12:32 2004 From: Tony.Reix at bull.net (Tony Reix) Date: Fri, 9 Apr 2004 16:12:32 +0200 Subject: Web-site about NPTL tests and trace Message-ID: <200404091412.QAA237922@isatis.frec.bull.fr>

Hi, People interested in our work (adding new test/stress programs to NPTL and defining/providing an NPTL trace mechanism) should look at: http://nptl.bullopensource.org/

- Test/Stress : We have analyzed the current coverage of NPTL. Now we are defining the new tests to be added. Once these new tests are defined, volunteers are welcome to help write them!
- Tracing : We have produced a draft v0.2 describing an NPTL trace tool. Though several people have already been invited to provide comments, anyone is welcome to help us define this tool and to provide us with requirements and information.

Tony Reix Carpe Diem

From kukuk at suse.de Thu Apr 15 05:50:54 2004 From: kukuk at suse.de (Thorsten Kukuk) Date: Thu, 15 Apr 2004 07:50:54 +0200 Subject: problems with pthread_cond_broadcast Message-ID: <20040415055054.GA17099@suse.de>

Hi, I have a problem with pthread_cond_wait/pthread_cond_broadcast waiting sometimes forever on a fast SMP machine. Attached is a simple test case.

If I use the order pthread_mutex_unlock (&lock); pthread_cond_broadcast (&pcond); with NPTL, the program will hang after a short time running with current glibc + NPTL + kernel 2.6.x on all architectures I tested.

If I revert the order to pthread_cond_broadcast (&pcond); pthread_mutex_unlock (&lock); it works fine.

Is this a problem of the test case (since pthread_cond_broadcast and pthread_cond_wait will access pcond at the same time in different threads) or is this a glibc/NPTL/kernel problem?

Thanks for any hint, Thorsten -- Thorsten Kukuk http://www.suse.de/~kukuk/ kukuk at suse.de SuSE Linux AG Maxfeldstr.
5 D-90409 Nuernberg -------------------------------------------------------------------- Key fingerprint = A368 676B 5E1B 3E46 CFCE 2D97 F8FD 4E23 56C6 FB4B
-------------- next part --------------
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

#define BROKEN 1
#define DEFAULT_NUM_THREADS 64

/* n_readers >= 0 means 0 or more readers */
/* n_readers < 0 means a writer */
pthread_mutex_t lock;
int n_readers;
pthread_cond_t pcond;

void
init_rwlock (void)
{
  n_readers = 0;
  pthread_mutex_init (&lock, NULL);
  pthread_cond_init (&pcond, NULL);
}

void
rw_lock_write ()
{
  pthread_mutex_lock (&lock);
  while (n_readers != 0)
    pthread_cond_wait (&pcond, &lock);
  n_readers = -1;
  pthread_mutex_unlock (&lock);
}

void
rw_unlock_write (void)
{
  pthread_mutex_lock (&lock);
  n_readers = 0;
#if BROKEN
  pthread_mutex_unlock (&lock);
  pthread_cond_broadcast (&pcond);
#else
  pthread_cond_broadcast (&pcond);
  pthread_mutex_unlock (&lock);
#endif
}

void *
thread_fn (void *data)
{
  for (;;)
    {
      rw_lock_write ();
      fprintf (stderr, ".");
      rw_unlock_write ();
    }
  return NULL;
}

int
main (int argc, char *argv[])
{
  pthread_t *threads = NULL;
  int num_threads = DEFAULT_NUM_THREADS;
  int i;

  init_rwlock ();
  threads = malloc (num_threads * sizeof (pthread_t));
  for (i = 0; i < num_threads; i++)
    {
      assert (pthread_create (&threads[i], NULL, thread_fn, NULL) == 0);
    }
  for (i = 0; i < num_threads; i++)
    {
      pthread_join (threads[i], NULL);
    }
  return 0;
}

From sebastien.decugis at ext.bull.net Thu Apr 15 07:27:10 2004 From: sebastien.decugis at ext.bull.net (Sebastien Decugis) Date: Thu, 15 Apr 2004 09:27:10 +0200 Subject: problems with pthread_cond_broadcast In-Reply-To: <20040415055054.GA17099@suse.de> References: <20040415055054.GA17099@suse.de> Message-ID: <1082014030.1455.46.camel@decugiss.frec.bull.fr>

Hi, I think this problem is the same as: http://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=115349 and http://bugme.osdl.org/show_bug.cgi?id=2364

Moreover, POSIX says: "The pthread_cond_broadcast() or pthread_cond_signal() functions may be called by a thread whether or not it currently owns the mutex that threads calling pthread_cond_wait() or pthread_cond_timedwait() have associated with the condition variable during their waits"

So there is a bug in either NPTL or kernel futexes (FUTEX_REQUEUE ?), as your use of pthread_cond_broadcast() after pthread_mutex_unlock() is legal.

Best regards, Sebastien.

On Thu, 15/04/2004 at 07:50, Thorsten Kukuk wrote: > Hi, > > I have a problem with pthread_cond_wait/pthread_cond_broadcast > waiting sometimes forever on a fast SMP machine. Attached is a > simple test case. > > If I use the order > pthread_mutex_unlock (&lock); > pthread_cond_broadcast (&pcond); > > with NPTL, the program will hang after a short time running with > current glibc + NPTL + kernel 2.6.x on all architectures I tested. > > If I revert the order to > pthread_cond_broadcast (&pcond); > pthread_mutex_unlock (&lock); > > it works fine. > > Is this a problem of the test case (since pthread_cond_broadcast and > pthread_cond_wait will access pcond at the same time in different > threads) or is this a glibc/NPTL/kernel problem? > > Thanks for any hint, > > Thorsten

-- Sébastien DECUGIS Bull S.A.
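For reference, the two orderings under discussion reduce to the sketch below. It only restates the attached breaknptl.c for readability and reuses its globals; the function names are illustrative, not something proposed on the list. The first ordering is the one POSIX permits but that triggers the hang on the affected glibc/NPTL/kernel combinations; broadcasting before the unlock is the ordering reported to work.

#include <pthread.h>

/* Globals as defined in the attached breaknptl.c test case. */
extern pthread_mutex_t lock;
extern pthread_cond_t pcond;
extern int n_readers;

/* Ordering that hangs on the affected systems: unlock first, then
   broadcast.  POSIX allows calling pthread_cond_broadcast() without
   holding the mutex associated with the condition variable. */
void rw_unlock_write_unlock_then_broadcast (void)
{
  pthread_mutex_lock (&lock);
  n_readers = 0;
  pthread_mutex_unlock (&lock);
  pthread_cond_broadcast (&pcond);
}

/* Ordering reported to work: broadcast while still holding the mutex.
   POSIX notes this also gives predictable scheduling behaviour. */
void rw_unlock_write_broadcast_then_unlock (void)
{
  pthread_mutex_lock (&lock);
  n_readers = 0;
  pthread_cond_broadcast (&pcond);
  pthread_mutex_unlock (&lock);
}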
From kukuk at suse.de Thu Apr 15 07:30:38 2004 From: kukuk at suse.de (Thorsten Kukuk) Date: Thu, 15 Apr 2004 09:30:38 +0200 Subject: problems with pthread_cond_broadcast In-Reply-To: <1082014030.1455.46.camel@decugiss.frec.bull.fr> References: <20040415055054.GA17099@suse.de> <1082014030.1455.46.camel@decugiss.frec.bull.fr> Message-ID: <20040415073038.GA20694@suse.de>

On Thu, Apr 15, Sebastien Decugis wrote: > Hi, > > I think this problem is the same as: > http://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=115349 > and > http://bugme.osdl.org/show_bug.cgi?id=2364 > > Moreover, POSIX says: > "The pthread_cond_broadcast() or pthread_cond_signal() functions may be > called by a thread whether or not it currently owns the mutex that > threads calling pthread_cond_wait() or pthread_cond_timedwait() have > associated with the condition variable during their waits" > > So there is a bug in either NPTL or kernel futexes (FUTEX_REQUEUE ?), as > your use of pthread_cond_broadcast() after pthread_mutex_unlock() is > legal.

Yes, it is legal to call pthread_cond_broadcast() after pthread_mutex_unlock(). But the POSIX text above says "have associated with the condition variable during their waits". To me this means that those threads have already called pthread_cond_wait(), not that they are allowed to call pthread_cond_wait() at the same time as pthread_cond_broadcast() on the same condition variable.

I cannot find anything in POSIX which says that pthread_cond_wait() and pthread_cond_broadcast() are allowed to access the same variable at the same time; to me it always sounds like POSIX expects that the other threads are already waiting.

So are both functions allowed to access the same variable at the same time or not? I would expect that they are not allowed to, and I cannot find anything in POSIX that says otherwise.

Thorsten

> On Thu, 15/04/2004 at 07:50, Thorsten Kukuk wrote: > > Hi, > > > > I have a problem with pthread_cond_wait/pthread_cond_broadcast > > waiting sometimes forever on a fast SMP machine. Attached is a > > simple test case. > > > > If I use the order > > pthread_mutex_unlock (&lock); > > pthread_cond_broadcast (&pcond); > > > > with NPTL, the program will hang after a short time running with > > current glibc + NPTL + kernel 2.6.x on all architectures I tested. > > > > If I revert the order to > > pthread_cond_broadcast (&pcond); > > pthread_mutex_unlock (&lock); > > > > it works fine. > > > > Is this a problem of the test case (since pthread_cond_broadcast and > > pthread_cond_wait will access pcond at the same time in different > > threads) or is this a glibc/NPTL/kernel problem? > > > > Thanks for any hint, > > > > Thorsten > -- > Sébastien DECUGIS > Bull S.A. -- Thorsten Kukuk http://www.suse.de/~kukuk/ kukuk at suse.de SuSE Linux AG Maxfeldstr. 5 D-90409 Nuernberg -------------------------------------------------------------------- Key fingerprint = A368 676B 5E1B 3E46 CFCE 2D97 F8FD 4E23 56C6 FB4B

From slamb at slamb.org Thu Apr 15 07:43:01 2004 From: slamb at slamb.org (Scott Lamb) Date: Thu, 15 Apr 2004 02:43:01 -0500 Subject: problems with pthread_cond_broadcast In-Reply-To: <20040415055054.GA17099@suse.de> References: <20040415055054.GA17099@suse.de> Message-ID: <85784573-8EB0-11D8-9E50-000A95891440@slamb.org>

On Apr 15, 2004, at 12:50 AM, Thorsten Kukuk wrote: > > Hi, > > I have a problem with pthread_cond_wait/pthread_cond_broadcast > waiting sometimes forever on a fast SMP machine. Attached is a > simple test case.
> > If I use the order > pthread_mutex_unlock (&lock); > pthread_cond_broadcast (&pcond); > > with NPTL, the program will hang after a short time running with > current glibc + NPTL + kernel 2.6.x on all architectures I tested. > > If I revert the order to > pthread_cond_broadcast (&pcond); > pthread_mutex_unlock (&lock); > > it works fine. > > Is this a problem of the test case (since pthread_cond_broadcast and > pthread_cond_wait will access pcond at the same time in different > threads) or is this a glibc/NPTL/kernel problem?

I think there is a problem in your test case. And I _sort_of_ see it. Imagine this sequence of events:

Thread 1                  Thread 2
----------------------------------------
rw_lock_write:            rw_lock_write:
mutex_lock
n_readers = -1
mutex_unlock
                          mutex_lock
                          n_readers != 0
rw_unlock_write:
mutex_lock
n_readers = 0
mutex_unlock
cond_broadcast
                          cond_wait

Thread 2 is waiting here when it should be able to grab the rw_lock. That is wrong and would not happen with BROKEN=0. But it seems like thread 1 should be able to continue running. (And eventually, one of its broadcasts would catch another thread at the right time.) I don't know how this defect in rw_unlock_write could cause the program you've shown to stop entirely. Maybe a legitimate expert will show the full picture... or maybe it is indeed a bug in NPTL. Scott

From kukuk at suse.de Thu Apr 15 07:48:42 2004 From: kukuk at suse.de (Thorsten Kukuk) Date: Thu, 15 Apr 2004 09:48:42 +0200 Subject: problems with pthread_cond_broadcast In-Reply-To: <85784573-8EB0-11D8-9E50-000A95891440@slamb.org> References: <20040415055054.GA17099@suse.de> <85784573-8EB0-11D8-9E50-000A95891440@slamb.org> Message-ID: <20040415074842.GA9570@suse.de>

On Thu, Apr 15, Scott Lamb wrote: > I think there is a problem in your test case. And I _sort_of_ see it. > Imagine this sequence of events: >
> Thread 1                  Thread 2
> ----------------------------------------
> rw_lock_write:            rw_lock_write:
> mutex_lock
> n_readers = -1
> mutex_unlock
>                           mutex_lock
>                           n_readers != 0
> rw_unlock_write:
> mutex_lock

At this place, two threads would have mutex_lock. That should not happen and I don't see how it can happen. Thorsten -- Thorsten Kukuk http://www.suse.de/~kukuk/ kukuk at suse.de SuSE Linux AG Maxfeldstr. 5 D-90409 Nuernberg -------------------------------------------------------------------- Key fingerprint = A368 676B 5E1B 3E46 CFCE 2D97 F8FD 4E23 56C6 FB4B

From slamb at slamb.org Thu Apr 15 07:56:26 2004 From: slamb at slamb.org (Scott Lamb) Date: Thu, 15 Apr 2004 02:56:26 -0500 Subject: problems with pthread_cond_broadcast In-Reply-To: <20040415074842.GA9570@suse.de> References: <20040415055054.GA17099@suse.de> <85784573-8EB0-11D8-9E50-000A95891440@slamb.org> <20040415074842.GA9570@suse.de> Message-ID: <6539D28A-8EB2-11D8-9E50-000A95891440@slamb.org>

On Apr 15, 2004, at 2:48 AM, Thorsten Kukuk wrote: > On Thu, Apr 15, Scott Lamb wrote: > >> I think there is a problem in your test case. And I _sort_of_ see it. >> Imagine this sequence of events: >>
>> Thread 1                  Thread 2
>> ----------------------------------------
>> rw_lock_write:            rw_lock_write:
>> mutex_lock
>> n_readers = -1
>> mutex_unlock
>>                           mutex_lock
>>                           n_readers != 0
>> rw_unlock_write:
>> mutex_lock
> > At this place, two threads would have mutex_lock. That should > not happen and I don't see how it can happen.

Indeed. Sorry, I blame a cut'n'paste accident. Trying to come up with the correct sequence. I thought I had it at one point, anyway. Grr.
What I'm looking for, though, is a situation where thread 1's pthread_cond_broadcast is after thread 2's n_readers != 0 comparison yet before thread 2's pthread_cond_wait. That can't happen when the mutex is still held for the broadcast. I had 3 threads involved before, maybe that is necessary... Scott From sebastien.decugis at ext.bull.net Thu Apr 15 08:04:13 2004 From: sebastien.decugis at ext.bull.net (Sebastien Decugis) Date: Thu, 15 Apr 2004 10:04:13 +0200 Subject: problems with pthread_cond_broadcast In-Reply-To: <20040415073038.GA20694@suse.de> References: <20040415055054.GA17099@suse.de> <1082014030.1455.46.camel@decugiss.frec.bull.fr> <20040415073038.GA20694@suse.de> Message-ID: <1082016252.1455.60.camel@decugiss.frec.bull.fr> > > Moreover, POSIX says: > > "The pthread_cond_broadcast() or pthread_cond_signal() functions may be > > called by a thread whether or not it currently owns the mutex that > > threads calling pthread_cond_wait() or pthread_cond_timedwait() have > > associated with the condition variable during their waits" > > > Yes, it is legal to call pthread_cond_broadcast() after > pthread_mutex_unlock(). But POSIX above says "have > associated with the condition variable during their waits". > > This means for me, that they called already pthread_cond_wait() and > not that they are allowed to call pthread_cond_wait() at the same > time as pthread_cond_broadcast() with the same condition variable. > > I cannot find anything in POSIX which says, that it is allowed that > pthread_cond_wait() and pthread_cond_broadcast() access the same > variable at the same time, for me it always sounds like POSIX expects > that other threads are already waiting. > > So are both functions allowed to access the same variable at the > same time or not? I would expect that they are not allowed to do > so and cannot find something different in POSIX. Actually, it seems to me that there is an internal lock in the pthread_cond_t structure, so I think the pthread_cond_wait() and pthread_cond_broadcast() cannot run concurrently (but there might be a race condition I did not see). Concerning the standard, I think that only two cases are possible (assuming that the pthread_cond_XXX functions are atomic against other pthread_cond_XXX functions.): - Either pthread_cond_broadcast() is called while no other thread is waiting on the condition variable. In this case does nothing. - Or it is called while one or more threads are waiting. Then a mutex is (dynamically) associated to the cond var, and the threads shall be awaken. Please correct me if there is no such thing as this atomicity in the NPTL... -- S?bastien DECUGIS Bull S.A. Tel: 04 76 29 74 93 From dgunigun at in.ibm.com Thu Apr 15 08:10:59 2004 From: dgunigun at in.ibm.com (Dinakar Guniguntala) Date: Thu, 15 Apr 2004 13:40:59 +0530 Subject: problems with pthread_cond_broadcast Message-ID: I believe this problem has been resolved with Update 2 level of NPTL. (or the current cvs) Though I really didn't dig deep to see what the problem was Thorsten Kukuk To: phil-list at redhat.com Sent by: cc: phil-list-bounces Subject: problems with pthread_cond_broadcast @redhat.com 04/15/2004 11:20 AM Hi, I have a problem with pthread_cond_wait/pthread_cond_broadcast waiting sometimes forever on a fast SMP machine. Attached is a simple test case. If I use the order pthread_mutex_unlock (&lock); pthread_cond_broadcast (&pcond); with NPTL, the program will hang after a short time running with current glibc + NPTL + kernel 2.6.x on all architectures I tested. 
If I revert the order to pthread_cond_broadcast (&pcond); pthread_mutex_unlock (&lock); it works fine. Is this a problem of the test case (since pthread_cond_broadcast and pthread_cond_wait will access pcond at the same time in different threads) or is this a glibc/NPTL/kernel problem? Thanks for any hint, Thorsten -- Thorsten Kukuk http://www.suse.de/~kukuk/ kukuk at suse.de SuSE Linux AG Maxfeldstr. 5 D-90409 Nuernberg -------------------------------------------------------------------- Key fingerprint = A368 676B 5E1B 3E46 CFCE 2D97 F8FD 4E23 56C6 FB4B -- Phil-list mailing list Phil-list at redhat.com https://www.redhat.com/mailman/listinfo/phil-list #### breaknptl.c has been removed from this note on April 15 2004 by Dinakar Guniguntala From jakub at redhat.com Thu Apr 15 08:23:51 2004 From: jakub at redhat.com (Jakub Jelinek) Date: Thu, 15 Apr 2004 04:23:51 -0400 Subject: problems with pthread_cond_broadcast In-Reply-To: <20040415073038.GA20694@suse.de> References: <20040415055054.GA17099@suse.de> <1082014030.1455.46.camel@decugiss.frec.bull.fr> <20040415073038.GA20694@suse.de> Message-ID: <20040415082350.GK31589@devserv.devel.redhat.com> On Thu, Apr 15, 2004 at 09:30:38AM +0200, Thorsten Kukuk wrote: > > I think this problem is the same as: > > http://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=115349 > > and > > http://bugme.osdl.org/show_bug.cgi?id=2364 > > > > Moreover, POSIX says: > > "The pthread_cond_broadcast() or pthread_cond_signal() functions may be > > called by a thread whether or not it currently owns the mutex that > > threads calling pthread_cond_wait() or pthread_cond_timedwait() have > > associated with the condition variable during their waits" > > > > So there is a bug in either NPTL or kernel futexes (FUTEX_REQUEUE ?), as > > your use of pthread_cond_broadcast() after pthread_mutex_unlock() is > > legal. > > Yes, it is legal to call pthread_cond_broadcast() after > pthread_mutex_unlock(). But POSIX above says "have > associated with the condition variable during their waits". The testcase is correct from what I can see. pthread_cond_{,timed}wait must be called with the associated mutex held, while pthread_cond_{signal,broadcast} may but don't have to. The standard says that you should hold the associated mutex in pthread_cond_{signal,broadcast} if you want predictable scheduling behaviour, but that doesn't change anything on conformance when the mutex is not held. Of course you are allowed to access the same condvar with p_c_wait and pc_c_broadcast at the same time. Ulrich spent quite some time on this FUTEX_REQUEUE issue, though I don't think the bug has been discovered. So for the time being we ship RHEL3 with FUTEX_REQUEUE disabled. Jakub From lharkrider at iologics.com Wed Apr 14 23:38:37 2004 From: lharkrider at iologics.com (Harkrider, Larry) Date: Wed, 14 Apr 2004 18:38:37 -0500 Subject: NPTL in RH9 vs RHEL3 Message-ID: <008301c42279$9cc4b1e0$6300fa0a@iologics.com> I run SAPDB on Redhat9, and am trying to migrate to RHEL3. It is well known that SAPDB requires the LD_ASSUME_KERNEL=2.2.5 to run on Redhat. Using this settings, I have run SAPDB on Redhat9 with great success. The same tactic on RHEL3 results in a much slower, and less stable database. It seems that on Redhat9, the LD_ASSUME_KERNEL=2.2.5 disables floating stacks, but somehow leaves NPTL operational (as opposed to toggling linuxthreads). 
The getconf command indicates this overtly: [RH9]# getconf GNU_LIBPTHREAD_VERSION NPTL 0.29 On RHEL3, however, setting LD_ASSUME_KERNEL=2.2.5 results in nptl being disabled in favor of linuxthreads: [RHEL3]# getconf GNU_LIBPTHREAD_VERSION linuxthreads-0.10 I also see significant differences when running top on the machines. On the RH9 box, the number of SAPDB processes remains minimal. One the RHEL3 box, I see dozens of SAPDB processes. Java behaves similarly. My questions are: 1) whether or not there is something special about the RH9 glibc or kernels that allow nptl to function despite the LD_ASSUME_KERNEL=2.2.5 setting? 2) Is it possible to recompile glibc2.3.2-95.6 for RHEL3 such that it behaves in the same way? 3) Is there a more subtle way to achieve this? Can I discretely disable floating stacks without using the LD_ASSUME_KERNEL setting? Or could I use a version of glibc that has no support for linuxthreads at all, or at least is more reluctant to resort to linux threads? thanks From drepper at redhat.com Fri Apr 16 15:32:10 2004 From: drepper at redhat.com (Ulrich Drepper) Date: Fri, 16 Apr 2004 08:32:10 -0700 Subject: NPTL in RH9 vs RHEL3 In-Reply-To: <008301c42279$9cc4b1e0$6300fa0a@iologics.com> References: <008301c42279$9cc4b1e0$6300fa0a@iologics.com> Message-ID: <407FFC7A.3020303@redhat.com> Harkrider, Larry wrote: > I run SAPDB on Redhat9, and am trying to migrate to RHEL3. [...] This is no topic for this list. If you have a support contract, send the question to your designated support contact. Otherwise ask on some public list dedicated to support questions for these distributions. -- ? Ulrich Drepper ? Red Hat, Inc. ? 444 Castro St ? Mountain View, CA ? From jakub at redhat.com Fri Apr 16 15:35:44 2004 From: jakub at redhat.com (Jakub Jelinek) Date: Fri, 16 Apr 2004 11:35:44 -0400 Subject: NPTL in RH9 vs RHEL3 In-Reply-To: <008301c42279$9cc4b1e0$6300fa0a@iologics.com> References: <008301c42279$9cc4b1e0$6300fa0a@iologics.com> Message-ID: <20040416153542.GQ31589@devserv.devel.redhat.com> On Wed, Apr 14, 2004 at 06:38:37PM -0500, Harkrider, Larry wrote: > I run SAPDB on Redhat9, and am trying to migrate to RHEL3. It is well known > that SAPDB requires the LD_ASSUME_KERNEL=2.2.5 to run on Redhat. Using this > settings, I have run SAPDB on Redhat9 with great success. The same tactic > on RHEL3 results in a much slower, and less stable database. > > It seems that on Redhat9, the LD_ASSUME_KERNEL=2.2.5 disables floating > stacks, but somehow leaves NPTL operational (as opposed to toggling > linuxthreads). The getconf command indicates this overtly: > > [RH9]# getconf GNU_LIBPTHREAD_VERSION > NPTL 0.29 > > On RHEL3, however, setting LD_ASSUME_KERNEL=2.2.5 results in nptl being > disabled in favor of linuxthreads: > > [RHEL3]# getconf GNU_LIBPTHREAD_VERSION > linuxthreads-0.10 LD_ASSUME_KERNEL=2.2.5 env variable setting is exactly about forcing non-FLOATING_STACKS linuxthreads. On RHL9 I get: rpm -q glibc; LD_ASSUME_KERNEL=2.2.5 getconf GNU_LIBPTHREAD_VERSION; getconf GNU_LIBPTHREAD_VERSION glibc-2.3.2-27.9.6 linuxthreads-0.10 NPTL 0.34 and RHEL3: rpm -q glibc; LD_ASSUME_KERNEL=2.2.5 getconf GNU_LIBPTHREAD_VERSION; getconf GNU_LIBPTHREAD_VERSION glibc-2.3.2-95.20 linuxthreads-0.10 NPTL 0.60 There is no difference in this regard. If NPTL is used on RHL9 with LD_ASSUME_KERNEL=2.2.5, then something is broken with your setup, but if that works for you, then you shouldn't actually use LD_ASSUME_KERNEL. > I also see significant differences when running top on the machines. 
On the > RH9 box, the number of SAPDB processes remains minimal. On the RHEL3 box, > I see dozens of SAPDB processes. Java behaves similarly.

NPTL threads share the same pid, while LinuxThreads uses one pid for each thread. Jakub

From sdeokulecluster at yahoo.com Tue Apr 27 21:27:16 2004 From: sdeokulecluster at yahoo.com (Sameer Suhas Deokule) Date: Tue, 27 Apr 2004 14:27:16 -0700 (PDT) Subject: Q about VSZ wrt NPTL Message-ID: <20040427212716.36212.qmail@web13122.mail.yahoo.com>

Where can I find more info about the following? >there might be another effect. If this is the thread stack (is it?), then >Linux will lazy-allocate the pages mapped by it, and NPTL will cycle the >stacks (ie. instead of munmap()-ing them, they get cached). I don't >remember the exact thresholds NPTL is using for caching stacks. In any >case, the RSS of the JVM process/threads should show the exact amount of >memory allocated.

I am observing that for an application using 10 threads (using nptl on rhel3.0) the pmap for the application shows 10 chunks of 10240K each. Would it be correct to assume that each such chunk corresponds to the thread's stack? Also, the VSZ for the app process is in excess of 100 MB. Comparing this with the application instance on Solaris, the VSZ is never in excess of 10 MB under similar conditions. The RSS for the app is comparable on both Solaris and rhel3.0. Any more details explaining this would be appreciated. thanks Sameer

From sebastien.decugis at ext.bull.net Thu Apr 29 12:36:44 2004 From: sebastien.decugis at ext.bull.net (Sebastien Decugis) Date: Thu, 29 Apr 2004 14:36:44 +0200 Subject: Hang in pthread_cond_wait Message-ID: <1083242204.1451.205.camel@decugiss.frec.bull.fr>

Hi, I think the futex_requeue feature used in pthread_cond_broadcast can lead to a hang. Please consider the following sequence:

Thread A:
pthread_mutex_lock(&mutex);
pthread_mutex_unlock(&mutex);
pthread_cond_broadcast(&cond); /* please note that this use of pthread_cond_broadcast is legal according to POSIX */

Thread B and C:
pthread_mutex_lock(&mutex);
pthread_cond_wait(&cond, &mutex);
pthread_mutex_unlock(&mutex);

------------------------
C: locks the mutex
C: enters cond_wait
C: locks cond->__lock
C: releases the mutex
C: cond->total_seq = 1
C: val=seq=0
C: unlocks cond->__lock
C: futex_wait (@=cond->wake_up)
A: locks the mutex
A: unlocks the mutex
A: enters pthread_cond_broadcast
A: locks cond->__lock
A: cond->wake_up=cond_total_seq ( == 1)
A: unlocks cond->__lock
B: locks the mutex
B: enters cond_wait
B: locks cond->__lock
B: releases the mutex
B: cond->total_seq = 2
B: val=seq=1
B: unlocks cond->__lock
B: futex_wait (@=cond->wake_up)
A: futex_requeue => thread B is awakened, thread C is requeued on the mutex.
A: will try to lock the mutex on the next loop
B: locks cond->__lock
B: as seq == cond->wake_up, we loop inside the function
B: unlocks cond->__lock
B: futex_wait (@=cond->wake_up)

All three threads are now waiting.

The only workaround I can think of is to remove the FUTEX_REQUEUE call from the broadcast function and always do a FUTEX_WAKE (ALL) instead. This might be bad for performances but will avoid such hangs. Please let me know if this sequence is incorrect (and why), as I don't know the internals of futexes. Thanks, Sebastien. -- Sébastien DECUGIS Bull S.A.
NPTL Tests & Trace project http://nptl.bullopensource.org/phpBB/ From sebastien.decugis at ext.bull.net Thu Apr 29 13:56:02 2004 From: sebastien.decugis at ext.bull.net (Sebastien Decugis) Date: Thu, 29 Apr 2004 15:56:02 +0200 Subject: Hang in pthread_cond_wait In-Reply-To: <1083242204.1451.205.camel@decugiss.frec.bull.fr> References: <1083242204.1451.205.camel@decugiss.frec.bull.fr> Message-ID: <1083246961.1451.212.camel@decugiss.frec.bull.fr> > Both 3 threads are now waiting. > I just realized that thread A is not hung at the end of the sequence... but however, this thread won't deal with the mutex anymore so the other threads are really hung. From dgunigun at in.ibm.com Thu Apr 29 14:16:04 2004 From: dgunigun at in.ibm.com (Dinakar Guniguntala) Date: Thu, 29 Apr 2004 19:46:04 +0530 Subject: Hang in pthread_cond_wait Message-ID: > The only workaround I can think of is to remove the FUTEX_REQUEUE call > from the broadcast function and always do a FUTEX_WAKE (ALL) instead. > This might be bad for performances but will avoid such hangs. Another fix for this problem is to hold the internal condvar lock while calling futex_requeue. However the performance numbers for both are about the same with the second fix (holding lock while calling requeue) getting slightly better numbers. (I measured the time for 64 threads to finish a given job for both fixes) Regards, Dinakar From sdeokulecluster at yahoo.com Thu Apr 29 19:37:55 2004 From: sdeokulecluster at yahoo.com (Sameer Suhas Deokule) Date: Thu, 29 Apr 2004 12:37:55 -0700 (PDT) Subject: Q about VSZ wrt NPTL Message-ID: <20040429193755.58349.qmail@web13124.mail.yahoo.com> Maybe I should have been clearer and phrased my question as :- How was the default stack size of 10MB arrived at ? Any insight would be appreciated. thanks --------------------------------- Do you Yahoo!? Win a $20,000 Career Makeover at Yahoo! HotJobs -------------- next part -------------- An HTML attachment was scrubbed... URL: From arjanv at redhat.com Thu Apr 29 19:51:44 2004 From: arjanv at redhat.com (Arjan van de Ven) Date: Thu, 29 Apr 2004 21:51:44 +0200 Subject: Q about VSZ wrt NPTL In-Reply-To: <20040429193755.58349.qmail@web13124.mail.yahoo.com> References: <20040429193755.58349.qmail@web13124.mail.yahoo.com> Message-ID: <20040429195144.GA19399@devserv.devel.redhat.com> On Thu, Apr 29, 2004 at 12:37:55PM -0700, Sameer Suhas Deokule wrote: > Maybe I should have been clearer and phrased my question as :- > > How was the default stack size of 10MB arrived at ? Any insight would be > appreciated. it's the default ulimit for stacksize in RHEL3, which glibc copies for thread stacks -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From sdeokulecluster at yahoo.com Fri Apr 30 03:08:54 2004 From: sdeokulecluster at yahoo.com (Sameer Suhas Deokule) Date: Thu, 29 Apr 2004 20:08:54 -0700 (PDT) Subject: Q about VSZ wrt NPTL In-Reply-To: <20040429195144.GA19399@devserv.devel.redhat.com> Message-ID: <20040430030854.34195.qmail@web13122.mail.yahoo.com> I am looking for information/answer as to any rationale behind choosing the default stacksize to be 10MB in RHEL3.0. i.e why 10. why not 1 MB or some other value ? Is it related to the processor OR the typical recursion nesting OR something else....? 
thanks, Sameer

Arjan van de Ven wrote: On Thu, Apr 29, 2004 at 12:37:55PM -0700, Sameer Suhas Deokule wrote: > Maybe I should have been clearer and phrased my question as :- > > How was the default stack size of 10MB arrived at ? Any insight would be > appreciated. it's the default ulimit for stacksize in RHEL3, which glibc copies for thread stacks

From arjanv at redhat.com Fri Apr 30 07:44:25 2004 From: arjanv at redhat.com (Arjan van de Ven) Date: Fri, 30 Apr 2004 09:44:25 +0200 Subject: Q about VSZ wrt NPTL In-Reply-To: <20040430030854.34195.qmail@web13122.mail.yahoo.com> References: <20040429195144.GA19399@devserv.devel.redhat.com> <20040430030854.34195.qmail@web13122.mail.yahoo.com> Message-ID: <20040430074425.GC25378@devserv.devel.redhat.com>

On Thu, Apr 29, 2004 at 08:08:54PM -0700, Sameer Suhas Deokule wrote: > I am looking for information/answer as to any rationale behind choosing the default > stacksize to be 10MB in RHEL3.0. i.e why 10. why not 1 MB or some other > value ? Is it related to the processor OR the typical recursion nesting OR > something else....?

For single-threaded applications (the ones the kernel sets up the stack for), Linux has historically always picked 8 MB as the stack ulimit; in RHEL3 we needed to extend that somewhat to give a net stack of 8 MB (we randomize the stack pointer somewhat).

I assume you are aware of the POSIX function you can call before you create threads that sets the stack size for those threads; if you really care about stack size I would suggest always calling that rather than depending on the OS default.
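The POSIX interface Arjan is referring to is the thread stack-size attribute, set with pthread_attr_setstacksize() before pthread_create(). A minimal sketch follows; the 256 KiB figure is purely illustrative (it must be at least PTHREAD_STACK_MIN), not a value recommended anywhere in this thread.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

static void *worker (void *arg)
{
  /* ... thread work ... */
  return NULL;
}

int main (void)
{
  pthread_attr_t attr;
  pthread_t tid;

  pthread_attr_init (&attr);

  /* Request an explicit per-thread stack instead of inheriting the
     stack ulimit (8-10 MB by default, as discussed above).  256 KiB
     here is only an example; pick a value that fits your workload. */
  if (pthread_attr_setstacksize (&attr, 256 * 1024) != 0)
    {
      fprintf (stderr, "pthread_attr_setstacksize failed\n");
      exit (1);
    }

  if (pthread_create (&tid, &attr, worker, NULL) != 0)
    {
      fprintf (stderr, "pthread_create failed\n");
      exit (1);
    }

  pthread_join (tid, NULL);
  pthread_attr_destroy (&attr);
  return 0;
}

With an explicit stack size, the per-thread 10240K mappings visible in pmap shrink accordingly, which accounts for most of the VSZ difference against Solaris noted earlier in the thread.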