From drepper at redhat.com Thu Feb 19 01:06:05 2004 From: drepper at redhat.com (Ulrich Drepper) Date: Wed, 18 Feb 2004 17:06:05 -0800 Subject: [Patch] Fix error path of pthread_cond_timedwait In-Reply-To: References: Message-ID: <40340BFD.8060102@redhat.com> Dinakar Guniguntala wrote: > I had come up with the same fix that you have now in CVS but I found > that it can lead to the following sequence of events > [...] This is indeed a valid scenario. I've changed all the implementations. Thanks, -- · Ulrich Drepper · Red Hat, Inc. · 444 Castro St · Mountain View, CA · From inaky.perez-gonzalez at intel.com Thu Feb 19 02:10:56 2004 From: inaky.perez-gonzalez at intel.com (Perez-Gonzalez, Inaky) Date: Wed, 18 Feb 2004 18:10:56 -0800 Subject: Thread starvation with mutex Message-ID: > From: Jamie Lokier [mailto:jamie at shareable.org] > Perez-Gonzalez, Inaky wrote: > > > If you have strict ownership transferal _and_ priority sorted wake ups > > in the kernel, then that problem should not be an issue at all, > > Yes, if you have both those things. > > I was thinking of an old RT futex patch which simply offered priority > sorted wakeups, and was not sure if that's what this thread's question > about "RTNPTL offering complete RT support" referred to. Me neither [puzzled and confused again] -- let's say that old RT futex patch provided the foundation for one of the things RT needs (wake up or unlock by priority order). > The down side is that if you always have strict ownership transferral, > you get very poor performance in a large class of algorithms which > take and release locks regularly - such as producer-consumer queuing > to pick a classic one. And that's why it is made optional or switchable, so you can turn it on or off depending on the need. In the other thread I will explain all the details, so we can kill this one... Iñaky Pérez-González -- Not speaking for Intel -- all opinions are my own (and my fault) From drepper at redhat.com Thu Feb 19 04:27:40 2004 From: drepper at redhat.com (Ulrich Drepper) Date: Wed, 18 Feb 2004 20:27:40 -0800 Subject: Barriers hanging In-Reply-To: <1076944786.1436.26.camel@decugiss.frec.bull.fr> References: <1076944786.1436.26.camel@decugiss.frec.bull.fr> Message-ID: <40343B3C.1030805@redhat.com> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Sebastien Decugis wrote: > NPTL barriers have a weakness in the pthread_barrier_wait() and > pthread_barrier_destroy() functions. You call it a weakness because you want to depend on other functionality. In reality, the only weak code is yours since you depend on non-standard functionality. The proposed changes really look bad. You wake all the threads just to have them run into the next lock. The scheduling of this is killing the performance, especially with many threads and many processors. Having said this, I did make some changes which are far less intrusive and which still guarantee that no pthread_barrier_destroy succeeds if there is still a thread which hasn't returned from a previous pthread_barrier_wait call. The code is written so that no part of the barrier object is touched after the last thread leaves. Your test program, although broken, runs with the changed code. The new tst-barrier4 test exercises the new functionality but, as it is clearly noted, this is no test for POSIX conformance. It tests additional functionality of NPTL. - -- · Ulrich Drepper · Red Hat, Inc. · 444 Castro St · Mountain View, CA · 
-----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.3 (GNU/Linux) iD8DBQFANDs82ijCOnn/RHQRAkNsAKCHr67CL2lw6B58UPh1VeKYZg0MYQCgmbtL 3WksxJzByjHqwJF2Vt+N3Eo= =E11K -----END PGP SIGNATURE----- From sebastien.decugis at ext.bull.net Thu Feb 19 10:51:29 2004 From: sebastien.decugis at ext.bull.net (Sebastien Decugis) Date: Thu, 19 Feb 2004 11:51:29 +0100 Subject: Barriers hanging Message-ID: <1077187880.1433.301.camel@decugiss.frec.bull.fr> > You call it a weakness because you want to depend on other functionality. > In reality, the only weak code is yours since you depend on > non-standard functionality. I am sorry, I don't understand which functionality you mean? Barriers? Or the fact that after having passed the pthread_barrier_wait I assume the barrier is free and can be destroyed? > > The proposed changes really look bad. You wake all the threads just to > have them run into the next lock. The scheduling of this is killing the > performance, especially with many threads and many processors. I was aware of this but could not find a better idea... That's the purpose of this list, isn't it? I mean, that experienced people correct and help young people, to increase the quality of NPTL...? > > > Having said this, I did make some changes which are far less intrusive > and which still guarantee that no pthread_barrier_destroy succeeds if > there is still a thread which hasn't returned from a previous > pthread_barrier_wait call. I think this is far better, since it means that a basic programmer who wants to use a barrier will have to write things like: thread_A: pthread_barrier_init(&b, NULL, 2); pthread_create(..thread_B...); pthread_barrier_wait(&b); do { rc = pthread_barrier_destroy(&b); } while (rc != 0) thread_B: pthread_barrier_wait(&b); Don't you think everyone who doesn't know anything about NPTL internals will assume that calling pthread_barrier_destroy() after pthread_barrier_wait() is a good way to ensure pthread_barrier_destroy will not fail, even if the standard is unclear? (I read the following lines a lot of times before understanding their meaning...) "The results are undefined if pthread_barrier_destroy() is called when any thread is blocked on the barrier" "[EBUSY] The implementation has detected an attempt to destroy a barrier while it is in use (for example, while being used in a pthread_barrier_wait() call) by another thread" I may have misunderstood the meaning of EBUSY, but at first I thought it meant that a thread is still blocked on the barrier... > The code is written so that no part of the > barrier object is touched after the last thread leaves. Your test > program, although broken, runs with the changed code. The new > tst-barrier4 test exercises the new functionality but, as it is clearly > noted, this is no test for POSIX conformance. It tests additional > functionality of NPTL. Once more, I don't understand why you say it is not a POSIX conformance test... The standard says at least that EBUSY should be returned when the barrier data is still used in a function, which is precisely the problem I ran into. Additionally, it seems that "if (atomic_exchange_and_add (ibarrier->left, 1) == init_count - 1)" does not compile on ia64, but I am not sure of this since it is the first time I try to compile on this target. 
I get the following errors: ../nptl/sysdeps/pthread/pthread_barrier_wait.c: In function `pthread_barrier_wait': ../nptl/sysdeps/pthread/pthread_barrier_wait.c:76: error: invalid type argument of `unary *' ../nptl/sysdeps/pthread/pthread_barrier_wait.c:76: warning: type defaults to `int' in declaration of `__result' ../nptl/sysdeps/pthread/pthread_barrier_wait.c:76: error: invalid type argument of `unary *' ../nptl/sysdeps/pthread/pthread_barrier_wait.c:76: warning: cast to pointer from integer of different size ../nptl/sysdeps/pthread/pthread_barrier_wait.c:76: error: invalid type argument of `unary *' ../nptl/sysdeps/pthread/pthread_barrier_wait.c:76: warning: cast to pointer from integer of different size make[2]: *** [/mnt/home1/home/decugiss/libc/build/nptl/pthread_barrier_wait.o] Error 1 Can someone confirm it compiles well? Thank you! Best regards, Sébastien Decugis. From sebastien.decugis at ext.bull.net Thu Feb 19 17:20:07 2004 From: sebastien.decugis at ext.bull.net (Sebastien Decugis) Date: Thu, 19 Feb 2004 18:20:07 +0100 Subject: Barriers hanging In-Reply-To: <1077187880.1433.301.camel@decugiss.frec.bull.fr> References: <1077187880.1433.301.camel@decugiss.frec.bull.fr> Message-ID: <1077211206.1432.35.camel@decugiss.frec.bull.fr> > > Having said this, I did make some changes which are far less intrusive > > and which still guarantee that no pthread_barrier_destroy succeeds if > > there is still a thread which hasn't returned from a previous > > pthread_barrier_wait call. > > I think this is far better. > Sorry for this, I had not understood your changes. It appears to me that with your new version, pthread_barrier_destroy will return only once every other thread exits pthread_barrier_wait. This is basically what I wanted, but I agree this is not implied by the POSIX spec. Anyway, I think the following assertion > "[EBUSY] > The implementation has detected an attempt to destroy a barrier > while it is in use (for example, while being used in a > pthread_barrier_wait() call) by another thread" was not OK with the previous pthread_barrier_wait / _destroy. I think that if you modify your tst-barrier4.c to loop on pthread_barrier_destroy() while the return code == EBUSY, you will have a POSIX conformance test, won't you? Best regards, Sébastien Decugis. From drepper at redhat.com Thu Feb 19 17:43:06 2004 From: drepper at redhat.com (Ulrich Drepper) Date: Thu, 19 Feb 2004 09:43:06 -0800 Subject: Barriers hanging In-Reply-To: <1077211206.1432.35.camel@decugiss.frec.bull.fr> References: <1077187880.1433.301.camel@decugiss.frec.bull.fr> <1077211206.1432.35.camel@decugiss.frec.bull.fr> Message-ID: <4034F5AA.9030403@redhat.com> Sebastien Decugis wrote: > Anyway, I think the following assertion > > >>"[EBUSY] >> The implementation has detected an attempt to destroy a barrier >> while it is in use (for example, while being used in a >> pthread_barrier_wait() call) by another thread" Read the specification. This is a "may" error. It is not necessary to recognize this situation. -- · Ulrich Drepper · Red Hat, Inc. · 444 Castro St · Mountain View, CA · From abisain at qualcomm.com Fri Feb 20 00:00:40 2004 From: abisain at qualcomm.com (Abhijeet Bisain) Date: Thu, 19 Feb 2004 16:00:40 -0800 Subject: Question about pthread_attr in NPTL Message-ID: <6.0.0.22.2.20040219154748.03a06118@mage.qualcomm.com> Hi, I have found that setting the sched_param/policy in pthread_attr does not affect the sched_policy/priority. I called getschedparam in the new thread and it returns 0 for policy and priority. 
Is this not supported? Thanks, Abhijeet From crystal.xiong at intel.com Fri Feb 20 00:54:26 2004 From: crystal.xiong at intel.com (Xiong, Crystal) Date: Fri, 20 Feb 2004 08:54:26 +0800 Subject: The failure in timer_getoverrun Message-ID: Hi all, I ran a test case for timer_getoverrun in the posixtestsuite project based on kernel 2.6.1 with libc-2004-02-01. The case didn't return the expected number of overruns. Platform: -------------------- libc: 2004-02-01 Linux kernel: 2.6.1-mm2 SMP on ia32 gcc-3.3.3-20040209 Redhat EL-3.0-update1 Test Case and Output ---------------------- The case can be found at http://cvs.sourceforge.net/viewcvs.py/posixtest/posixtestsuite/conformance/interfaces/timer_getoverrun/2-2.c Output: FAIL: 62 overruns sent; expected 75 I know there may be some problems in making nanosleep sleep for exactly the time that covers the expected number of timer expirations. But is there any way to make sure the timer expires exactly the number of times we want, so that we can get the correct overrun count? The strace log is attached. Thanks, Crystal -------------------------------------------------- -------------- next part -------------- A non-text attachment was scrubbed... Name: strace.log Type: application/octet-stream Size: 5157 bytes Desc: strace.log URL: From inaky.perez-gonzalez at intel.com Fri Feb 20 09:11:59 2004 From: inaky.perez-gonzalez at intel.com (Perez-Gonzalez, Inaky) Date: Fri, 20 Feb 2004 01:11:59 -0800 Subject: Newbie question on mutexes Message-ID: > From: Jamie Lokier [mailto:jamie at shareable.org] > Perez-Gonzalez, Inaky wrote: Well, first of all, sorry for the late response; I've been up to my neck these last two weeks. > Not doing ownership transferral is essential to avoid excessive scheduling. Agreed completely. We are _very_ aware of that problem and are actively working on a solution to it, that I've outlined already: in the RTNPTL+fusyn case, it is up to the application to decide what kind of unlock is done, either strict ownership transferal (the expensive one in scheduling terms, depending on the case) or the nonstrict, competitive one (did you call it stochastic?). Of course, we default to the latter. Now, this gives the best of both worlds, so we don't force anything. On doing an automatic guess as you mention--that makes sense and seems very doable, but for RTNPTL+fusyn we'd have to refine the selection rules, but those are implementation details. > > > One clean and general way to implement general ownership transferral > > > may be a combined wake+wait kernel primitive: you tell the kernel to > > > wait on a futex until the word has a certain matching value (and > > > futex_wake is called), then the kernel substitutes a different value > > > and wakes you. You specify both values. > > > > That's more or less the basis of fulocks :) > > Oh, you mean we can just add the function to futex and slip it in > quietly that way? :) Heh heh...you are getting ahead. I said basis because they are the basis, not the base. I think it is time for a little bit of history of how fusyn came alive and why each piece is there; I hope this will help you (and anyone else interested) understand the reason for each decision. In some sense, the RT requirements for locks boil down to: 1. O(1) operations everywhere [as possible] 2. Wake up/unlock by priority 3. Robustness (when a lock owner dies, the first waiter is woken up owning the lock with a special error code). 4. 
Minimization of priority inversion: 4.1 on the acquisition of the lock 4.2 when the lock is acquired by a lower prio owner: priority protection 4.3 when the lock is acquired by a lower prio owner: priority inheritance 5. deadlock detection 6. Uncontended locks/unlocks must happen without kernel intervention [this is always implied as needed, so I won't name it all the time during the discussion] If you remember (maybe not), my first inroad into this was a patch that modified the selection rules of the futex to have them sorted by priority using an adaptation of Ingo's prioarray list. It was the pfutex patch. This patch solved 1 and 2; the drawbacks were many: prioarray lists are huge, so the memory footprint per futex was huge--as well, we needed to take into account the possible -ENOMEM error code, as now we could not just hook each waiter to the hash chain, we needed a single node per futex and then sort them there [if not, walking the chain looking for the highest prio waiter would have made it O(N)]. This also introduced another lag, in that the "node" had to be allocated each time we waited. So, it didn't solve 3-5. The next iteration, rtfutex, aimed to solve 3 and 4, as well as the speed bump on the allocation problem. The allocation problem was solved with a cache: the "node" would be only freed by a garbage collector if it had been unused for a while; as most mutexes in well-designed programs have a low contention rate, this avoiding of allocations improved the speed of the whole thing by a few factors. Now 3 was more painful. In order to be able to pass ownership when an owner dies, the waiters must be able to identify a dead owner. The two basic cases of this are when an owner died with no waiters and when it died with waiters. The latter is simple, as long as the kernel knows about the owner, and traps it exiting without unlocking in do_exit(). This implied that the futex (or whatever) needed to have the concept of ownership, that is, of who is holding it. As well, for the kernel to be able to identify what the task owned during do_exit(), each task needs an ownership list where the locks it owns are registered. As for the first one, there is no way to know who has locked it if the word in userspace just contained some random number, depending on how the locking impl was done. We needed an identifier that we could use to map to a task. The PID comes naturally, so we chose that. This turns into the requirement for a compare-and-exchange atomic op to place a PID in the userspace word. This way, we can lock fast, and if we die, a second waiter will see the PID, go to the kernel to wait and the kernel will use this opportunity to identify the owner of the lock based on the value of the userspace word. If not found, it means the owner died and recovery is started as above. Recapping the last two paragraphs: the futex needs an ownership concept; I did this in rtfutex, and it results in a mess, because the futex is just a queue, and so is its interface. This led me to add a new type of object, the fulock, that supports the ownership concept. Now, we also need to indicate that there are waiters in the kernel, so userspace does the slow path when unlocking and we work correctly. In rtfutex we had a bit (31st) set to indicate that, but it proved to be a PITA to maintain and made the userspace operation way more complex. It worked though. 
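To make the fast path just described concrete, here is a minimal sketch of the lock/unlock scheme built around placing the owner's PID in the userspace word with compare-and-exchange, falling into the kernel only on contention. This is only an illustration under stated assumptions: atomic_cmpxchg(), FULOCK_WAITERS_BIT, sys_ufulock_lock() and sys_ufulock_unlock() are made-up names for the example, not the actual rtfutex/fusyn interfaces.

/* Sketch only: atomic_cmpxchg(ptr, old, new) is assumed to atomically
   replace *ptr with new if it currently equals old and to return the
   previous value; the sys_ufulock_*() calls stand in for whatever
   syscalls the real implementation provides. */
#include <sys/types.h>

#define FULOCK_WAITERS_BIT 0x80000000u  /* "there are waiters in the kernel" */

extern unsigned int atomic_cmpxchg (unsigned int *word,
                                    unsigned int old_val, unsigned int new_val);
extern int sys_ufulock_lock (unsigned int *word);   /* queue by priority,
                                                       detect dead owners */
extern int sys_ufulock_unlock (unsigned int *word); /* pick the next owner */

static pid_t self_pid;   /* cached kernel task id of this thread */

void fast_lock (unsigned int *word)
{
    /* Uncontended case: 0 -> our PID, no kernel involved. */
    if (atomic_cmpxchg (word, 0, (unsigned int) self_pid) != 0)
        /* Contended, dead (0xffffffff) or not-recoverable (0xfffffffe):
           let the kernel sort it out. */
        sys_ufulock_lock (word);
}

void fast_unlock (unsigned int *word)
{
    /* Uncontended case: our PID -> 0.  If the waiters bit is set, the
       word is PID | FULOCK_WAITERS_BIT and the exchange fails, so we go
       down to the kernel to hand the lock to the highest prio waiter. */
    if (atomic_cmpxchg (word, (unsigned int) self_pid, 0)
        != (unsigned int) self_pid)
        sys_ufulock_unlock (word);
}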
However, having 3 means that we need to maintain the health status of the lock, as it can be healthy (normal), dead (some previous owner died) and not-recoverable (the data protected by this lock has been deemed inconsistent and nothing can be done, abort). That is used by robust applications as a way between users of a lock to know what needs to be done. While the kernel knows about the lock, it is easy, because it is in control and we keep it there; but if the lock is removed from the cache, we need the next locker in userspace to know that the lock is dead or healthy or whatever. The only place to put that without creating more havoc is the user space word. We could use another word, contiguous, but that would make modifying it from kernel space a pain. We could use another bit, but that'd create more hassle in the lock operation, so we used a value (0xffffffff); the user space op on seeing that goes down to the kernel and the kernel does the lock operation for a dead lock, because there is no place to put the PID without overwriting the health status. That simplifies many things, and being an special case, the speed toll is acceptable. [similarly for not-recoverable, 0xfffffffe, and all operations fail]. As we are talking about the userspace word, we get to requirement number 4.1. It is solved through strict ownership transferal. We know that it's slow, so we need to switch it on ONLY when we care about it. We care about it when we are doing RT and we don't want priority inversion at all and when we want the safest robustness ever. Up to the application to switch it on, but both modes, strict and non-strict, are pretty easy to implement (as a side note, in RTNPTL+fusyn, non-strict should be a little bit faster because we don't need to hop up and down from/to the kernel). For 4.2 and 4.3, we have the usual mechanisms -- this is painful and adds a lot of code. Basically it involves walking up a chain of changes and propagating priorities. rtfutex had it, but it was _far_ from correct, clean and solid. fusyn has it way better under control (needs some improvements in the modifications to the sched.c code though). The actual reason why there is a lot of code is to protect against things disappearing below our feet. As well, for being able to do priority inheritance and protection, you need to know who is the owner, so we are back at needing the concept of ownership. And finally, number (5). This is a simple one, as it involves making sure there are no loops in the chain of waiters/owners. Side notes on (1). I want to change the priority list implementation from the O(N) that addition is now to an O(Number of priorities) == O(140) implementation, that effectively is O(1). Once this is done, the only O(N) operations in fusyn will be the hash lookup and the priority inheritance propagation. The last one has no way to be solved, it is like that. The first one could be solved doing some magic that would require no hash table, by storing a cookie straight into the user space (by the word, for example). Research material. Other O(N) ops are inherent and only used from time to time (dead owner recovery on many locks, switching to not-recoverable each waiter on a lock, etc, etc), nothing major to worry about. Sorry for the long story, but I cannot summarize it any more -- I hope it explains correctly why futexes ain't enough when we are trying to do all that. They work pretty well for what we have now, but sadly, they fall sort for other stuff. > > > mentioned above. 
Also it could stochastically prevent starvation, > > > without forcing slow convoys, by occasionally deciding to transfer > > > ownership for non-RT tasks based on a random number. > > > > But that breaks the determinism you want for RT applications. In RT > > you want to force a guy to starve. If it is starving and I don't know > > why, then I have a bigger problem. > > I mean (this is with a bit more detail to help the SMP case): > > - Ownership transferral is made possible using a futex wait+modify > operation. The modify part can fail, either if the word doesn't > have the correct comparison value, or if the waker decides not > to do ownership transferral. > > - futex_wake always does ownership transferal when there's a > higher RT priority waiter, and to the highest priority one of > course. But remember, futexes have no ownership concept; they are not a lock. Is it ok to assume you are talking about another entity similar to futex but with ownership concept? > - Otherwise, ... I see -- these are basically the policies to decide when and how to do it. I am more inclined (as I think I said somewhere) to let the application choose and we default to non-strict; maybe trying to be too smart will make unlock() expensive, maybe not, but I see the positive side of your approach, in not needing to have the application know about those issues [and we would not have the same problem that the Mars Rover had, for example...] If you don't mind, I'd like you to go over the explanation I gave above and tell me where do you agree and where do you not. I'd really like to make sure that you understand why did I make each design decision and why futex could not fulfill it, and where necessary, to understand why am I wrong. Thx, I?aky P?rez-Gonz?lez -- Not speaking for Intel -- all opinions are my own (and my fault) From s-sagawa at jp.fujitsu.com Fri Feb 20 11:56:29 2004 From: s-sagawa at jp.fujitsu.com (Shunichi Sagawa) Date: Fri, 20 Feb 2004 20:56:29 +0900 Subject: the global variable, dl_loaded, must be accessed on the exclusive conditions Message-ID: <20040220205629.6cd9c183.s-sagawa@jp.fujitsu.com> Hi, In glibc-2.3.2, I think that _dl_addr() function should access the global variable, dl_loaded, with a lock. (File name elf/dl-addr.c) _dl_addr() accesses the global variable dl_loaded. This variable is also accessed by dlopen() and dlclose(). dlopen() and dlclose() have a lock GL(dl_loaded) to protect it for multithreaded programs, but _dl_addr() has none. 3 functions shown below have the same problem. __dlvsym() : File name dlfcn/dlvsym.c dlsym() : File name dlfcn/dlsym.c _dl_fini() : FIle name elf/dl-fini.c ----start---- diff -Nur glibc-2.3.2.org/dlfcn/dlsym.c glibc-2.3.2/dlfcn/dlsym.c --- glibc-2.3.2.org/dlfcn/dlsym.c 2001-07-08 04:20:52.000000000 +0900 +++ glibc-2.3.2/dlfcn/dlsym.c 2004-02-20 15:41:38.000000000 +0900 @@ -44,9 +44,13 @@ dlsym (void *handle, const char *name) { struct dlsym_args args; + void *funchandle; args.who = RETURN_ADDRESS (0); args.handle = handle; args.name = name; - return (_dlerror_run (dlsym_doit, &args) ? NULL : args.sym); + __rtld_lock_lock_recursive (GL(dl_load_lock)); + funchandle = (_dlerror_run (dlsym_doit, &args) ? 
NULL : args.sym); + __rtld_lock_unlock_recursive (GL(dl_load_lock)); + return funchandle; } diff -Nur glibc-2.3.2.org/dlfcn/dlvsym.c glibc-2.3.2/dlfcn/dlvsym.c --- glibc-2.3.2.org/dlfcn/dlvsym.c 2001-07-08 04:20:52.000000000 +0900 +++ glibc-2.3.2/dlfcn/dlvsym.c 2004-02-20 15:42:55.000000000 +0900 @@ -45,12 +45,16 @@ __dlvsym (void *handle, const char *name, const char *version_str) { struct dlvsym_args args; + void *funchandle; args.handle = handle; args.name = name; args.who = RETURN_ADDRESS (0); args.version = version_str; - return (_dlerror_run (dlvsym_doit, &args) ? NULL : args.sym); + __rtld_lock_lock_recursive (GL(dl_load_lock)); + funchandle = (_dlerror_run (dlvsym_doit, &args) ? NULL : args.sym); + __rtld_lock_unlock_recursive (GL(dl_load_lock)); + return funchandle; } weak_alias (__dlvsym, dlvsym) diff -Nur glibc-2.3.2.org/elf/dl-addr.c glibc-2.3.2/elf/dl-addr.c --- glibc-2.3.2.org/elf/dl-addr.c 2002-09-28 12:35:22.000000000 +0900 +++ glibc-2.3.2/elf/dl-addr.c 2004-02-20 15:44:30.000000000 +0900 @@ -32,6 +32,7 @@ const char *strtab; ElfW(Word) strtabsize; + __rtld_lock_lock_recursive (GL(dl_load_lock)); /* Find the highest-addressed object that ADDRESS is not below. */ match = NULL; for (l = GL(dl_loaded); l; l = l->l_next) @@ -55,8 +56,10 @@ break; } - if (match == NULL) + if (match == NULL) { + __rtld_lock_unlock_recursive (GL(dl_load_lock)); return 0; + } /* Now we know what object the address lies in. */ info->dli_fname = match->l_name; @@ -106,6 +109,7 @@ info->dli_saddr = NULL; } + __rtld_lock_unlock_recursive (GL(dl_load_lock)); return 1; } libc_hidden_def (_dl_addr) diff -Nur glibc-2.3.2.org/elf/dl-fini.c glibc-2.3.2/elf/dl-fini.c --- glibc-2.3.2.org/elf/dl-fini.c 2002-11-09 04:35:00.000000000 +0900 +++ glibc-2.3.2/elf/dl-fini.c 2004-02-20 15:45:27.000000000 +0900 @@ -49,6 +49,7 @@ /* XXX Could it be (in static binaries) that there is no object loaded? */ assert (GL(dl_nloaded) > 0); + __rtld_lock_lock_recursive (GL(dl_load_lock)); /* Now we can allocate an array to hold all the pointers and copy the pointers in. */ maps = (struct link_map **) alloca (GL(dl_nloaded) @@ -182,4 +183,5 @@ final number of relocations from cache: %lu\n", GL(dl_num_cache_relocations)); } + __rtld_lock_unlock_recursive (GL(dl_load_lock)); } ----end---- Best Regards, Shunichi Sagawa From jamie at shareable.org Fri Feb 20 15:29:17 2004 From: jamie at shareable.org (Jamie Lokier) Date: Fri, 20 Feb 2004 15:29:17 +0000 Subject: Newbie question on mutexes In-Reply-To: References: Message-ID: <20040220152917.GD8994@mail.shareable.org> Perez-Gonzalez, Inaky wrote: > or the nonstrict, competitive one (did you call it stochastic?). No, competitive and stochastic are different. I'd say competetive scheduling for ownership transferral is where the kernel's dynamic scheduling priority heuristic decides who gets to run next. Stochastic transferral adds a twist which says ownership will be transferred even if the dynamic priorities indicate otherwise, some of the time (not very often). This is intended to break livelock starvation scenarios which occur due to the dynamic priority heuristic synchronising with the program. (The mutex starvation example recently is an example of that. Although moving competitive ownership transferral to the kernel solves that particular arrangement, there are others that it doesn't solve). The stochastic element doesn't provide any guarantee as to how fast the starvation will be broken, just that it will eventually (if the randomisation is good). 
Also, even this only breaks starvations involving a single lock; more complex ones can still livelock, I think. (I haven't given it a lot of thought; it just seems likely). > Now, this gives the best of both worlds, so we don't force anything. > > On doing an automatic guess as you mention--that makes sense and seems > very doable, but for RTNPTL+fusyn we'd have to refine the selection rules, > but those are implementation details. It's not a guess, it's a rule. :) Just like RT priority means you _definitely_ will run higher priority threads when they become runnable, it can also mean you _definitely_ pass them mutex ownership when they are waiting on the mutex. > In some sense, the RT requirements for locks boil down to: > > 1. O(1) operations everywhere [as possible] > 2. Wake up/unlock by priority > 3. Robustness (when a lock owner dies, first waiter is woken up owning > the lock with an special error code). > 4. Minimization of priority inversion: > 4.1 on the acquisition of the lock > 4.2 when the lock is acquired by a lower prio owner: priority protection > 4.3 when the lock is acquired by a lower prio owner: priority inheritance > 5. deadlock detection As you know, Linus has often rejected priority protection and inheritence within the kernel itself, on the grounds that it doesn't provide complete protection anyway, so why pretend, and is rather complicated. What's you justification for making 4 a requirement? > 6. Uncontended locks/unlocks must happen without kernel intervention > [this is always implied as needed, so I won't name it all the time > during the discussion] I'd say they must happen without kernel intervention _most_ of the time. The overhead must be low, but if say 0.1% of uncontented mutex wakeups enter the kernel, that's not much overhead and it's still O(1). So far all our implementations do it all the time, but it's feasible that an RT locking mechanism might choose to enter the kernel occasionally, for example for stochastic or accounting purposes. > If you remember, maybe not, my first inroad into this was a patch that > modified the selection rules of the futex to have them sorted by priority > using an adaptation of Ingo's prioarray list. It was the pfutex patch. Yes. > This patch solved 1 and 2; drawbacks many: prioarray lists are huge, so > the memory footprint per futex was huge--as well, we needed to take into > account the possible -ENOMEM error code, as now we could not just hook > each waiter to the hash chain, we needed a single node per futex and then > sort them there [if not, walking the chain looking for the highest prio > waiter would have made it O(N)]. This also introduced another lag, in that > the "node" had to be allocated each time we waited. So, it didn't solve > 3-5. Yes, but that was just because the implementation sucked. :) All of those things are solvable, efficiently and with no memory allocation. Well, I haven't done so but I'm confident . :) > The next iteration, rtfutex, aimed to solve 3 and 4, as well as the speed > bump on the allocation problem. The allocation problem was solved with > a cache: the "node" would be only freed by a garbage collector if it had > been unused for a while; as most mutexes in well-designed programs have > a low contention rate, this avoiding of allocations improved the speed of > the whole thing by a few factors. Sure, but the allocation is not needed. 
Instead of a hash table of waiters, you simply have a hash table of priority queues - and priority queues are simply a certain kind of tree, so each node is a small fixed size and exists in the context of each waiter. > As for the first one, there is no way to know who has locked it if the word > in userspace just contained some random number, depending on how the locking > impl was done. We needed an identifier that we could use to map to a task. > The pid comes as natural, so we chose that. This turns in the requirement for > an cmp and exchange atomic op to place a PID in the userspace word. This way, > we can lock fast, and if we die, a second waiter will see the PID, go to the > kernel to wait and the kernel will use this opportunity to identify the owner > of the lock based on the value of the userspace word. If not found, it means > it died and recovery is started as above. That isn't robust: PIDs are recycled relatively quickly. It is deliberate, to keep PIDs from being large numbers if the number of processes doesn't require that. Also, it confines locks to the process: they cannot be passed between processes easily. Also, you cannot detect when a _thread_ dies, which is sometimes useful: threads within a process are sometimes quite logically independent. Here's a proposal: each task fetches one or more "owner ids" from a large space which is not recycled quickly. Each owner id is associated with a file descriptor. (In your scenarios, this means a task opens one such descriptor and keeps it open for the duration of the task). When the file descriptor is closed, that owner id is freed and any mutexes with are are considered dead. That grants you the capability to have locks where the ownership is shared collectively (by sharing the fd between processes), and where the ownership is confined to one thread or a group of threads (by having the fd open only in those threads - that's a clone() capability, outside the scope of POSIX but useful for some libraries). > However, having 3 means that we need to maintain the health status > of the lock, as it can be healthy (normal), dead (some previous > owner died) and not-recoverable (the data protected by this lock has > been deemed inconsistent and nothing can be done, abort). That is > used by robust applications as a way between users of a lock to know > what needs to be done. While the kernel knows about the lock, it is > easy, because it is in control and we keep it there; but if the lock > is removed from the cache, we need the next locker in userspace to > know that the lock is dead or healthy or whatever. The only place to > put that without creating more havoc is the user space word. We > could use another word, contiguous, but that would make modifying it > from kernel space a pain. I agree it has to be the same word. Many platforms don't support 64-bit atomic operations, and you need a way to test-and-acquire a lock while setting the owner id iff the lock is acquired. It's unfortunate that compare-and-exchange is needed, though, as some Linux platforms don't have it - but that can be worked around, by always entering the kernel on those platforms (just for the sake of software compatibility). > 4.1. It is solved through strict ownership transferal. We know that > it's slow, so we need to switch it on ONLY when we care about it. We > care about it when we are doing RT and we don't want priority > inversion at all and when we want the safest robustness ever. 
Up to > the application to switch it on, but both modes, strict and > non-strict, are pretty easy to implement I don't see why you would ever _not_ want strict transferral when a waiter has a strictly higher RT priority. That's pretty much what RT priorities are for. When scheduling among SCHED_OTHER tasks, strict transferral doesn't provide any logical guarantees except on a lightly loaded system, simply because the scheduler doesn't provide guarantees of any kind. A task might be runnable but not run for a long time. However I can see some applications might want to request it. I think that _most_ application would want the kernel to automaticaly do ownership transferral if there's a strictly higher RT priority task waiting, and use the default scheduling otherwise. That's because the default scheduling is most efficient, so it will be requested by most code, but if your program has an RT thread, you very likely want _all_ code in the program, including libraries that aren't RT-aware, to do strict transferral whenever there's an RT thread waiting on a lock. For example: a Tcl/Tk GUI calling a library which spawns an RT data acquisition thread. Whenever the Tcl interpreter releases a lock, if the RT thread is waiting it should be passed ownership immediately, yet the Tcl interpreter code is not going to be written with the assumption that there's an RT thread waiting, and will use the default "high performance" lock mode for the object locks. Therefore that automatic decision should be the default mode implemented by the kernel. Forcing strict transferral is ok as an alternative, but it won't be the default that most library code uses. > (as a side note, in RTNPTL+fusyn, non-strict should be a little bit > faster because we don't need to hop up and down from/to the kernel). How do you avoid entering the kernel when you release a lock with a waiter, if you do non-strict transferral? Why do you need to enter the kernel when you release a lock with no waiter, if you do strict transferral? In other words, I don't see why non-strict makes any difference to the number of times you'll enter the kernel. Why? > For 4.2 and 4.3, we have the usual mechanisms -- this is painful and adds > a lot of code. Basically it involves walking up a chain of changes and > propagating priorities. rtfutex had it, but it was _far_ from correct, > clean and solid. fusyn has it way better under control (needs some > improvements in the modifications to the sched.c code though). The actual > reason why there is a lot of code is to protect against things disappearing > below our feet. As well, for being able to do priority inheritance and > protection, you need to know who is the owner, so we are back at needing > the concept of ownership. Once you get to priority inheritance, I think there's a good case that it should be implemented right in the kernel's waitqueues, as a real time option, rather than only working for userspace locks. Especially as you had to dig into sched.c anyway. That'll most likely be rejected from the main kernel tree, due to the performance penalty and complexity (unless you come up with a really good implementation), and because Linus doesn't agree with priority inheritance on principle. Then again he didn't agree with kernel pre-emption until a good implementation with demonstrable benefits appeared. > Side notes on (1). I want to change the priority list implementation from > the O(N) that addition is now to an O(Number of priorities) == O(140) > implementation, that effectively is O(1). 
Once this is done, the only > O(N) operations in fusyn will be the hash lookup and the priority > inheritance propagation. The last one has no way to be solved, it is like > that. The first one could be solved doing some magic that would require > no hash table, by storing a cookie straight into the user space (by the > word, for example). Those optimisation concerns are all in the wrong places. A priority queue is easily implemented in O(log n) where n is the number of entries in it, _or_ the number of distinct priorities in it at any time, _or_ amortised in various ways. It's possible that some "O(140)" algorithms will run faster in practice just because the pointer manipulation is simpler. In reality, you are not going to have large chains of strictly different priorities, so the inheritance propagation code should focus on being lean and simple in the common cases, where there's only zero or one other priority to worry about. Again, priority queues can be helpful for optimising these algorithms. The hash table is definitely _not_ a bottleneck. Firstly, it is effectively "O(1)" until you have a large number of simultaneously waiting tasks. If those are RT tasks, RT performance is useless anyway because you can't provide any useful RT latency guarantees when there are many thousands of RT tasks. Secondly, at the moment hash buckets contain a list of waiters hashing to the same bucket. If you had a large number of stagnant waiters all in the same bucket (probably because you created 1000 sleeping tasks all waiting on the same lock), and you were busily doing futex operations on a single address, then the busy tasks do pay a big penalty which is not necessary. The solution is for hash buckets to contain a list of _locks_, and each entry in the list having a list of waiters for that lock (or for our purposes, a each entry has a priority queue of waiters for that lock). I think this is much like what you did with those "allocated nodes" (sorry, I didn't read the code in detail), although the allocation and GC aren't needed with an appropriate data structure. > > - Ownership transferral is made possible using a futex wait+modify > > operation. The modify part can fail, either if the word doesn't > > have the correct comparison value, or if the waker decides not > > to do ownership transferral. > > > > - futex_wake always does ownership transferal when there's a > > higher RT priority waiter, and to the highest priority one of > > course. > > But remember, futexes have no ownership concept; they are not a lock. Is > it ok to assume you are talking about another entity similar to futex > but with ownership concept? No, I mean futexes as they are. They do have an ownership concept: when userspace calls into the kernel futex_wake(), _then_ the owner is clearly the task doing the call. That is the appropriate time to do ownership transferral, by waking up another task and _telling_ that task it has been woken as a result of ownership transferral. > I see -- these are basically the policies to decide when and how to > do it. I am more inclined (as I think I said somewhere) to let the > application choose and we default to non-strict; maybe trying to be > too smart will make unlock() expensive, maybe not, but I see the > positive side of your approach, in not needing to have the application > know about those issues [and we would not have the same problem that > the Mars Rover had, for example...] Do you mean your approach causes the Mars Rover problem, or mine does? 
:) I think it is fine to give application options, although it's preferable if those don't slow down the fast paths. However, I think the option applications should use by default should be one where the kernel does strict ownership transferral only to higher priority RT tasks, and applications _must_ not be expected to know whether there is such a task waiting. Enjoy, -- Jamie From jamie at shareable.org Fri Feb 20 17:48:21 2004 From: jamie at shareable.org (Jamie Lokier) Date: Fri, 20 Feb 2004 17:48:21 +0000 Subject: Newbie question on mutexes In-Reply-To: <20040220152917.GD8994@mail.shareable.org> References: <20040220152917.GD8994@mail.shareable.org> Message-ID: <20040220174821.GH8994@mail.shareable.org> ps. Anyone replying to the parent mail, please remember to change my email address in the headers, to jamie at shareable.org. Thanks, -- Jamie From drepper at redhat.com Fri Feb 20 19:29:21 2004 From: drepper at redhat.com (Ulrich Drepper) Date: Fri, 20 Feb 2004 11:29:21 -0800 Subject: the global variable, dl_loaded, must be accessed on the exclusive conditions In-Reply-To: <20040220205629.6cd9c183.s-sagawa@jp.fujitsu.com> References: <20040220205629.6cd9c183.s-sagawa@jp.fujitsu.com> Message-ID: <40366011.90107@redhat.com> This has nothing to do with nptl. It cannot be so hard to use the correct mailing lists. If no file in the nptl/ subdir of the sources is touched, send it to libc-alpha at sources.redhat.com. -- ? Ulrich Drepper ? Red Hat, Inc. ? 444 Castro St ? Mountain View, CA ? From inaky.perez-gonzalez at intel.com Sat Feb 21 17:24:01 2004 From: inaky.perez-gonzalez at intel.com (Perez-Gonzalez, Inaky) Date: Sat, 21 Feb 2004 09:24:01 -0800 Subject: Newbie question on mutexes Message-ID: > From: Jamie Lokier [mailto:jamie at shareable.org] > Perez-Gonzalez, Inaky wrote: > > or the nonstrict, competitive one (did you call it stochastic?). > > No, competitive and stochastic are different. > ... > ... True, my fault; ok, so now I get your picture. We would apply stochastic to strict transferal for SCHED_OTHER, it would make no sense for non-strict, which by definition, is competitive. > > On doing an automatic guess as you mention--that makes sense and seems > > very doable, but for RTNPTL+fusyn we'd have to refine the selection rules, > > but those are implementation details. > > It's not a guess, it's a rule. :) Potatoe, potato...:) Ok, so then by adding a force-always-strict switch for applications that want guaranteed no-matter-what robustness on SCHED_OTHER, we are fine, and only those get penalized. > As you know, Linus has often rejected priority protection and > inheritence within the kernel itself, on the grounds that it doesn't > provide complete protection anyway, so why pretend, and is rather complicated. > > What's you justification for making 4 a requirement? Well, let's say 4.1 and 4.2 -- 4.1 is not as tough and heavy Bff, let me see. It starts from vendors of equipment that want to port stuff they have working for other platforms (and I mean HUGE amounts of software stacks), and then the embedded people. I tend to agree that the best thing against it is either design the system with a lot of care (which is not always possible) or use PP. However, there are cases where there is no other choice, when the interaction between different modules is too complex as to be able to define the concept of a priority ceiling or when what is going to happen and you have to live it up to the inheritance system to resolve it. 
I know of big names who are holding on moving to Linux because of this. I also know it is going to be tough to get this past Linus. I tend to think that it will be more of a vendor-push item; it will have to live in some tree and prove itself useful and non disruptive before he will even touch it with a ten-foot pole [specially the prio inheritance/protection stuff]. > > 6. Uncontended locks/unlocks must happen without kernel intervention > > [this is always implied as needed, so I won't name it all the time > > during the discussion] > > I'd say they must happen without kernel intervention _most_ of the > time. The overhead must be low, but if say 0.1% of uncontented mutex > wakeups enter the kernel, that's not much overhead and it's still O(1). Why would they? The only ones that _really_ need to go through the kernel are: - Priority protection (they need to change prios anyway) - contended mutexes - dead mutexes [implementation detail in the fusyn case, that is one of those 0.1%] > > This patch solved 1 and 2; drawbacks many: prioarray lists are huge, so > > the memory footprint per futex was huge--as well, we needed to take into > > account the possible -ENOMEM error code, as now we could not just hook > > each waiter to the hash chain, we needed a single node per futex and then > > sort them there [if not, walking the chain looking for the highest prio > > waiter would have made it O(N)]. This also introduced another lag, in that > > the "node" had to be allocated each time we waited. So, it didn't solve > > 3-5. > > Yes, but that was just because the implementation sucked. :) You bet--it was a nice playground though. > All of those things are solvable, efficiently and with no memory > allocation. Well, I haven't done so but I'm confident . :) Please make my day--I never found a way that didn't cause havoc and was cleanly O(1) [and hash tables or trees, no matter what, ain't O(1)]. Now, the think is you always need a central node representing each mutex to which you can hook up waiters. If not all your O(1) assumptions go boom; I have been able to cut all the O(1) assumptions except for the mutex hash table [and the first allocation, of course]. If you give me a nice solution for that, I owe you a keg :) There is a suggestion by [I don't remember his name right now] on solving the allocation problem by allocating a node per task, as a task cannot be waiting for more than one mutex at the time. However, this breaks caching and some more assumptions--I think is easier to make sure the kmem cache has as many mutexes as tasks are available plus epsilon. > > The next iteration, rtfutex, aimed to solve 3 and 4, as well as the speed > > bump on the allocation problem. The allocation problem was solved with > > a cache: the "node" would be only freed by a garbage collector if it had > > been unused for a while; as most mutexes in well-designed programs have > > a low contention rate, this avoiding of allocations improved the speed of > > the whole thing by a few factors. > > Sure, but the allocation is not needed. Instead of a hash table of > waiters, you simply have a hash table of priority queues - and > priority queues are simply a certain kind of tree, so each node is a > small fixed size and exists in the context of each waiter. Exactly, and that's what it is now (or will be when I make plist be O(1)). Still you need the allocation, you cannot create the wait list head out of nowhere; you cannot allocate it on the stack. 
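As a rough illustration of the layout being argued about here -- hash buckets holding per-lock nodes, each node holding the owner and a priority-sorted wait list -- a sketch follows; the structure and field names are invented for the example and are not the actual fusyn code.

/* Illustrative only; assumes the usual Linux kernel list/spinlock types. */
#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/sched.h>

struct lock_waiter {
    struct list_head node;          /* linked into lock_node.waiters, kept
                                       sorted by prio on insertion */
    struct task_struct *task;
    int prio;
};

struct lock_node {                  /* one per contended lock; allocated on
                                       first wait, cached/GC'd afterwards */
    struct list_head hash_chain;    /* linked into its hash bucket */
    unsigned long key;              /* derived from the userspace address */
    struct task_struct *owner;      /* needed for robustness and PI */
    struct list_head waiters;       /* priority-sorted wait list */
};

struct lock_hash_bucket {
    spinlock_t lock;
    struct list_head locks;         /* a list of lock_node, not of raw
                                       waiters, so a crowd parked on one
                                       lock does not slow down lookups of
                                       the others */
};

The per-lock node above is the object whose allocation, caching and disposal the following paragraphs worry about.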
You also need to do automatic disposals [and hence the GC/cache] of it because POSIX semantics for declaration and destruction of mutexes are so brain damaged that there is no such concept of forcing the usual create/use/destroy sequence. You could end up with an application that accumulates thousands of mutexes that it created during execution and never destroyed. Analogy: an application that opens files and never closes them; sure, on exit the kernel will do it, but until then, and assuming it does not hit the limit, it is using only one fd while 9999 are unused and wasting memory; now in this case, blame the application because it didn't close(), but POSIX allows this: struct kk { int *data; mutex_t *mutex; void *private; }; void * thread_fn (void *arg) { int weird_factor; struct kk *kk = arg; weird_factor = computation_based_on (kk->private); mutex_lock(kk->mutex); *kk->data += weird_factor; mutex_unlock (kk->mutex); return NULL } void some_function() { int data = 0; mutex_t mutex = MUTEX_INITIALIZER; struct kk kk1 = { .data = &data, .mutex = &mutex, .private = private1 }; struct kk kk2 = { .data = &data, .mutex = &mutex, .private = private2 }; thread_descr_t thread_1_descr, thread_2_descr; spawn_thread (thread_1_descr, thread_fn, &kk1); spawn_thread (thread_2_descr, thread_fn, &kk2); join (thread_1_descr); join (thread_2_descr);} return data; }; Now, notice that we don't explicitly call mutex_destroy() anywhere, because POSIX doesn't require it. I (myself) would do it, but it doesn't say anywhere that it has to be done, and in fact, many programs are written like that. If we call some_function() once or twice, we are fine; but if we call it each second in a daemon, for example, we'll run out of kernel memory, because each time, potentially, the mutex will be in a different memory position and thus a new structure will be allocated. [unless you have the solution for the allocation, of course :)] > That isn't robust: PIDs are recycled relatively quickly. It is > deliberate, to keep PIDs from being large numbers if the number of > processes doesn't require that. Aha, so we need to add something more; one choice is to limit PIDs to 64k and use the rest of the word for hashing the creation time jiffies of the struct task_struct to account for the recycling (see the fixme in the last release of __ufulock_id_owner()). This, for example, being pretty simple, requires nothing more than a new getxpid() system call and doesn't involve needing to housekeep yet another lookup table of identifiers. [pseudocode] int sys_getxpid() { return current->pid | hash (&current->creation_time, 16 bits) << 16; } Verifying is also pretty simple: task = find_task_by_pid (xpid & 0xffff); if (hash (&task->creation_time, 16 bits) << 16 != xpid & 0xffff0000) return NULL; return task; Now that limits PIDs to 64k; you can play with using more or less bits in one direction or the other--and I would like to find out how the pid allocator behaves under pressure, but it is a simple way to proceed. > Also, it confines locks to the process: they cannot be passed between > processes easily. PID as in struct task_struct->pid, not as in getpid()--sorry for the confusion, I guess we should call it TID. > Also, you cannot detect when a _thread_ dies, which is sometimes > useful: threads within a process are sometimes quite logically > independent. I guess this comes from the PID as in getpid() assumption. > Here's a proposal: each task fetches one or more "owner ids" from a > large space which is not recycled quickly... 
Or just use another instance of the PID allocator--doing this is another implementation detail. I like this idea because we don't have to play any more tricks, but is yet more in-kernel memory usage for the housekeeping [and more allocations for the file descriptor]. On a side note, a daemon would keep using and using identifiers with no end and could end up exhausting the id space. > It's unfortunate that compare-and-exchange is needed, though, as some > Linux platforms don't have it - but that can be worked around, by > always entering the kernel on those platforms (just for the sake of > software compatibility). Or they can default to use normal futexes or fuqueues for the fast case and fulocks for robust ones [it would mess it up a wee bit at glibc]. Flexible :) > > 4.1. It is solved through strict ownership transferal. We know that > > it's slow, so we need to switch it on ONLY when we care about it. We > > care about it when we are doing RT and we don't want priority > > inversion at all and when we want the safest robustness ever. Up to > > the application to switch it on, but both modes, strict and > > non-strict, are pretty easy to implement > > I don't see why you would ever _not_ want strict transferral when a > waiter has a strictly higher RT priority. That's pretty much what RT > priorities are for. Wait, wait, where did I say that? I guess my phrasing was messy (you get to love English as a second language, don't you?). I meant we are doing RT _and_ as part of that, we want to close all the chances of priority inversion. > When scheduling among SCHED_OTHER tasks, strict transferral doesn't > provide any logical guarantees except on a lightly loaded system, > simply because the scheduler doesn't provide guarantees of any kind. > A task might be runnable but not run for a long time. However I can > see some applications might want to request it. I was thinking that it might help to improve the interactivity of the system, as we'd be putting tasks ready to acquire the lock to do their thing and keep going...this is weird lucubration, never mind that much. > I think that _most_ application would want the kernel to automaticaly > do ownership transferral if there's a strictly higher RT priority task > waiting, and use the default scheduling otherwise. Agreed--one tidbit though. This helps solving the convoy problem, but also opens a window for prio inversion on SMP systems when we are unlocking to a lower prio RT task and a non-RT task on another CPU just takes the lock. As long as we document that this default behaviour would have this issue and we provide switches to control how we want to do it (that you already mentioned) it should be fine. He who needs always-strict will have to avoid convoys himelf. > > (as a side note, in RTNPTL+fusyn, non-strict should be a little bit > > faster because we don't need to hop up and down from/to the kernel). > > How do you avoid entering the kernel when you release a lock with a > waiter, if you do non-strict transferral? Why do you need to enter > the kernel when you release a lock with no waiter, if you do strict > transferral? > > In other words, I don't see why non-strict makes any difference to the > number of times you'll enter the kernel. Why? First let me clarify the case I was making: I was talking about the waiters being woken up to acquire the lock. In NPTL, they come up from the kernel and if they see it locked, they go down to the kernel again and queue up. 
ufulocks do that in the kernel (because they know about the semantics), so for the case when the waiter came out of the kernel, saw it locked by somebody else and went back to sleep, we save one hop up and another down [as well as the hash lookups]. > Once you get to priority inheritance, I think there's a good case that > it should be implemented right in the kernel's waitqueues, as a real > time option, rather than only working for userspace locks. Especially > as you had to dig into sched.c anyway. It is not into the waitqueues, but it is into fulocks, which are usable inside the kernel as a mutex; waitqueues will never get it because they lack the ownership concept anyway. > That'll most likely be rejected from the main kernel tree, due to the > performance penalty and complexity (unless you come up with a really > good implementation), Yeap, the last stuff I released is still too dirty, works as a proof of concept; I need to find time to release the latest changes I have, where it is way cleaner [and still can go a wee bit more by adding a couple of fields in task_struct]. > Those optimisation concerns are all in the wrong places. > > "O(140)" algorithms will run faster in practice just because the > pointer manipulation is simpler. > > In reality, you are not going to have large chains of strictly > different priorities, so the inheritance propagation code should focus > on being lean and simple in the common cases, where there's only zero > or one other priority to worry about. Again, priority queues can be > helpful for optimising these algorithms. Agreed there, so that's why the O(140) is only for the wait list that hangs off the per-lock node. The prio inheritance algorithm is way simple, as simple as it can get, and it will always be O(N) on the number of people waiting in wait/ownership chain (A waits for F, owned by B who waits for G, owned by C who waits for H, owned by...). Having an O(1) [O(140)] algorithm for the per-lock wait list manipulation is crucial to keep it down, because if not, it can easily go ballistic. Actually, it is all we need... > The hash table is definitely _not_ a bottleneck. Firstly, it is > effectively "O(1)" until you have a large number of simultaneously > waiting tasks. If those are RT tasks, RT performance is useless > anyway because you can't provide any useful RT latency guarantees when > there are many thousands of RT tasks. It is the minute the hash chains grow up, and that will happen when you have many locks active (waited for), both in the futex case and fulocks (even longer hash chains in the futex case). In fulocks, in the hash table, you have nodes, you have one node per lock. As long as the number of locks is low, you are fine. I have seen applications that have left one thousand of locks in the hash in a single run. Now the hash is a problem. It's low priority though, somebody that does that deserves a well tempered punch. > > But remember, futexes have no ownership concept; they are not a lock. Is > > it ok to assume you are talking about another entity similar to futex > > but with ownership concept? > > No, I mean futexes as they are. They do have an ownership concept: > when userspace calls into the kernel futex_wake(), _then_ the owner is > clearly the task doing the call. That is the appropriate time to do but it is not always the owner who unlocks a lock. POSIX claims this as undefined, but everybody and their mum use it and allow it, so we need to support it. 
> > But remember, futexes have no ownership concept; they are not a lock. Is
> > it ok to assume you are talking about another entity similar to futex
> > but with ownership concept?
>
> No, I mean futexes as they are. They do have an ownership concept:
> when userspace calls into the kernel futex_wake(), _then_ the owner is
> clearly the task doing the call.

That is the appropriate time to do it, but it is not always the owner who
unlocks a lock. POSIX says this is undefined, but everybody and their mum
uses it and allows it, so we need to support it.

As well, you not only need the ownership concept, but the "am I locked?"
concept. You could argue that the userspace word does that [and it does],
but for added robustness, once the kernel knows about the lock, the kernel
takes precedence.

More on that: for robustness, do_exit() needs to know which locks we still
own, to mark them as dead and unlock the first waiter so it can take care
of the cleanup. As well, to maintain the priority inheritance semantics,
when a task queues up it has to decide whether it needs to boost the
owner; for that, it needs to access the owner. To sum up: you need the
concept of ownership to do many things.

> Do you mean your approach causes the Mars Rover problem, or mine does? :)

I mean yours would have solved it; the damn JPL guys had to command the
rover to enable PI to avoid priority inversions :] hah, talk about remote
administration.

Boy this thread is getting long...

Iñaky Pérez-González -- Not speaking for Intel -- all opinions are my own
(and my fault)

From inaky.perez-gonzalez at intel.com Mon Feb 23 22:13:23 2004
From: inaky.perez-gonzalez at intel.com (Perez-Gonzalez, Inaky)
Date: Mon, 23 Feb 2004 14:13:23 -0800
Subject: Newbie question on mutexes
Message-ID: 

Brain fart correction (thanks to Alexander)

> From: Perez-Gonzalez, Inaky
> ...
> int some_function()
> {
>     int data = 0;
>     mutex_t mutex = MUTEX_INITIALIZER;
>     struct kk kk1 = { .data = &data, .mutex = &mutex, .private = private1 };
>     struct kk kk2 = { .data = &data, .mutex = &mutex, .private = private2 };
>     thread_descr_t thread_1_descr, thread_2_descr;
>     spawn_thread (thread_1_descr, thread_fn, &kk1);
>     spawn_thread (thread_2_descr, thread_fn, &kk2);
>     join (thread_1_descr);
>     join (thread_2_descr);
>     return data;
> }
>
> Now, notice that we don't explicitly call mutex_destroy() anywhere,
> because POSIX doesn't require it. I (myself) would do it, but it
> doesn't say anywhere that it has to be done, and in fact, many
> programs are written like that.
> ...

"I (myself) would" should read "I (myself) would not", and, more
important: MUTEX_INITIALIZER is valid only for statically allocated
mutexes. I don't know why I was so sold on this; maybe because I have read
too much (obviously incorrect) code that uses it and I was stupid enough
not to check against the letter of POSIX. My fault, with no more excuses.

This kind of renders moot my argument for a strict need of automatic
disposal--I still think it is needed, to simplify the caching, but maybe
there are (once again) cracks in my argumentation.

Iñaky Pérez-González -- Not speaking for Intel -- all opinions are my own
(and my fault)
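For completeness, here is one way the corrected example could look with
the real pthreads calls instead of the pseudo-API quoted above (a sketch
only; the worker function and the argument struct are made up for
illustration): the automatic-storage mutex is initialised with
pthread_mutex_init() rather than the static initializer, and it is
destroyed only after both threads have joined, so it is no longer in use
when it goes away.

/* Sketch of the corrected pattern with the real pthread API; names
 * (worker, struct arg) are hypothetical. */
#include <pthread.h>

struct arg { int *data; pthread_mutex_t *mutex; };

static void *worker(void *p)
{
    struct arg *a = p;
    pthread_mutex_lock(a->mutex);
    (*a->data)++;
    pthread_mutex_unlock(a->mutex);
    return NULL;
}

int some_function(void)
{
    int data = 0;
    pthread_mutex_t mutex;              /* automatic storage ... */
    pthread_mutex_init(&mutex, NULL);   /* ... so no static initializer */

    struct arg a1 = { &data, &mutex }, a2 = { &data, &mutex };
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, &a1);
    pthread_create(&t2, NULL, worker, &a2);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    pthread_mutex_destroy(&mutex);      /* both threads joined: safe */
    return data;
}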
From sebastien.decugis at ext.bull.net Tue Feb 24 16:53:32 2004
From: sebastien.decugis at ext.bull.net (Sebastien Decugis)
Date: Tue, 24 Feb 2004 17:53:32 +0100
Subject: NPTL tests coverage
Message-ID: <1077641609.7978.5.camel@decugiss.frec.bull.fr>

Hello,

I have started writing a whitepaper, "NPTL testing coverage", which will
describe the current situation. I will include tests from the Linux Test
Project (most of which come from the Open POSIX Test Suite) and those from
the libc/nptl source tree. I am focusing on two main axes:
-> conformance to POSIX
-> function stress

This whitepaper will then be used to extend the existing testsuites in
order to improve the coverage in conformance and in stressing.

With this in mind, I am currently examining tests from the LTP and from
NPTL. Whereas the LTP already contains a short descriptive text for each
test, this seems not to be the case for NPTL. Would you be interested in
me writing those abstracts?

If anybody has comments/suggestions, they are welcome.

Best regards

--
Sébastien DECUGIS
Bull S.A.

From crystal.xiong at intel.com Wed Feb 25 05:41:43 2004
From: crystal.xiong at intel.com (Xiong, Crystal)
Date: Wed, 25 Feb 2004 13:41:43 +0800
Subject: NPTL tests coverage
Message-ID: 

LTP only ported the stable version of the POSIX Test Suite (PTS),
currently Version 1.3.0. Starting from that version, we have made some
changes to the code and fixed some bugs in the message queue, timer and
threads test cases. So if you want to use the latest PTS code, please
check it out directly from the PTS CVS:

cvs -d:pserver:anonymous at cvs.sourceforge.net:/cvsroot/posixtest login
cvs -z3 -d:pserver:anonymous at cvs.sourceforge.net:/cvsroot/posixtest co modulename

Thanks,
Crystal
---------------------------------------------
* This is only my personal opinion *

> -----Original Message-----
> From: phil-list-admin at redhat.com [mailto:phil-list-admin at redhat.com] On
> Behalf Of Sebastien Decugis
> Sent: 2004年2月25日 0:54
> To: phil-list at redhat.com
> Subject: NPTL tests coverage
>
> Hello,
>
> I have started writing a whitepaper, "NPTL testing coverage", which will
> describe the current situation. I will include tests from the Linux Test
> Project (most of which come from the Open POSIX Test Suite) and those
> from the libc/nptl source tree. I am focusing on two main axes:
> -> conformance to POSIX
> -> function stress
>
> This whitepaper will then be used to extend the existing testsuites in
> order to improve the coverage in conformance and in stressing.
>
> With this in mind, I am currently examining tests from the LTP and from
> NPTL. Whereas the LTP already contains a short descriptive text for each
> test, this seems not to be the case for NPTL. Would you be interested in
> me writing those abstracts?
>
> If anybody has comments/suggestions, they are welcome.
>
> Best regards
>
> --
> Sébastien DECUGIS
> Bull S.A.
>
>
> --
> Phil-list mailing list
> Phil-list at redhat.com
> https://www.redhat.com/mailman/listinfo/phil-list

From drepper at redhat.com Fri Feb 27 08:23:54 2004
From: drepper at redhat.com (Ulrich Drepper)
Date: Fri, 27 Feb 2004 00:23:54 -0800
Subject: setinheritsched
Message-ID: <403EFE9A.20908@redhat.com>

In case anybody is interested, the pthread_attr_setinheritsched
functionality should now be correctly implemented.

--
? Ulrich Drepper ? Red Hat, Inc. ? 444 Castro St ? Mountain View, CA ?
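For anyone wanting to exercise that, here is a minimal usage sketch (an
illustration only, not taken from the change itself; error handling is
omitted, the function and thread names are made up, and SCHED_FIFO
normally requires appropriate privileges): without the
PTHREAD_EXPLICIT_SCHED call, the policy and priority set on the attribute
object are ignored and the new thread simply inherits the creator's
scheduling.

/* Minimal illustration of pthread_attr_setinheritsched. */
#include <pthread.h>
#include <sched.h>

static void *rt_thread(void *arg) { return arg; }

int start_fifo_thread(pthread_t *tid, int prio)
{
    pthread_attr_t attr;
    struct sched_param sp = { .sched_priority = prio };

    pthread_attr_init(&attr);
    /* Without this, the policy/priority below would be ignored and the
     * new thread would inherit the creator's scheduling attributes. */
    pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
    pthread_attr_setschedpolicy(&attr, SCHED_FIFO);
    pthread_attr_setschedparam(&attr, &sp);

    int rc = pthread_create(tid, &attr, rt_thread, NULL);
    pthread_attr_destroy(&attr);
    return rc;
}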