Newbie question on mutexes

Fri Feb 20 15:29:17 UTC 2004

Perez-Gonzalez, Inaky wrote:
> or the nonstrict, competitive one (did you call it stochastic?).

No, competitive and stochastic are different.

I'd say competetive scheduling for ownership transferral is where the
kernel's dynamic scheduling priority heuristic decides who gets to run
next.

Stochastic transferral adds a twist which says ownership will be
transferred even if the dynamic priorities indicate otherwise, some of
the time (not very often).

This is intended to break livelock starvation scenarios which occur
due to the dynamic priority heuristic synchronising with the program.

(The mutex starvation example recently is an example of that.
Although moving competitive ownership transferral to the kernel solves
that particular arrangement, there are others that it doesn't solve).

The stochastic element doesn't provide any guarantee as to how fast
the starvation will be broken, just that it will eventually (if the
randomisation is good).  Also, even this only breaks starvations
involving a single lock; more complex ones can still livelock, I
think.  (I haven't given it a lot of thought; it just seems likely).

> Now, this gives the best of both worlds, so we don't force anything.
> 
> On doing an automatic guess as you mention--that makes sense and seems 
> very doable, but for RTNPTL+fusyn we'd have to refine the selection rules, 
> but those are implementation  details.

It's not a guess, it's a rule. :)

Just like RT priority means you _definitely_ will run higher priority
threads when they become runnable, it can also mean you _definitely_
pass them mutex ownership when they are waiting on the mutex.

> In some sense, the RT requirements for locks boil down to:
> 
> 1. O(1) operations everywhere [as possible]
> 2. Wake up/unlock by priority
> 3. Robustness (when a lock owner dies, first waiter is woken up owning
>    the lock with an special error code).
> 4. Minimization of priority inversion:
> 4.1 on the acquisition of the lock
> 4.2 when the lock is acquired by a lower prio owner: priority protection
> 4.3 when the lock is acquired by a lower prio owner: priority inheritance
> 5. deadlock detection

As you know, Linus has often rejected priority protection and
inheritence within the kernel itself, on the grounds that it doesn't
provide complete protection anyway, so why pretend, and is rather complicated.

What's you justification for making 4 a requirement?

> 6. Uncontended locks/unlocks must happen without kernel intervention
>    [this is always implied as needed, so I won't name it all the time
>     during the discussion]

I'd say they must happen without kernel intervention _most_ of the
time.  The overhead must be low, but if say 0.1% of uncontented mutex
wakeups enter the kernel, that's not much overhead and it's still O(1).

So far all our implementations do it all the time, but it's feasible
that an RT locking mechanism might choose to enter the kernel
occasionally, for example for stochastic or accounting purposes.

> If you remember, maybe not, my first inroad into this was a patch that
> modified the selection rules of the futex to have them sorted by priority
> using an adaptation of Ingo's prioarray list. It was the pfutex patch.

Yes.

> This patch solved 1 and 2; drawbacks many: prioarray lists are huge, so 
> the memory footprint per futex was huge--as well, we needed to take into
> account the possible -ENOMEM error code, as now we could not just hook
> each waiter to the hash chain, we needed a single node per futex and then
> sort them there [if not, walking the chain looking for the highest prio
> waiter would have made it O(N)]. This also introduced another lag, in that
> the "node" had to be allocated each time we waited. So, it didn't solve
> 3-5.

Yes, but that was just because the implementation sucked. :)

All of those things are solvable, efficiently and with no memory
allocation.  Well, I haven't done so but I'm confident <swagger>. :)

> The next iteration, rtfutex, aimed to solve 3 and 4, as well as the speed 
> bump on the allocation problem. The allocation problem was solved with 
> a cache: the "node" would be only freed by a garbage collector if it had 
> been unused for a while; as most mutexes in well-designed programs have 
> a low contention rate, this avoiding of allocations improved the speed of 
> the whole thing by a few factors.

Sure, but the allocation is not needed.  Instead of a hash table of
waiters, you simply have a hash table of priority queues - and
priority queues are simply a certain kind of tree, so each node is a
small fixed size and exists in the context of each waiter.

> As for the first one, there is no way to know who has locked it if the word
> in userspace just contained some random number, depending on how the locking
> impl was done. We needed an identifier that we could use to map to a task.
> The pid comes as natural, so we chose that. This turns in the requirement for
> an cmp and exchange atomic op to place a PID in the userspace word. This way,
> we can lock fast, and if we die, a second waiter will see the PID, go to the
> kernel to wait and the kernel will use this opportunity to identify the owner
> of the lock based on the value of the userspace word. If not found, it means 
> it died and recovery is started as above.

That isn't robust: PIDs are recycled relatively quickly.  It is
deliberate, to keep PIDs from being large numbers if the number of
processes doesn't require that.

Also, it confines locks to the process: they cannot be passed between
processes easily.

Also, you cannot detect when a _thread_ dies, which is sometimes
useful: threads within a process are sometimes quite logically
independent.

Here's a proposal: each task fetches one or more "owner ids" from a
large space which is not recycled quickly.  Each owner id is
associated with a file descriptor.  (In your scenarios, this means a
task opens one such descriptor and keeps it open for the duration of
the task).  When the file descriptor is closed, that owner id is freed
and any mutexes with are are considered dead.

That grants you the capability to have locks where the ownership is
shared collectively (by sharing the fd between processes), and where
the ownership is confined to one thread or a group of threads (by
having the fd open only in those threads - that's a clone()
capability, outside the scope of POSIX but useful for some libraries).

> However, having 3 means that we need to maintain the health status
> of the lock, as it can be healthy (normal), dead (some previous
> owner died) and not-recoverable (the data protected by this lock has
> been deemed inconsistent and nothing can be done, abort). That is
> used by robust applications as a way between users of a lock to know
> what needs to be done. While the kernel knows about the lock, it is
> easy, because it is in control and we keep it there; but if the lock
> is removed from the cache, we need the next locker in userspace to
> know that the lock is dead or healthy or whatever. The only place to
> put that without creating more havoc is the user space word. We
> could use another word, contiguous, but that would make modifying it
> from kernel space a pain.

I agree it has to be the same word.  Many platforms don't support
64-bit atomic operations, and you need a way to test-and-acquire a
lock while setting the owner id iff the lock is acquired.

It's unfortunate that compare-and-exchange is needed, though, as some
Linux platforms don't have it - but that can be worked around, by
always entering the kernel on those platforms (just for the sake of
software compatibility).

> 4.1. It is solved through strict ownership transferal. We know that
> it's slow, so we need to switch it on ONLY when we care about it. We
> care about it when we are doing RT and we don't want priority
> inversion at all and when we want the safest robustness ever.  Up to
> the application to switch it on, but both modes, strict and
> non-strict, are pretty easy to implement

I don't see why you would ever _not_ want strict transferral when a
waiter has a strictly higher RT priority.  That's pretty much what RT
priorities are for.

When scheduling among SCHED_OTHER tasks, strict transferral doesn't
provide any logical guarantees except on a lightly loaded system,
simply because the scheduler doesn't provide guarantees of any kind.
A task might be runnable but not run for a long time.  However I can
see some applications might want to request it.

I think that _most_ application would want the kernel to automaticaly
do ownership transferral if there's a strictly higher RT priority task
waiting, and use the default scheduling otherwise.

That's because the default scheduling is most efficient, so it will be
requested by most code, but if your program has an RT thread, you very
likely want _all_ code in the program, including libraries that aren't
RT-aware, to do strict transferral whenever there's an RT thread
waiting on a lock.  For example: a Tcl/Tk GUI calling a library which
spawns an RT data acquisition thread.  Whenever the Tcl interpreter
releases a lock, if the RT thread is waiting it should be passed
ownership immediately, yet the Tcl interpreter code is not going to be
written with the assumption that there's an RT thread waiting, and
will use the default "high performance" lock mode for the object locks.

Therefore that automatic decision should be the default mode
implemented by the kernel.  Forcing strict transferral is ok as an
alternative, but it won't be the default that most library code uses.

> (as a side note, in RTNPTL+fusyn, non-strict should be a little bit
> faster because we don't need to hop up and down from/to the kernel).

How do you avoid entering the kernel when you release a lock with a
waiter, if you do non-strict transferral?  Why do you need to enter
the kernel when you release a lock with no waiter, if you do strict
transferral?

In other words, I don't see why non-strict makes any difference to the
number of times you'll enter the kernel.  Why?

> For 4.2 and 4.3, we have the usual mechanisms -- this is painful and adds
> a lot of code. Basically it involves walking up a chain of changes and
> propagating priorities. rtfutex had it, but it was _far_ from correct,
> clean and solid. fusyn has it way better under control (needs some 
> improvements in the modifications to the sched.c code though). The actual
> reason why there is a lot of code is to protect against things disappearing
> below our feet. As well, for being able to do priority inheritance and
> protection, you need to know who is the owner, so we are back at needing
> the concept of ownership.

Once you get to priority inheritance, I think there's a good case that
it should be implemented right in the kernel's waitqueues, as a real
time option, rather than only working for userspace locks.  Especially
as you had to dig into sched.c anyway.

That'll most likely be rejected from the main kernel tree, due to the
performance penalty and complexity (unless you come up with a really
good implementation), and because Linus doesn't agree with priority
inheritance on principle.

Then again he didn't agree with kernel pre-emption until a good
implementation with demonstrable benefits appeared.

> Side notes on (1). I want to change the priority list implementation from
> the O(N) that addition is now to an O(Number of priorities) == O(140)
> implementation, that effectively is O(1). Once this is done, the only
> O(N) operations in fusyn will be the hash lookup and the priority 
> inheritance propagation. The last one has no way to be solved, it is like
> that. The first one could be solved doing some magic that would require
> no hash table, by storing a cookie straight into the user space (by the
> word, for example).

Those optimisation concerns are all in the wrong places.

A priority queue is easily implemented in O(log n) where n is the
number of entries in it, _or_ the number of distinct priorities in it
at any time, _or_ amortised in various ways.  It's possible that some
"O(140)" algorithms will run faster in practice just because the
pointer manipulation is simpler.

In reality, you are not going to have large chains of strictly
different priorities, so the inheritance propagation code should focus
on being lean and simple in the common cases, where there's only zero
or one other priority to worry about.  Again, priority queues can be
helpful for optimising these algorithms.

The hash table is definitely _not_ a bottleneck.  Firstly, it is
effectively "O(1)" until you have a large number of simultaneously
waiting tasks.  If those are RT tasks, RT performance is useless
anyway because you can't provide any useful RT latency guarantees when
there are many thousands of RT tasks.

Secondly, at the moment hash buckets contain a list of waiters hashing
to the same bucket.  If you had a large number of stagnant waiters all
in the same bucket (probably because you created 1000 sleeping tasks
all waiting on the same lock), and you were busily doing futex
operations on a single address, then the busy tasks do pay a big
penalty which is not necessary.

The solution is for hash buckets to contain a list of _locks_, and
each entry in the list having a list of waiters for that lock (or for
our purposes, a each entry has a priority queue of waiters for that
lock).  I think this is much like what you did with those "allocated
nodes" (sorry, I didn't read the code in detail), although the
allocation and GC aren't needed with an appropriate data structure.

> >     - Ownership transferral is made possible using a futex wait+modify
> >       operation.  The modify part can fail, either if the word doesn't
> >       have the correct comparison value, or if the waker decides not
> >       to do ownership transferral.
> > 
> >     - futex_wake always does ownership transferal when there's a
> >       higher RT priority waiter, and to the highest priority one of
> >       course.
> 
> But remember, futexes have no ownership concept; they are not a lock. Is
> it ok to assume you are talking about another entity similar to futex
> but with ownership concept?

No, I mean futexes as they are.  They do have an ownership concept:
when userspace calls into the kernel futex_wake(), _then_ the owner is
clearly the task doing the call.  That is the appropriate time to do
ownership transferral, by waking up another task and _telling_ that
task it has been woken as a result of ownership transferral.

> I see -- these are basically the policies to decide when and how to
> do it. I am more inclined (as I think I said somewhere) to let the
> application choose and we default to non-strict; maybe trying to be
> too smart will make unlock() expensive, maybe not, but I see the 
> positive side of your approach, in not needing to have the application
> know about those issues [and we would not have the same problem that
> the Mars Rover had, for example...]

Do you mean your approach causes the Mars Rover problem, or mine does? :)

I think it is fine to give application options, although it's
preferable if those don't slow down the fast paths.  However, I think
the option applications should use by default should be one where the
kernel does strict ownership transferral only to higher priority RT
tasks, and applications _must_ not be expected to know whether there
is such a task waiting.

Enjoy,
-- Jamie