Newbie question on mutexes

Fri Feb 20 09:11:59 UTC 2004

> From: Jamie Lokier [mailto:jamie at shareable.org]
> Perez-Gonzalez, Inaky wrote:

Well, first of all, sorry for the late response; I've been up to
my neck these last two weeks.

> Not doing ownership transferral is essential to avoid excessive scheduling.

Agreed completely. We are _very_ aware of that problem and are actively
working on a solution to it, that I've outlined already: on the 
RTNPTL+fusyn case, it is up to the application to decide what kind of 
unlock is done, if strict ownership transferal (expensive one in scheduling 
terms  in depending which cases) or the nonstrict, competitive one (did 
you call it stochastic?). Of course, we default to the last one.

Now, this gives the best of both worlds, so we don't force anything.

On doing an automatic guess as you mention--that makes sense and seems 
very doable, but for RTNPTL+fusyn we'd have to refine the selection rules, 
but those are implementation  details.

> > > One clean and general way to implement general ownership transferral
> > > may be a combined wake+wait kernel primitive: you tell the kernel to
> > > wait on a futex until the word has a certain matching value (and
> > > futex_wake is called), then the kernel substitutes a different value
> > > and wakes you.  You specify both values.
> >
> > That's more or less the basis of fulocks :)
> 
> Oh, you mean we can just add the function to futex and slip it in
> quietly that way? :)

Heh heh...you are getting ahead. 

I said basis because they are the basis, not the base. I think it is time
for a little bit of history of how fusyn came alive and why each piece is
there; I hope this will help you (and anyone else interested) understand 
the reason for each decission. 

In some sense, the RT requirements for locks boil down to:

1. O(1) operations everywhere [as possible]
2. Wake up/unlock by priority
3. Robustness (when a lock owner dies, first waiter is woken up owning
   the lock with an special error code).
4. Minimization of priority inversion:
4.1 on the acquisition of the lock
4.2 when the lock is acquired by a lower prio owner: priority protection
4.3 when the lock is acquired by a lower prio owner: priority inheritance
5. deadlock detection
6. Uncontended locks/unlocks must happen without kernel intervention
   [this is always implied as needed, so I won't name it all the time
    during the discussion]

If you remember, maybe not, my first inroad into this was a patch that
modified the selection rules of the futex to have them sorted by priority
using an adaptation of Ingo's prioarray list. It was the pfutex patch.

This patch solved 1 and 2; drawbacks many: prioarray lists are huge, so 
the memory footprint per futex was huge--as well, we needed to take into
account the possible -ENOMEM error code, as now we could not just hook
each waiter to the hash chain, we needed a single node per futex and then
sort them there [if not, walking the chain looking for the highest prio
waiter would have made it O(N)]. This also introduced another lag, in that
the "node" had to be allocated each time we waited. So, it didn't solve
3-5.

The next iteration, rtfutex, aimed to solve 3 and 4, as well as the speed 
bump on the allocation problem. The allocation problem was solved with 
a cache: the "node" would be only freed by a garbage collector if it had 
been unused for a while; as most mutexes in well-designed programs have 
a low contention rate, this avoiding of allocations improved the speed of 
the whole thing by a few factors.

Now 3 was more painful. In order to be able to pass ownership when an
owner dies, the waiters must be able to identify a dead owner. The two basic
cases of this are when an owner died with no waiters and when it died with
waiters. The later is simple, as long as the kernel knows about the owner,
and traps it exiting without unlocking in do_exit(). This implied that the
futex (or whatever) needed to have the concept of ownership, as for who is
owning it. As well, for the kernel to be able to identify who it owned
during do_exit(), each task needs an ownership list where the locks it 
owns are registered.

As for the first one, there is no way to know who has locked it if the word
in userspace just contained some random number, depending on how the locking
impl was done. We needed an identifier that we could use to map to a task.
The pid comes as natural, so we chose that. This turns in the requirement for
an cmp and exchange atomic op to place a PID in the userspace word. This way,
we can lock fast, and if we die, a second waiter will see the PID, go to the
kernel to wait and the kernel will use this opportunity to identify the owner
of the lock based on the value of the userspace word. If not found, it means 
it died and recovery is started as above.

Recapping the last two paragraphs: the futex needs an ownership concept;
I did this in rtfutex, and results in a mess, because the futex is just
a queue, and so is its interface. This led me to add a new type of object,
the fulock, that supports the ownership concept.

Now, we also need to indicate that there are waiters in the kernel, so userspace
does the slow path when unlocking and we work correctly. In rtfutex we had a
bit (31st) set to indicate that, but it proved to be a PITA to maintain and
made the userspace operation way more complex. It worked though.

However, having 3 means that we need to maintain the health status of the
lock, as it can be healthy (normal), dead (some previous owner died) and
not-recoverable (the data protected by this lock has been deemed inconsistent
and nothing can be done, abort). That is used by robust applications as a way
between users of a lock to know what needs to be done. While the kernel 
knows about the lock, it is easy, because it is in control and we keep it
there; but if the lock is removed from the cache, we need the next locker
in userspace to know that the lock is dead or healthy or whatever. The only
place to put that without creating more havoc is the user space word. We
could use another word, contiguous, but that would make modifying it from
kernel space a pain.

We could use another bit, but that'd create more hassle in the lock operation, 
so we used a value (0xffffffff); the user space op on seeing that goes down to
the kernel and the kernel does the lock operation for a dead lock, because
there is no place to put the PID without overwriting the health status. That 
simplifies many things, and being an special case, the speed toll is acceptable.

[similarly for not-recoverable, 0xfffffffe, and all operations fail].

As we are talking about the userspace word, we get to requirement number
4.1. It is solved through strict ownership transferal. We know that it's slow,
so we need to switch it on ONLY when we care about it. We care about it
when we are doing RT and we don't want priority inversion at all and when 
we want the safest robustness ever. Up to the application to switch it on,
but both modes, strict and non-strict, are pretty easy to implement (as
a side note, in RTNPTL+fusyn, non-strict should be a little bit faster
because we don't need to hop up and down from/to the kernel).

For 4.2 and 4.3, we have the usual mechanisms -- this is painful and adds
a lot of code. Basically it involves walking up a chain of changes and
propagating priorities. rtfutex had it, but it was _far_ from correct,
clean and solid. fusyn has it way better under control (needs some 
improvements in the modifications to the sched.c code though). The actual
reason why there is a lot of code is to protect against things disappearing
below our feet. As well, for being able to do priority inheritance and
protection, you need to know who is the owner, so we are back at needing
the concept of ownership.

And finally, number (5). This is a simple one, as it involves making sure
there are no loops in the chain of waiters/owners.

Side notes on (1). I want to change the priority list implementation from
the O(N) that addition is now to an O(Number of priorities) == O(140)
implementation, that effectively is O(1). Once this is done, the only
O(N) operations in fusyn will be the hash lookup and the priority 
inheritance propagation. The last one has no way to be solved, it is like
that. The first one could be solved doing some magic that would require
no hash table, by storing a cookie straight into the user space (by the
word, for example). Research material. Other O(N) ops are inherent and
only used from time to time (dead owner recovery on many locks, switching
to not-recoverable each waiter on a lock, etc, etc), nothing major to
worry about.

Sorry for the long story, but I cannot summarize it any more -- I hope
it explains correctly why futexes ain't enough when we are trying to
do all that. They work pretty well for what we have now, but sadly, they
fall sort for other stuff.

> > > mentioned above.  Also it could stochastically prevent starvation,
> > > without forcing slow convoys, by occasionally deciding to transfer
> > > ownership for non-RT tasks based on a random number.
> >
> > But that breaks the determinism you want for RT applications. In RT
> > you want to force a guy to starve. If it is starving and I don't know
> > why, then I have a bigger problem.
> 
> I mean (this is with a bit more detail to help the SMP case):
> 
>     - Ownership transferral is made possible using a futex wait+modify
>       operation.  The modify part can fail, either if the word doesn't
>       have the correct comparison value, or if the waker decides not
>       to do ownership transferral.
> 
>     - futex_wake always does ownership transferal when there's a
>       higher RT priority waiter, and to the highest priority one of
>       course.

But remember, futexes have no ownership concept; they are not a lock. Is
it ok to assume you are talking about another entity similar to futex
but with ownership concept?

>     - Otherwise, ...

I see -- these are basically the policies to decide when and how to
do it. I am more inclined (as I think I said somewhere) to let the
application choose and we default to non-strict; maybe trying to be
too smart will make unlock() expensive, maybe not, but I see the 
positive side of your approach, in not needing to have the application
know about those issues [and we would not have the same problem that
the Mars Rover had, for example...]

If you don't mind, I'd like you to go over the explanation I gave
above and tell me where do you agree and where do you not. I'd 
really like to make sure that you understand why did I make each design
decision and why futex could not fulfill it, and where necessary, to
understand why am I wrong.

Thx,

Iñaky Pérez-González -- Not speaking for Intel -- all opinions are my own (and my fault)