Newbie question on mutexes

Perez-Gonzalez, Inaky inaky.perez-gonzalez at intel.com
Sat Feb 21 17:24:01 UTC 2004


> From: Jamie Lokier [mailto:jamie at shareable.org]
> Perez-Gonzalez, Inaky wrote:
> > or the nonstrict, competitive one (did you call it stochastic?).
> 
> No, competitive and stochastic are different.
> ...
> ...

True, my fault; ok, so now I get your picture. We would apply
stochastic to strict transferal for SCHED_OTHER; it would make 
no sense for non-strict, which, by definition, is competitive.

> > On doing an automatic guess as you mention--that makes sense and seems
> > very doable, but for RTNPTL+fusyn we'd have to refine the selection rules,
> > but those are implementation  details.
> 
> It's not a guess, it's a rule. :)

Potatoe, potato...:) 

Ok, so then by adding a force-always-strict switch for applications 
that want guaranteed no-matter-what robustness on SCHED_OTHER, we 
are fine, and only those get penalized.

> As you know, Linus has often rejected priority protection and
> inheritance within the kernel itself, on the grounds that it doesn't
> provide complete protection anyway, so why pretend, and is rather complicated.
> 
> What's your justification for making 4 a requirement?

Well, let's say 4.1 and 4.2 -- 4.1 is not as tough and heavy.

Bff, let me see. It starts from vendors of equipment that want to port
stuff they have working on other platforms (and I mean HUGE amounts
of software stacks), and then the embedded people. I tend to agree that
the best defence against it is either to design the system with a lot of
care (which is not always possible) or to use PP. However, there are 
cases where there is no other choice, when the interaction between
different modules is too complex to be able to define the concept
of a priority ceiling, or when you cannot predict what is going to happen
and you have to leave it up to the inheritance system to resolve it.

I know of big names who are holding off on moving to Linux because of
this.

I also know it is going to be tough to get this past Linus. I tend
to think that it will be more of a vendor-push item; it will have
to live in some tree and prove itself useful and non-disruptive
before he will even touch it with a ten-foot pole [especially the
prio inheritance/protection stuff].

> > 6. Uncontended locks/unlocks must happen without kernel intervention
> >    [this is always implied as needed, so I won't name it all the time
> >     during the discussion]
> 
> I'd say they must happen without kernel intervention _most_ of the
> time.  The overhead must be low, but if say 0.1% of uncontended mutex
> wakeups enter the kernel, that's not much overhead and it's still O(1).

Why would they? The only ones that _really_ need to go through the
kernel are:

  - Priority protection (they need to change prios anyway)
  - Contended mutexes
  - Dead mutexes [implementation detail in the fusyn case, that
    is one of those 0.1%]
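
Something like this, just to illustrate the fast path; a sketch with
made-up names and lock-word values, where the GCC __sync builtins stand
in for whatever atomic primitive the platform provides:

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

#define UNLOCKED  0
#define LOCKED    1	/* locked, no waiters known */
#define CONTENDED 2	/* locked, somebody may be waiting */

/* exchange built on compare-and-exchange */
static int atomic_xchg (int *p, int newval)
{
	int old;
	do
		old = *p;
	while (__sync_val_compare_and_swap (p, old, newval) != old);
	return old;
}

static void fastpath_lock (int *word)
{
	/* uncontended case: one compare-and-exchange, no syscall */
	if (__sync_val_compare_and_swap (word, UNLOCKED, LOCKED) == UNLOCKED)
		return;
	/* contended case: flag the word and wait in the kernel */
	while (atomic_xchg (word, CONTENDED) != UNLOCKED)
		syscall (SYS_futex, word, FUTEX_WAIT, CONTENDED, NULL, NULL, 0);
}

static void fastpath_unlock (int *word)
{
	/* uncontended case: nobody flagged the word, no syscall */
	if (atomic_xchg (word, UNLOCKED) == LOCKED)
		return;
	/* somebody may be waiting: wake one up */
	syscall (SYS_futex, word, FUTEX_WAKE, 1, NULL, NULL, 0);
}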

> > This patch solved 1 and 2; drawbacks many: prioarray lists are huge, so
> > the memory footprint per futex was huge--as well, we needed to take into
> > account the possible -ENOMEM error code, as now we could not just hook
> > each waiter to the hash chain, we needed a single node per futex and then
> > sort them there [if not, walking the chain looking for the highest prio
> > waiter would have made it O(N)]. This also introduced another lag, in that
> > the "node" had to be allocated each time we waited. So, it didn't solve
> > 3-5.
> 
> Yes, but that was just because the implementation sucked. :)

You bet--it was a nice playground though.

> All of those things are solvable, efficiently and with no memory
> allocation.  Well, I haven't done so but I'm confident <swagger>. :)

Please make my day--I never found a way that didn't cause havoc
and was cleanly O(1) [and hash tables or trees, no matter what,
ain't O(1)].

Now, the thing is you always need a central node representing
each mutex to which you can hook up waiters. If not, all your O(1) 
assumptions go boom; I have managed to keep everything O(1)
except for the mutex hash table [and the first allocation, of course]. 
If you give me a nice solution for that, I owe you a keg :)

There is a suggestion by [I don't remember his name right now] on
solving the allocation problem by allocating a node per task, as a task
cannot be waiting for more than one mutex at a time. However, this
breaks caching and some more assumptions--I think it is easier to make
sure the kmem cache has as many mutexes as there are tasks, plus
epsilon.
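
To make it concrete, something in this spirit -- a made-up sketch, not
the actual fusyn code; the names are invented, and hash locking and the
cache/hash initialization are omitted:

#include <linux/list.h>
#include <linux/hash.h>
#include <linux/slab.h>
#include <linux/sched.h>

#define FULOCK_HASH_BITS 8

static struct list_head fulock_hash[1 << FULOCK_HASH_BITS];
static kmem_cache_t *fulock_cache;	/* sized to roughly one node per task */

struct fulock_node {
	struct list_head hash_chain;	/* chain in its hash bucket */
	unsigned long key;		/* identifies the user-space lock word */
	struct list_head waiters;	/* priority-sorted wait list */
	struct task_struct *owner;
};

static struct fulock_node *fulock_find_or_create (unsigned long key)
{
	struct fulock_node *node;
	struct list_head *bucket =
		&fulock_hash[hash_long (key, FULOCK_HASH_BITS)];

	list_for_each_entry (node, bucket, hash_chain)
		if (node->key == key)
			return node;	/* known lock, no allocation needed */
	node = kmem_cache_alloc (fulock_cache, GFP_KERNEL);
	if (node == NULL)
		return NULL;		/* the -ENOMEM case mentioned before */
	node->key = key;
	node->owner = NULL;
	INIT_LIST_HEAD (&node->waiters);
	list_add (&node->hash_chain, bucket);
	return node;
}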

> > The next iteration, rtfutex, aimed to solve 3 and 4, as well as the speed
> > bump on the allocation problem. The allocation problem was solved with
> > a cache: the "node" would be only freed by a garbage collector if it had
> > been unused for a while; as most mutexes in well-designed programs have
> > a low contention rate, this avoiding of allocations improved the speed of
> > the whole thing by a few factors.
> 
> Sure, but the allocation is not needed.  Instead of a hash table of
> waiters, you simply have a hash table of priority queues - and
> priority queues are simply a certain kind of tree, so each node is a
> small fixed size and exists in the context of each waiter.

Exactly, and that's what it is now (or will be when I make plist be O(1)).
Still, you need the allocation: you cannot create the wait list head out
of nowhere, and you cannot allocate it on the stack.

You also need automatic disposal of it [and hence the GC/cache],
because POSIX semantics for declaration and destruction of mutexes are 
so brain-damaged that there is no concept of forcing the usual 
create/use/destroy sequence. You could end up with an application
that accumulates thousands of mutexes that it created during execution 
and never destroyed.

Analogy: an application that opens files and never closes them; sure,
on exit the kernel will close them, but until then, and assuming it does
not hit the limit, it is using only one fd while 9999 are unused and
wasting memory. Now, in this case, blame the application because it
didn't close(), but POSIX allows this:

struct kk {
	int *data;
	mutex_t	*mutex;
	void *private;
};

void * thread_fn (void *arg)
{
	int weird_factor;
	struct kk *kk = arg;
	weird_factor = computation_based_on (kk->private);
	mutex_lock(kk->mutex);
	*kk->data += weird_factor;
	mutex_unlock (kk->mutex);
	return NULL;
}
int some_function () 
{
	int data = 0;
	mutex_t mutex = MUTEX_INITIALIZER;
	struct kk kk1 = { .data = &data, .mutex = &mutex, .private = private1 };
	struct kk kk2 = { .data = &data, .mutex = &mutex, .private = private2 };	
	thread_descr_t thread_1_descr, thread_2_descr;
	spawn_thread (thread_1_descr, thread_fn, &kk1);
	spawn_thread (thread_2_descr, thread_fn, &kk2);
	join (thread_1_descr);
	join (thread_2_descr);
	return data;
}

Now, notice that we don't explicitly call mutex_destroy() anywhere,
because POSIX doesn't require it. I (myself) would do it, but it 
doesn't say anywhere that it has to be done, and in fact, many 
programs are written like that. 

If we call some_function() once or twice, we are fine; but if we call
it every second in a daemon, for example, we'll run out of kernel memory,
because each time the mutex will potentially be at a different memory
address and thus a new structure will be allocated.

[unless you have the solution for the allocation, of course :)]
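
The cache/GC side looks roughly like this, reusing the made-up names
from the sketch above and assuming the node also carries an lru_node
link and a last_used timestamp; again just a sketch, not the real code:

static LIST_HEAD (fulock_lru);		/* all nodes, for the GC to walk */
static spinlock_t fulock_hash_lock = SPIN_LOCK_UNLOCKED;
static struct timer_list fulock_gc_timer;

#define FULOCK_GC_AGE    (60 * HZ)	/* reclaim after a minute of idleness */
#define FULOCK_GC_PERIOD (10 * HZ)

static void fulock_gc (unsigned long data)
{
	struct fulock_node *node, *next;

	spin_lock (&fulock_hash_lock);
	list_for_each_entry_safe (node, next, &fulock_lru, lru_node) {
		if (!list_empty (&node->waiters))
			continue;	/* somebody is waiting, keep it */
		if (time_before (jiffies, node->last_used + FULOCK_GC_AGE))
			continue;	/* touched recently, keep it */
		list_del (&node->hash_chain);
		list_del (&node->lru_node);
		kmem_cache_free (fulock_cache, node);
	}
	spin_unlock (&fulock_hash_lock);
	mod_timer (&fulock_gc_timer, jiffies + FULOCK_GC_PERIOD);
}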

> That isn't robust: PIDs are recycled relatively quickly.  It is
> deliberate, to keep PIDs from being large numbers if the number of
> processes doesn't require that.

Aha, so we need to add something more; one choice is to limit
PIDs to 64K and use the remaining 16 bits to hash in the creation-time
jiffies of the struct task_struct, to account for the recycling (see
the fixme in the last release of __ufulock_id_owner()).

This, for example, being pretty simple, requires nothing more than
a new getxpid() system call and doesn't involve having to housekeep
yet another lookup table of identifiers.

[pseudocode] 

int sys_getxpid (void)
{
	/* low 16 bits: the pid; high 16 bits: a 16-bit hash of the creation time */
	return current->pid | (hash (&current->creation_time, 16) << 16);
}

Verifying is also pretty simple:

	task = find_task_by_pid (xpid & 0xffff);
	if (task == NULL
	    || (hash (&task->creation_time, 16) << 16) != (xpid & 0xffff0000))
		return NULL;
	return task;
 
Now, that limits PIDs to 64K; you can play with using more or fewer bits
in one direction or the other--and I would like to find out how the pid
allocator behaves under pressure--but it is a simple way to proceed.

> Also, it confines locks to the process: they cannot be passed between
> processes easily.

PID as in struct task_struct->pid, not as in getpid()--sorry for
the confusion, I guess we should call it TID.
 
> Also, you cannot detect when a _thread_ dies, which is sometimes
> useful: threads within a process are sometimes quite logically
> independent.

I guess this comes from the PID as in getpid() assumption.

> Here's a proposal: each task fetches one or more "owner ids" from a
> large space which is not recycled quickly...

Or just use another instance of the PID allocator--doing this is another
implementation detail. I like this idea because we don't have to play
any more tricks, but it is yet more in-kernel memory usage for the 
housekeeping [and more allocations for the file descriptor]. 

On a side note, a daemon would keep consuming identifiers without
end and could end up exhausting the id space.

> It's unfortunate that compare-and-exchange is needed, though, as some
> Linux platforms don't have it - but that can be worked around, by
> always entering the kernel on those platforms (just for the sake of
> software compatibility).

Or they can default to using normal futexes or fuqueues for the fast
case and fulocks for robust ones [it would mess things up a wee bit in
glibc]. Flexible :)

> > 4.1. It is solved through strict ownership transferal. We know that
> > it's slow, so we need to switch it on ONLY when we care about it. We
> > care about it when we are doing RT and we don't want priority
> > inversion at all and when we want the safest robustness ever.  Up to
> > the application to switch it on, but both modes, strict and
> > non-strict, are pretty easy to implement
> 
> I don't see why you would ever _not_ want strict transferral when a
> waiter has a strictly higher RT priority.  That's pretty much what RT
> priorities are for.

Wait, wait, where did I say that? I guess my phrasing was messy (you
get to love English as a second language, don't you?). I meant that we
are doing RT _and_, as part of that, we want to close off all the 
chances of priority inversion.

> When scheduling among SCHED_OTHER tasks, strict transferral doesn't
> provide any logical guarantees except on a lightly loaded system,
> simply because the scheduler doesn't provide guarantees of any kind.
> A task might be runnable but not run for a long time.  However I can
> see some applications might want to request it.

I was thinking that it might help to improve the interactivity of
the system, as we'd be letting tasks that are ready to acquire the lock
do their thing and keep going... but this is weird lucubration, don't
mind it too much.

> I think that _most_ application would want the kernel to automaticaly
> do ownership transferral if there's a strictly higher RT priority task
> waiting, and use the default scheduling otherwise.

Agreed--one tidbit though. This helps solve the convoy problem, but
also opens a window for prio inversion on SMP systems when we are unlocking
to a lower-prio RT task and a non-RT task on another CPU just takes the
lock. 

As long as we document that this default behaviour has this
issue and we provide switches to control how we want to do it 
(which you already mentioned), it should be fine. He who needs 
always-strict will have to avoid convoys himself.

> > (as a side note, in RTNPTL+fusyn, non-strict should be a little bit
> > faster because we don't need to hop up and down from/to the kernel).
> 
> How do you avoid entering the kernel when you release a lock with a
> waiter, if you do non-strict transferral?  Why do you need to enter
> the kernel when you release a lock with no waiter, if you do strict
> transferral?
>
> In other words, I don't see why non-strict makes any difference to the
> number of times you'll enter the kernel.  Why?

First let me clarify the case I was making: I was talking about the
waiters being woken up to acquire the lock.

In NPTL, they come up from the kernel and if they see it locked, they
go down to the kernel again and queue up.

ufulocks do that in the kernel (because they know about the semantics),
so for the case when the waiter came out of the kernel, saw it locked
by somebody else and went back to sleep, we save one hop up and another
down [as well as the hash lookups].

> Once you get to priority inheritance, I think there's a good case that
> it should be implemented right in the kernel's waitqueues, as a real
> time option, rather than only working for userspace locks.  Especially
> as you had to dig into sched.c anyway.

It is not in the waitqueues, but in fulocks, which are usable
inside the kernel as a mutex; waitqueues will never get it because
they lack the ownership concept anyway.

> That'll most likely be rejected from the main kernel tree, due to the
> performance penalty and complexity (unless you come up with a really
> good implementation), 

Yeap, the last stuff I released is still too dirty, it works as a proof of 
concept; I need to find time to release the latest changes I have,
where it is way cleaner [and it can still go a wee bit further by adding
a couple of fields in task_struct].

> Those optimisation concerns are all in the wrong places.
> 
> "O(140)" algorithms will run faster in practice just because the
> pointer manipulation is simpler.
>
> In reality, you are not going to have large chains of strictly
> different priorities, so the inheritance propagation code should focus
> on being lean and simple in the common cases, where there's only zero
> or one other priority to worry about.  Again, priority queues can be
> helpful for optimising these algorithms.

Agreed there, and that's why the O(140) is only for the wait list 
that hangs off the per-lock node. The prio inheritance algorithm
is way simple, as simple as it can get, and it will always be
O(N) on the number of people in the wait/ownership chain 
(A waits for F, owned by B who waits for G, owned by C who waits
for H, owned by...). Having an O(1) [O(140)] algorithm for the
per-lock wait list manipulation is crucial to keep that down, 
because if not, it can easily go ballistic. Actually, it is all
we need...
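
For reference, this is the kind of O(1) [O(140)] wait list I mean, in
the style of the scheduler's prioarray; the names here are made up for
illustration, it is not the actual plist code, and initialization of
the list heads is omitted:

#include <linux/list.h>
#include <linux/bitops.h>

#define WAITLIST_PRIOS 140

struct prio_waitlist {
	unsigned long bitmap[(WAITLIST_PRIOS + BITS_PER_LONG - 1) / BITS_PER_LONG];
	struct list_head queue[WAITLIST_PRIOS];	/* one FIFO per priority */
};

struct waiter {
	int prio;			/* 0 is the highest priority */
	struct list_head node;
};

static void waitlist_add (struct prio_waitlist *wl, struct waiter *w)
{
	list_add_tail (&w->node, &wl->queue[w->prio]);	/* FIFO within a prio */
	__set_bit (w->prio, wl->bitmap);
}

static struct waiter *waitlist_pop (struct prio_waitlist *wl)
{
	struct waiter *w;
	int prio = find_first_bit (wl->bitmap, WAITLIST_PRIOS);

	if (prio >= WAITLIST_PRIOS)
		return NULL;				/* nobody is waiting */
	w = list_entry (wl->queue[prio].next, struct waiter, node);
	list_del (&w->node);
	if (list_empty (&wl->queue[prio]))
		__clear_bit (prio, wl->bitmap);
	return w;
}

Adding or removing a waiter never depends on how many waiters there
are; the only fixed cost is the 140-bit bitmap scan.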
 
> The hash table is definitely _not_ a bottleneck.  Firstly, it is
> effectively "O(1)" until you have a large number of simultaneously
> waiting tasks.  If those are RT tasks, RT performance is useless
> anyway because you can't provide any useful RT latency guarantees when
> there are many thousands of RT tasks.

It is, the minute the hash chains grow, and that will happen when
you have many locks active (being waited for), both in the futex case
and with fulocks (with even longer hash chains in the futex case).

With fulocks you have nodes in the hash table, one node per lock. As
long as the number of locks is low, you are fine. I have seen
applications that left a thousand locks in the hash in a single run.
Then the hash is a problem.

It's low priority though; somebody that does that deserves a well-
tempered punch. 

> > But remember, futexes have no ownership concept; they are not a lock. Is
> > it ok to assume you are talking about another entity similar to futex
> > but with ownership concept?
> 
> No, I mean futexes as they are.  They do have an ownership concept:
> when userspace calls into the kernel futex_wake(), _then_ the owner is
> clearly the task doing the call.  That is the appropriate time to do

But it is not always the owner who unlocks a lock. POSIX calls this
undefined, but everybody and their mum uses and allows it, so
we need to support it.

As well, you not only need the ownership concept, but the "am I locked?"
concept. You could argue the userspace word does that [and it does],
but for added robustness, once the kernel knows about the lock, it
takes precedence.

More on that, robustness: do_exit() needs to know which locks we
still own, to mark them as dead and wake the first waiter so it can
take care of it. 

As well, to maintain the prio inheritance semantics, when a task queues
up, it has to decide if it needs to boost the owner; for that, it needs
to access it.

To sum up: you need the concept of ownership to do many things.
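
Roughly like this, reusing the made-up fulock_node from the earlier
sketch and assuming it also carries an ownership_node link and a dead
flag; the owned_fulocks list in task_struct and the wake helper are
hypothetical:

/* called from do_exit(), conceptually: every lock the dying task still
 * owns is marked dead and handed to its first waiter to recover */
static void exit_fulocks (struct task_struct *task)
{
	struct fulock_node *node, *next;

	list_for_each_entry_safe (node, next, &task->owned_fulocks,
				  ownership_node) {
		node->dead = 1;		/* waiters will see the lock as dead */
		node->owner = NULL;
		list_del_init (&node->ownership_node);
		fulock_wake_first_waiter (node);	/* hypothetical helper */
	}
}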

> Do you mean your approach causes the Mars Rover problem, or mine does? :)

I mean yours would have solved it; damn JPL guys had to command the
rover to enable PI to avoid priority inversions :] hah, talk about
remote administration.

Boy this thread is getting long...

Iñaky Pérez-González -- Not speaking for Intel -- all opinions are my own (and my fault)




