Question regarding pthread_cancel and pthread_cond_timedwait

Fri Mar 25 07:03:36 UTC 2005

We have a threading library which has been in production for
six years and currently functions
on Solaris 2.6-2.9 Sparc, Solaris 2.7-2.10 x86, HP-UX 11.00,
Tru64 5.1(a,b), AIX 4.3.x and AIX 5.x.

The library starts 5-8 threads within the current process. The
operation runs to completion (with or without error), and the
threads either complete on their own or are canceled and then
complete, depending on what happened during processing.

At some later time this is repeated N times without the main process
exiting. The threads are NOT detached.

The problem occurs on Fedora Core 3 if a thread has exited and
pthread_cancel is called with the thread id of that completed thread.

If a thread has exited and we call pthread_cancel with that thread id on
Fedora Core 3
( version info
 getconf GNU_LIBPTHREAD_VERSION
 NPTL 2.3.4
 >uname -a
 Linux irl-73-26 2.6.10-1.770_FC3 #1 Thu Feb 24 14:00:06 EST 2005 i686 i686 i386 GNU/Linux
)

the application segfaults.  Is this the expected behavior?

I am also getting a segfault when pthread_cond_timedwait is called. I
am still determining the exact state when the segfault occurred. The
back trace shows

#0  0x005c57a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0x00839dbc in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/tls/libpthread.so.0

The directory listing shows:
ls -l /lib/tls/
total 1936
drwxr-xr-x  2 root root    4096 Mar 23 04:03 i486
drwxr-xr-x  2 root root    4096 Mar 23 04:03 i586
drwxr-xr-x  2 root root    4096 Mar 23 04:03 i686
-rwxr-xr-x  1 root root 1524828 Dec 21 02:04 libc-2.3.4.so
lrwxrwxrwx  1 root root      13 Mar 22 18:42 libc.so.6 -> libc-2.3.4.so
-rwxr-xr-x  1 root root  215272 Dec 21 02:04 libm-2.3.4.so
lrwxrwxrwx  1 root root      13 Mar 22 18:42 libm.so.6 -> libm-2.3.4.so
-rwxr-xr-x  1 root root  108560 Dec 21 02:04 libpthread-2.3.4.so
lrwxrwxrwx  1 root root      19 Mar 22 18:42 libpthread.so.0 -> libpthread-2.3.4.so
-rwxr-xr-x  1 root root   50984 Dec 21 02:04 librt-2.3.4.so
lrwxrwxrwx  1 root root      14 Mar 22 18:42 librt.so.1 -> librt-2.3.4.so
-rwxr-xr-x  1 root root   32308 Dec 21 02:04 libthread_db-1.0.so
lrwxrwxrwx  1 root root      19 Mar 22 18:42 libthread_db.so.1 -> libthread_db-1.0.so

Is this what NPTL on Fedora Core 3 does TODAY? Or is there a problem
in the sequence in which our code releases mutexes or condition
variables that would cause this behavior on Fedora Core 3?
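
For reference, POSIX specifies that pthread_cond_timedwait is a
cancellation point and that a thread canceled while blocked in it
reacquires the mutex before its cleanup handlers run. The usual pattern
is therefore to release the mutex from a cleanup handler. The following
is a minimal sketch of that pattern only, not the library's actual code;
every name in it (wait_lock, ready_cond, ready, unlock_on_cancel,
waiter_main, cancel_waiter_and_check_mutex) is hypothetical, and the
30 second timeout is an arbitrary assumption.

```c
#include <pthread.h>
#include <time.h>

/* Hypothetical shared state, for illustration only. */
static pthread_mutex_t wait_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t ready_cond = PTHREAD_COND_INITIALIZER;
static int ready = 0;

/* Cleanup handler: runs if the waiter is canceled inside the wait,
 * at which point the mutex has been reacquired by the thread. */
static void unlock_on_cancel(void *arg)
{
    pthread_mutex_unlock((pthread_mutex_t *)arg);
}

static void *waiter_main(void *arg)
{
    struct timespec deadline;
    int rc = 0;

    (void)arg;
    /* pthread_cond_timedwait takes an ABSOLUTE time, not an interval. */
    deadline.tv_sec = time(NULL) + 30;
    deadline.tv_nsec = 0;

    pthread_mutex_lock(&wait_lock);
    pthread_cleanup_push(unlock_on_cancel, &wait_lock);
    while (!ready && rc == 0)
        rc = pthread_cond_timedwait(&ready_cond, &wait_lock, &deadline);
    pthread_cleanup_pop(1);   /* non-zero: run the handler, i.e. unlock */
    return NULL;
}

/* Cancels a waiter that is still running and not yet joined, then checks
 * that the mutex was left unlocked.
 * Returns 1 if the mutex is free, 0 if not, -1 on a setup error. */
static int cancel_waiter_and_check_mutex(void)
{
    pthread_t tid;
    void *result;

    if (pthread_create(&tid, NULL, waiter_main, NULL) != 0)
        return -1;
    if (pthread_cancel(tid) != 0)
        return -1;
    if (pthread_join(tid, &result) != 0)
        return -1;
    if (result != PTHREAD_CANCELED)
        return 0;
    if (pthread_mutex_trylock(&wait_lock) != 0)
        return 0;             /* the canceled thread left the mutex locked */
    pthread_mutex_unlock(&wait_lock);
    return 1;
}
```

Without the cleanup handler the canceled thread would terminate while
still owning wait_lock, and any later lock attempt on it would block.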

We maintain internal thread exit status, so I can skip cancelling the
threads which have successfully exited. We normally just cancel
everything we started, as a big hammer to make sure every thread shuts
down and exits. We can make the abort function a bit smarter, since it
has access to our internal thread status, if need be.
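
A minimal sketch of what such a smarter abort function could look like
is below. It is an illustration under assumptions, not the library's
real code: the struct worker table, the finished flag, status_lock,
start_workers and abort_workers are all hypothetical names, and the
odd/even split in worker_main only exists to give the example some
threads that finish and some that do not. The flag is set and tested
under one mutex, so a thread that is seen as "not finished" is still
running when pthread_cancel is called on it.

```c
#include <pthread.h>
#include <unistd.h>

#define NUM_WORKERS 4

/* Hypothetical per-thread bookkeeping, for illustration only. */
struct worker {
    pthread_t tid;
    int id;
    int finished;   /* set by the worker itself just before it returns */
};

static pthread_mutex_t status_lock = PTHREAD_MUTEX_INITIALIZER;
static struct worker workers[NUM_WORKERS];

static void *worker_main(void *arg)
{
    struct worker *w = arg;

    if (w->id % 2 != 0) {
        for (;;)
            sleep(1);         /* never finishes; sleep is a cancellation point */
    }
    /* No cancellation point between here and the return. */
    pthread_mutex_lock(&status_lock);
    w->finished = 1;
    pthread_mutex_unlock(&status_lock);
    return NULL;
}

/* Starts all workers as joinable threads. Returns 0 on success. */
static int start_workers(void)
{
    int i;

    for (i = 0; i < NUM_WORKERS; i++) {
        workers[i].id = i;
        workers[i].finished = 0;
        if (pthread_create(&workers[i].tid, NULL, worker_main, &workers[i]) != 0)
            return -1;
    }
    return 0;
}

/* Cancels only the workers that have not reported completion, then joins
 * every worker exactly once. Returns how many were actually canceled. */
static int abort_workers(void)
{
    int i, canceled = 0;
    void *result;

    for (i = 0; i < NUM_WORKERS; i++) {
        pthread_mutex_lock(&status_lock);
        if (!workers[i].finished)
            pthread_cancel(workers[i].tid);   /* returns an error number; errno is not used */
        pthread_mutex_unlock(&status_lock);
    }
    for (i = 0; i < NUM_WORKERS; i++) {
        if (pthread_join(workers[i].tid, &result) == 0 && result == PTHREAD_CANCELED)
            canceled++;
    }
    return canceled;
}
```

Calling start_workers() followed by abort_workers() never passes a
finished thread's id to pthread_cancel, and no thread id is used again
after its pthread_join.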

On the OSes I mentioned above, pthread_cancel returns 0 on success. On
failure:

On HP-UX 11.00 pthread_cancel returns the value ESRCH, errno is NOT set.

On Solaris SPARC and x86: same as HP-UX 11.00.

On AIX: same as HP-UX and Solaris.

On Tru64 pthread_cancel returns EINVAL or ESRCH, errno is not set.

Eric Bruno.