Re: Fast thread-local storage for OpenGL drivers



On Mon, Feb 24, 2003 at 10:48:01AM -0800, Gareth Hughes wrote:
> > For the dispatch tables I even remember suggesting to:
> > ...
> > b) in addition to that, you can build an .a library with the above 5 lines
> >    per .o file's source plus .hidden Foo which would make apps/libraries
> >    using openGL even faster (as they wouldn't hop through PLT, which is
> >    one memory load and indirect jump through the loaded value) at the expense
> >    of making offset_Foo part of the openGL ABI (which as far as I understood
> >    already is anyway because of the binary modules).
> > c) or you could inline the calls
> 
> By default, these are forbidden by the GNU/Linux OpenGL ABI.

I'm not claiming they should be the default. The question is whether it can
be done as an optimization explicitly requested by the application,
e.g. by linking with -lGLfast, which would resolve to a linker script containing
GROUP (libGLfast_nonshared.a libGL.so)

> > In the May thread, I'm pretty sure you mentioned __indirect* routines
> > which are the biggest part of libGL.so are rarely used, which means the
> > definitely should be compiled with -fpic, the rest if it is really
> > performance critical can be put into awx sections using
> > __attribute__((section("..."))).
> 
> Sorry, I'm not quite sure what you mean here...

http://sources.redhat.com/ml/libc-alpha/2002-05/msg00158.html

libGL.so is the one library where having DT_TEXTREL or not really matters.
The drivers are dlopened, while libGL.so is linked into lots of apps, even
those which either never do GL or use it only very rarely.
For them, not being prelink(8)able and having to resolve 12k relocations
(or however many there are) at each program startup slows things down
considerably. On top of that, with DT_TEXTREL ld.so has to mprotect the
whole shared library writable, apply the relocations (which effectively
makes the whole shared library not shared anymore) and mprotect it back
read-only.
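
To check whether a running process has actually loaded any object that
carries text relocations, something along these lines could be used. It is
only a sketch, assuming Linux with glibc (dl_iterate_phdr from <link.h>);
the names count_textrel_cb and count_textrel_objects are made up for
illustration and are not part of libGL.

```c
#define _GNU_SOURCE
#include <link.h>
#include <stdio.h>

/* Callback for dl_iterate_phdr (hypothetical helper name): bumps the
   counter for every loaded object whose dynamic section has DT_TEXTREL
   or the DF_TEXTREL bit set in DT_FLAGS. */
static int count_textrel_cb(struct dl_phdr_info *info, size_t size, void *data)
{
    int *count = data;
    int i;

    (void) size;
    for (i = 0; i < info->dlpi_phnum; i++) {
        const ElfW(Phdr) *ph = &info->dlpi_phdr[i];
        const ElfW(Dyn) *dyn;

        if (ph->p_type != PT_DYNAMIC)
            continue;
        /* The dynamic section lives at load address + p_vaddr. */
        dyn = (const ElfW(Dyn) *) (info->dlpi_addr + ph->p_vaddr);
        for (; dyn->d_tag != DT_NULL; dyn++) {
            if (dyn->d_tag == DT_TEXTREL
                || (dyn->d_tag == DT_FLAGS
                    && (dyn->d_un.d_val & DF_TEXTREL))) {
                printf("text relocations in: %s\n", info->dlpi_name);
                (*count)++;
                break;
            }
        }
    }
    return 0; /* keep iterating over the remaining objects */
}

/* Returns how many currently loaded objects need text relocations. */
int count_textrel_objects(void)
{
    int count = 0;

    dl_iterate_phdr(count_textrel_cb, &count);
    return count;
}
```

Calling count_textrel_objects() from a program linked against libGL.so
would report the library if it was built with DT_TEXTREL.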

If the above mentioned mail says that __indirect calls are fallback
and not used in performance critical paths, then they surely should be
compiled with -fpic, which cuts down the number of relocations considerably.
Also, if using gcc >= 3.2 (quite common these days in Linux distributions,
unlike in May 2002 when the above mail exchange happened), it should use
#if __GNUC__ > 3 || (__GNUC__ == 3 && __GNUC_MINOR__ >= 2)
# define GLprivate __attribute__((visibility("hidden")))
#else
# define GLprivate /* Nothing */
#endif
and use it in the declarations of functions which are private to libGL.so
yet needed by more than one of the .o files libGL.so is linked from. I believe
the vast majority of __indirect_* functions qualify, e.g.:

void GLprivate __indirect_glFlush(void);

in indirect.h would mark __indirect_glFlush that way.

Then there are performance critical functions in libGL which you want
to compile with -fno-pic because you really need the %ebx register for other
things (note that with GLprivate used where it makes sense, fewer things
will need -fno-pic, as e.g. most of the calls will not have to
go through the PLT even with -fpic). I think the dispatcher falls into this
category, maybe a few routines besides it; you know the code, I don't.
For those, you should put them into separate .c files so that you never
mix performance critical functions (to be compiled with -fno-pic on arches
which allow it) with the rest.
Then you could define something like:
#if __GNUC__ >= 2 && defined __i386__ /* Perhaps name here other arches which allow -fno-pic shlib at all */
# define GLnopic __attribute__((section ("openGL_wtext")))
# define GLdeclnopic __asm__(".section openGL_wtext, \"awx\"; .previous");
#else
# define GLnopic /* Nothing */
# define GLdeclnopic /* Nothing */
#endif

with usage:

void GLnopic GLdispatchme (void);

(either in the .h header with prototypes, or when defining the function),
and

GLdeclnopic

in one of the .c files which are compiled with -fpic (and thus don't
define any GLnopic function).

When linking the shared library, you'd need to create a special linker
script for it:

$CC $CFLAGS -shared -Wl,--verbose 2>&1 \
  | LC_ALL=C sed -e '/^=========/,/^=========/!d;/^=========/d' \
    -e 's/[[:blank:]]\.data[[:blank:]]/openGL_wtext : { *(openGL_wtext) } &/' \
    > libGL.so.lds
which would then be used when linking libGL.so:
$CC ... -shared -o libGL.so.xxx -Wl,-T,libGL.so.lds ...

This way, you can build a libGL.so which will load quickly,
be shareable among applications and still be fast where it matters
(benchmarking can prove this, or show which other function
needs -fno-pic, so that you can move it and add GLnopic).

Then, as far as I remember, there is another source of a large number
of relocations which IMHO is not used in performance critical places
(at least when I last looked at libGL, in August): the
static_functions array. 3 relocations per entry and the array is very big.
You can optimize this easily if each line of the table is not:
{ "glNewList", (GLvoid *) glNewList, _gloffset_NewList },
but STATIC_FUNCTION ( "glNewList", glNewList, _gloffset_NewList )
(or maybe just STATIC_FUNCTION (NewList) )
and the rest is hidden in the macros (with accessor
macros for code like get_static_proc_{offset,address} and
_glapi_get_proc_name).
The macros could expand to what the code looks like at the moment on most
arches, while on, say, IA-32 the entries could instead be encoded relative
to _GLOBAL_OFFSET_TABLE_, so that there are no dynamic relocations for the
whole array (this is @gotoff in IA-32 asm); on x86-64 and maybe other arches
they could be PC relative and use 4 bytes instead of 8 bytes per
pointer, etc. If the array was small, this would be overkill, but
in my eyes when we're talking about ~2000 relocations for one
array it is worth doing something about it.
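
A minimal sketch of the macro layer, showing only the generic expansion
(plain pointers, i.e. what most arches would keep). The two gl* functions,
the _gloffset_* values, the STATIC_FUNCTION_* accessor names and the
sketch_* lookup functions are stand-ins invented for illustration; they are
not the real glapi code. An IA-32 @gotoff or x86-64 PC-relative encoding
cannot be written in portable C and would need per-arch asm behind the
same macros, so it is not shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the real entry points and dispatch offsets. */
static void glNewList(void) {}
static void glEndList(void) {}
enum { _gloffset_NewList = 0, _gloffset_EndList = 1 };

typedef void (*static_proc)(void);

/* Generic layout: the two pointers in each entry need dynamic relocations. */
struct static_function {
    const char *name;
    static_proc address;
    unsigned int offset;
};

/* Table entry plus accessors; an arch-specific variant would change only
   these four macros (and the struct), not the table or its users. */
#define STATIC_FUNCTION(n)         { "gl" #n, gl##n, _gloffset_##n }
#define STATIC_FUNCTION_NAME(e)    ((e)->name)
#define STATIC_FUNCTION_ADDRESS(e) ((e)->address)
#define STATIC_FUNCTION_OFFSET(e)  ((e)->offset)

static const struct static_function static_functions[] = {
    STATIC_FUNCTION(NewList),
    STATIC_FUNCTION(EndList),
};

/* Linear search by name, going through the accessor macros only. */
static const struct static_function *find_entry(const char *name)
{
    size_t i;

    assert(name != NULL);
    for (i = 0; i < sizeof static_functions / sizeof static_functions[0]; i++)
        if (strcmp(STATIC_FUNCTION_NAME(&static_functions[i]), name) == 0)
            return &static_functions[i];
    return NULL;
}

/* Loosely modelled on get_static_proc_offset: -1 for unknown names. */
int sketch_get_static_proc_offset(const char *name)
{
    const struct static_function *e = find_entry(name);

    return e ? (int) STATIC_FUNCTION_OFFSET(e) : -1;
}

/* Loosely modelled on get_static_proc_address: NULL for unknown names. */
static_proc sketch_get_static_proc_address(const char *name)
{
    const struct static_function *e = find_entry(name);

    return e ? STATIC_FUNCTION_ADDRESS(e) : NULL;
}
```

Because the lookup code touches entries only through the accessors, the
encoding of an entry can change per arch without touching the callers.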

I'd recommend reading http://people.redhat.com/drepper/dsohowto.pdf
too.

	Jakub




