OT : Approximate / fast math libraries ?

Wed Sep 5 00:32:45 UTC 2007

On Tue, 2007-09-04 at 18:17 -0500, Mike McCarty wrote:
> Matthew Saltzman wrote:
> > On Sat, 2007-09-01 at 09:41 -0500, Michael Hennebry wrote:
> > 
> > 
> >>How much precision do you need?  On what? Why?
> >>
> >>At least one person wrote a book on implementing the C standard library.
> >>It would probably be a better resource than Numerical Recipes.
> > 
> > 
> > That would be PJ Plauger's The Standard C Library, Prentice Hall, 1992
> > 0-13-131509-9.  Most of his math lib implementation is based on Cody and
> > Waite, Software Manual for the Elementary Functions, Prentice Hall, 1980
> > (sorry, he doesn't give the ISBN).  
> 
> My copy of Cody & Waite is ISBN 0-13-822064-6. The exact title
> is "Software Manual for the Elementary Functions".
> 
> I'm afraid I'm not very impressed with "Numerical Recipes".
> I bought a copy many years ago, and found some humorous lapses
> in the multi-precision FFT based math package. Things which
> proved that they don't know what they are doing, I'm afraid.
> Like subtracting one float from another repeatedly in a loop
> instead of using fmod().
> 
> I've had good results with Cody & Waite, though it's getting
> somewhat dated (1980) and some better stuff has come along,
> or so I've heard.
> 
> But, if the hardware is being used, then coding something with
> less accuracy is also going to be slower.
> 
> Mike
> -- 
I do have to agree with your assessment of their algorithms.  But having
a working algorithm means I only have to find the optimizations.  And
sometimes what seems archaic may be able to take advantage of compiler
and processor optimizations to achieve faster results.  A subtraction
costs about one cycle, whereas fmod takes multiple cycles to begin
with, plus call and return overhead.  If the number of iterations is
known to be small, repeated subtraction may be faster.  Only
some practical work with the algorithm will tell you the real results.

	The same is true of floating point operations vs. integers.  When
floats had to be calculated in software loops on an integer-only
processor, they were expensive and integers were faster.  Now, with
high speed floating point units, simple float operations are quite fast
if done inline.  Ditto for doubles.  Provided your blocking is set
right, it costs no more to calculate doubles than singles, except when
you store and retrieve them on a 32 bit machine.  On a 64 bit machine,
doubles may actually be faster, since you don't have to truncate or do
the store offsets (note that this depends on the hardware
implementation inside the processor and the microcode used for the
double and storage calculations).

	With some operations, the behavior of the RAM may be important,
because of memory cycle overhead.  Processors with high I/O bandwidth
work well as a single CPU, but take a hit in a dual CPU setup because
they cannot overlap cycles as effectively.  Dual CPUs may well be
faster, but it depends a lot on the algorithm and on the relative timing
of the calculation and results, some of which the programmer can
control directly and some of which is limited by the processor design.
Ditto for memory access.  DDR RAM can overlap I/O if the processor and
motherboard electronics can handle it.  So dual processors with certain
addressing setups can both run at full speed and overlapped, if the
other I/O functions can support it.

	When discussing algorithm timing for real time applications, only the
algorithm actually being used and its variants are relevant.  This is
one reason that benchmarks are basically useless in choosing a
processor for real time applications, unless you are using the
benchmark algorithm in your specific application.
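
A rough sketch of timing your own routine on the target machine, rather
than relying on a generic benchmark, might look like the following.
The names seconds_per_call and sample_kernel are hypothetical, and
sample_kernel is only a stand-in for whatever your application really
computes.  clock() measures CPU time at coarse resolution, so the
repetition count needs to be large for the average to mean anything.

```c
#include <assert.h>
#include <time.h>

/* Placeholder workload; substitute the routine your application uses. */
double sample_kernel(double x)
{
    return (x * 0.5 + 1.0) * x + 2.0;
}

/* Average CPU seconds per call of fn(arg), measured over reps calls. */
double seconds_per_call(double (*fn)(double), double arg, long reps)
{
    assert(fn != NULL && reps > 0);
    volatile double sink = 0.0;  /* stops the compiler discarding the calls */
    clock_t start = clock();
    for (long i = 0; i < reps; ++i)
        sink += fn(arg);
    clock_t stop = clock();
    (void)sink;
    return (double)(stop - start) / CLOCKS_PER_SEC / (double)reps;
}
```

Calling seconds_per_call(sample_kernel, 1.0, 10000000L) gives one data
point; the number is only meaningful for that routine, compiler, and
machine.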

	I have spent many, many hours optimizing code.  I have gradually come
to the conclusion that I can speed up my own code by 10 to 20 percent
with each optimization pass I put into it, along with some additional
benefit from the hardware advances that arrive in the meantime.  This
is especially true now that the hardware cycle is in the 13-14 month
timeframe.

	Sometimes just taking a different perspective on the problem will help.
I have chased various trig functions for some time, then figured out a
relative way to achieve the same effective results with less overhead or
sometimes in hardware that cost less than $40.00. Using a phase
comparator to find angular offsets is one example.

	Sometimes a good step back, and a look at the ultimate goal will help
as well.  You may find that the results are more about the value of the
peak than the angular offset, for example.  Or an FIR filter may be
faster than an FFT in some cases to find a specific frequency value.

There are more ways to tackle problems than most of us can imagine. That
is why we call upon the local wizards to help us.

Regards,
Les H