Extremely poor performance crunching random numbers under PIV-FC5

Fri May 19 15:38:27 UTC 2006

> Either have struct random_data randomdataState;
> and replace current uses of *randomdataState with randomdataState and
> current uses of randomdataState with &randomdataState, or initialize
> the pointer to an address of some struct random_data.

OK. It is now fixed and runs without errors, thanks to Jakub's and Andy's help.

Unfortunately, performance is still very poor on the FC5-PIV @ 3 GHz
system, and only on that one; the others (Opteron and PIV @ 2.4 GHz)
are fine.

Here is the full corrected source code, followed by the results:

### test-cpu-2.c ################################################
	#include <stdio.h>
	#include <stdlib.h>
	#include <math.h>
	#include <time.h>
	#include <string.h>
	#include <fcntl.h>

	#ifdef linux
	inline void randomize(struct random_data *randomdataState) {
	#else
	inline void randomize() {
	#endif

	#ifdef linux
		static int buf[32];
	#else
	#endif
		time_t seconds;

		time(&seconds);
		srand((unsigned int) seconds);

	#ifdef linux
	    /* Initialize the special random number generator */
	    srand48((unsigned int) seconds);

		memset(randomdataState, 0, sizeof(*randomdataState));
		initstate_r(seconds, (char *) buf, 128, randomdataState);
	#else
	#endif
	}

	/* int main(int argc, char ** argv) { */
	int main() {
	    int i, r, numero_ciclos, numero_ciclosM;
	    clock_t start, end;
	    char* buf;
	    struct random_data randomdataState;

	    /* Initialize the random number generator */
	    #ifdef linux
	    	randomize(&randomdataState);
	    #else
	    	randomize();
		#endif

	    start = clock();
	    /* Allocate 0.1 GB of memory */
	    buf=malloc(100*1024*1024);
	    end = clock();
	    printf("Reservado 0.1 Gb de memoria en %.3f s.\n", (double)(end -
start)/CLOCKS_PER_SEC);

	    start = clock();
	    /* Write to 0.1 GB of memory */
	    for(i=0; i<100*1024*1024; i++) {
	        buf[i]='0';
	    }
	    end = clock();
	    printf("Escritura sobre 0.1 Gb de memoria en %.3f s.\n",
(double)(end - start)/CLOCKS_PER_SEC);

	    numero_ciclos = 10000000; numero_ciclosM = numero_ciclos / 1E6;

	    start = clock();
	    for(i=0; i<numero_ciclos; i++) {
	        r = rand();
	    }
	    end = clock();
	    printf("%d M de rand() en %.3f s. (ejemplo.: %d)\n",
numero_ciclosM, (double)(end - start)/CLOCKS_PER_SEC, r);

	    start = clock();
	    for(i=0; i<numero_ciclos; i++) {
	        r = sqrt(i);
	    }
	    end = clock();
	    printf("%d M de sqrt(i) en %.3f s. (ejemplo.: %d)\n",
numero_ciclosM, (double)(end - start)/CLOCKS_PER_SEC, r);

	    start = clock();
	    for(i=0; i<numero_ciclos; i++) {
	        r = log(i);
	    }
	    end = clock();
	    printf("%d M de log(i) en %.3f s. (ejemplo.: %d)\n",
numero_ciclosM, (double)(end - start)/CLOCKS_PER_SEC, r);

	    start = clock();
	    for(i=0; i<numero_ciclos; i++) {
	        r = log10(i);
	    }
	    end = clock();
	    printf("%d M de log10(i) en %.3f s. (ejemplo.: %d)\n",
numero_ciclosM, (double)(end - start)/CLOCKS_PER_SEC, r);

	#ifdef linux
	    start = clock();
	    for(i=0; i<numero_ciclos; i++) {
	        r = random();
	    }
	    end = clock();
	    printf("LINUX: %d M de random() en %.3f s. (ejemplo.: %d)\n",
numero_ciclosM, (double)(end - start)/CLOCKS_PER_SEC, r);

	    start = clock();
	    for(i=0; i<numero_ciclos; i++) {
	    	random_r(&randomdataState, &r);
		}
	    end = clock();
	    printf("LINUX: %d M de random_r() en %.3f s. (ejemplo.: %d)\n",
numero_ciclosM, (double)(end - start)/CLOCKS_PER_SEC, r);

	    start = clock();
	    for(i=0; i<numero_ciclos; i++) {
	        r = lrand48();
	    }
	    end = clock();
	    printf("LINUX: %d M de lrand48() en %.3f s. (ejemplo.: %d)\n",
numero_ciclosM, (double)(end - start)/CLOCKS_PER_SEC, r);
	#else
	#endif

	    return (0);
	}
#################################################################

I compile the code this way on the FC5-PIV @ 3 GHz system:

	# gcc test-cpu-2.c -o randr-test-cpu-2 -lm -W -Wall -pedantic -O3

No errors, no warnings. Then I run it:

	# ./randr-test-cpu-2
	Reservado 0.1 Gb de memoria en 0.000 s.
	Escritura sobre 0.1 Gb de memoria en 0.240 s.
	10 M de rand() en 46.640 s. (ejemplo.: 1867229032)
	10 M de sqrt(i) en 0.170 s. (ejemplo.: 3162)
	10 M de log(i) en 0.810 s. (ejemplo.: 16)
	10 M de log10(i) en 0.810 s. (ejemplo.: 6)
	LINUX: 10 M de random() en 38.630 s. (ejemplo.: 19070960)
	LINUX: 10 M de random_r() en 19.390 s. (ejemplo.: 1867229032)
	LINUX: 10 M de lrand48() en 31.610 s. (ejemplo.: 1479483981)

random_r() is more than twice as fast as rand() (19.390 s vs. 46.640 s).

That is better, but still far from enough, because the results on the
PIV @ 2.4 GHz are these:

	# gcc test-cpu-2.c -o randr-test-cpu-2 -lm -W -Wall -pedantic -O3
	# ./randr-test-cpu-2
	Reservado 0.1 Gb de memoria en 0.000 s.
	Escritura sobre 0.1 Gb de memoria en 0.390 s.
	10 M de rand() en 0.410 s. (ejemplo.: 1589201696)
	10 M de sqrt(i) en 0.220 s. (ejemplo.: 3162)
	10 M de log(i) en 1.110 s. (ejemplo.: 16)
	10 M de log10(i) en 1.160 s. (ejemplo.: 6)
	LINUX: 10 M de random() en 0.330 s. (ejemplo.: 158326915)
	LINUX: 10 M de random_r() en 0.190 s. (ejemplo.: 1589201696)
	LINUX: 10 M de lrand48() en 0.580 s. (ejemplo.: 447468310)

On the Opteron system, the results are similar:

	# gcc test-cpu-2.c -o randr-test-cpu-2 -lm -W -Wall -pedantic -O3
	# ./randr-test-cpu-2
	Reservado 0.1 Gb de memoria en 0.000 s.
	Escritura sobre 0.1 Gb de memoria en 0.170 s.
	10 M de rand() en 0.160 s. (ejemplo.: 859117811)
	10 M de sqrt(i) en 0.120 s. (ejemplo.: 3162)
	10 M de log(i) en 1.060 s. (ejemplo.: 16)
	10 M de log10(i) en 1.220 s. (ejemplo.: 6)
	LINUX: 10 M de random() en 0.140 s. (ejemplo.: 304030109)
	LINUX: 10 M de random_r() en 0.080 s. (ejemplo.: 859117811)
	LINUX: 10 M de lrand48() en 0.140 s. (ejemplo.: 770314866)

Conclusions:

	1st.- The last oddity is resolved. It was due to a bug in my source code. Sorry!

	2nd.- The random_r() function is harder to use, but gives better
performance than the rand() function.

	3rd.- The random_r() function outputs exactly the same random
numbers as the rand() function. Look at the example values in the
tests. I don't know whether that is correct, reasonable or a possible
problem ...

	4th.- We still don't know the origin of the extremely low performance
of the random functions on the FC5-PIV @ 3 GHz system.

	5th.- We suspect that the problem may be due to an odd bug that
appears when combining the FC5 glibc (libc.so.6) version with certain PIV
CPUs.
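
Regarding the 3rd point: the Linux man page for rand() says that glibc's
rand() and srand() use the same generator as random() and srandom(), and
initstate_r() with a 128-byte buffer appears to select the same variant.
If so, identical output for the same seed would be expected rather than
a problem. Below is a small check, assuming glibc; same_sequence() is
only a hypothetical helper name, not part of any library:

```c
#define _GNU_SOURCE   /* exposes initstate_r()/random_r() in glibc headers */
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the first n values of rand() and
 * random_r() agree when both are seeded with the same value, else 0. */
int same_sequence(unsigned int seed, int n) {
    struct random_data state;
    static int32_t statebuf[32];      /* 128 bytes, as in test-cpu-2.c */
    int32_t r;
    int i;

    memset(&state, 0, sizeof state);  /* must be zeroed before initstate_r() */
    if (initstate_r(seed, (char *) statebuf, sizeof statebuf, &state) != 0)
        return 0;
    srand(seed);

    for (i = 0; i < n; i++) {
        /* Advance both generators one step and compare the values. */
        if (random_r(&state, &r) != 0 || rand() != r)
            return 0;
    }
    return 1;
}
```

Calling same_sequence(seed, 1000) from a small main() on each of the
three machines would show whether the two functions really track each
other there.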

Already tried, without success:

	1.- Enabling/disabling SELinux

	2.- Enabling/disabling swap

	3.- Linking against the static /usr/lib/libm.a

	4.- Using the alternative random function random_r()

	5.- Various tweaks under /proc/sys/kernel/

Next steps:

	1.- Andy told me to write my own bankhacker_random_r() function and
avoid glibc's (libc.so.6). I am going to work on it, but I don't think
it is easy.

	2.- Jakub said: "On PIV, atomic instructions are horribly
expensive. Either you have preloaded some library that called
pthread_create, or your CPU is unable to do the jump around lock
prefix trick quickly." That sounds very interesting, but I don't know how
to act on it ... any further explanation would be a great hint.
Thanks!

	3.- More ideas? Thank you, sincerely ...
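
As a starting point for item 1 above, here is a minimal sketch of what a
private bankhacker_random_r() could look like. It is only an assumption
about one possible design: it uses Marsaglia's xorshift32 step instead of
glibc's algorithm, so it will not reproduce the numbers from rand() or
random_r(), and its statistical quality is modest. The seeding helper
bankhacker_srandom_r() is a hypothetical name. The function touches only
the caller's state and takes no lock, which may also help to test
Jakub's remark about atomic instructions.

```c
#include <stdint.h>

/* Hypothetical seeding helper; xorshift32 must never hold a zero state. */
void bankhacker_srandom_r(uint32_t *state, uint32_t seed) {
    *state = (seed != 0) ? seed : 1u;
}

/* Sketch of bankhacker_random_r(): one xorshift32 step (shifts 13/17/5).
 * NOT glibc's algorithm, so it yields different numbers than random_r(). */
int bankhacker_random_r(uint32_t *state, int32_t *result) {
    uint32_t x = *state;

    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;

    /* Drop the low bit so the value fits in 0 .. 2^31-1, like random_r(). */
    *result = (int32_t) (x >> 1);
    return 0;
}
```

In test-cpu-2.c this would add one more timed loop whose body is
bankhacker_random_r(&state, &r), with state being a uint32_t seeded once
through bankhacker_srandom_r(). If that loop is fast on the FC5-PIV
system while the glibc loops stay slow, that would point at the library
rather than at the CPU.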