Optimization techniques
Carlo Cavazzoni, HPC department, CINECA
www.cineca.it

Modern node architecture

[Figure: node diagram with the CPU, the instruction (I) and data (D) caches (small and fast), RAM and disk.]

Cache hierarchy

register, L1, L2, L3, RAM

• L1: instruction and data
• Size: increases from L1 to Ln
• Speed: decreases from L1 to Ln

The CPU looks for data in L1. If the data are there, it is an L1 cache hit; if not, it is an L1 cache miss and the CPU looks in L2, and so on. Each cache miss carries a penalty measured in clock cycles.

Cache: direct mapped

[Figure: memory drawn as consecutive 32 Kbyte regions (addresses 0, 32 K, 64 K, ... 128 K), each mapped onto the same 32 Kbyte cache.]

Cache: set associative

[Figure: example of a 2-way cache, with two 16 Kbyte ways serving memory drawn as 32 Kbyte regions (addresses 0, 16 K, 32 K, 48 K, 64 K, ... 128 K).]

Replacement policies:
• Least Recently Used
• Round Robin
• Random

Loop optimization

Loop fusion

Locality in time.

    do i=1, n
      a(i) = b(i) + 1.0
    enddo
    do i=2, n
      c(i) = sqrt(a(i-1))
    enddo

If n is big enough, a is loaded, evicted and loaded again into cache.

    a(1) = b(1) + 1.0          ! must be set before the loop: c(2) needs a(1)
    do i=2, n
      a(i) = b(i) + 1.0
      c(i) = sqrt(a(i-1))
    enddo

The fused loop reuses the a(i) already loaded into cache.

Loop interchange

Locality in space.

    do i=1, n
      do j=1, n
        a(i,j) = b(i,j) + 1.0
      enddo
    enddo

This order loads elements into cache lines and uses only one of them before replacing them with new elements.

    do j=1, n
      do i=1, n
        a(i,j) = b(i,j) + 1.0
      enddo
    enddo

This order loads elements into cache and uses all of them before replacing them with new elements.

[Figure: memory layout of a and b (addresses 0x00, 0x01, 0x02, 0x03, ...) with the i and j traversal directions.]

Cache thrashing

    real, dimension (1024) :: a, b
    COMMON /my_com/ a, b
    do i=1, 1024
      a(i) = b(i) + 1.0
    enddo

Cache size = 4*1024, direct mapped, a and b contiguous: cache thrashing. Here a(i) and b(i) lie exactly one cache size apart, so they map to the same cache line and keep evicting each other. An array size that is a multiple of the cache size is a possible source of cache thrashing.

    integer offset = (linea_cache)/SIZE(REAL)   ! pseudocode; linea_cache = cache line size
    real, dimension (1024+offset) :: a, b
    COMMON /my_com/ a, b
    do i=1, 1024
      a(i) = b(i) + 1.0
    enddo

• Set associativity and padding help reduce thrashing problems.
• The offset shifts the matrices with respect to the
  cache: no more problems.
• Avoid powers of 2 for matrix dimensions.

Loop unrolling

    do j=1, n
      do i=1, (n-1)
        a(i,j) = b(i,j) + b(i+1,j) + 1.0
      enddo
    enddo

    do j=1, n
      do i=1, (n-1), 2
        a(i,j)   = b(i,j)   + b(i+1,j) + 1.0
        a(i+1,j) = b(i+1,j) + b(i+2,j) + 1.0
      enddo
    enddo

• Equivalent loops (when n-1 is even; otherwise the leftover iteration must be handled separately).
• Fewer jumps.
• Fewer dependencies.
• Fill pipelines and vector units.

Optimize with numerical libraries

• Less coding
• Tested and (almost) bug free
• Standard
• Efficient implementation
• Optimized

BLAS

Basic Linear Algebra Subprograms, Parallel BLAS and Basic Linear Algebra Communication Subprograms (www.netlib.org)

• Level 1 BLAS: vector-vector operations (serial only).
• Level 2 BLAS, PBLAS: vector-matrix operations (serial and parallel).
• Level 3 BLAS, PBLAS: matrix-matrix operations (serial and parallel).
• Level 1 and 2 BLACS: vector reduction, vector and matrix communications.

Lapack and Scalapack

Linear Algebra Package and Scalable Lapack (www.netlib.org)

• Matrix decomposition
• Solution of linear systems
• Eigenvalues and eigenvectors
• Linear least-squares solutions

MKL, ESSL, ACML, CUBLAS, MAGMA, PLASMA

MASS (IBM)

• Accelerated versions of SQRT, SIN, COS, EXP, LOG, etc.
• Scalar and vector versions

VML

• Equivalent to MASS (vector version only)
• For Intel processors
• Accelerated versions of: sqrt, rsqrt, exp, log, sin, cos, tan, atan, atan2, sinh, cosh, tanh, dnint, x**y

VML

    do i = 1, n
      r = r + sin( a(i) )
    end do

becomes

    call vdsin( n, a, y )
    do i = 1, n
      r = r + y( i )
    end do

General form of a call:

    CALL vml_subroutine( n, a, y )

BLAS: matrix multiplication

    DGEMM(transa, transb, l, n, m, alpha, a, lda, b, ldb, beta, c, ldc)

    c = alpha * op( a ) * op( b ) + beta * c

    real*8 a(lda,*), b(ldb,*), c(ldc,*)

Here c is l x n and m is the summed dimension. With alpha = beta = 1, the four combinations of transa and transb give:

    C_{ln} = \sum_m A_{lm} B_{mn} + C_{ln}
    C_{ln} = \sum_m A^T_{lm} B_{mn} + C_{ln}
    C_{ln} = \sum_m A_{lm} B^T_{mn} + C_{ln}
    C_{ln} = \sum_m A^T_{lm} B^T_{mn} + C_{ln}

Profiling with gprof

Compiler flag "-pg" or "-p" (depending on the compiler):

    gcc -pg -c mio.c

Running ./a.out writes the profile data to gmon.out, which gprof then reads.

gprof

    gcc -pg -funroll-loops -O2 dotprod.c -static
    [cineca@rfxoff1 Carlo]$ ./a.out
    d = 1000000.000000
    gprof

      %   cumulative   self              self     total
     time   seconds   seconds    calls  us/call  us/call  name
     68.57      0.05     0.05        2 23437.50 23437.50  set_vector
     31.43      0.07     0.02        1 21484.38 21484.38  dot_product
      0.00      0.07     0.00        1     0.00 68359.38  main

Profiling "by hand"

• Find the "hot spots" in your application.
• Use timing functions.

    CALL CPU_TIME( t3 )
    CALL critical_subroutine( …… )
    CALL CPU_TIME( t4 )
    PRINT *, (t4-t3)

    CALL SYSTEM_CLOCK(iclk1, count_rate=nclk)
    CALL critical_subroutine( …… )
    CALL SYSTEM_CLOCK(iclk2)
    PRINT *, REAL(iclk2-iclk1)/nclk

    t1 = cclock()
    CALL critical_subroutine( …… )
    t2 = cclock()
    PRINT *, (t2-t1)

Measure performance

    #include<stdio.h>
    #include<time.h>
    #include<ctype.h>
    #include<sys/types.h>
    #include<sys/time.h>

    double cclock_()
    {
      /* Returns the value of the system clock in seconds */
      struct timeval tmp;
      double sec;
      gettimeofday( &tmp, (struct timezone *)0 );
      sec = tmp.tv_sec + ((double)tmp.tv_usec)/1000000.0;
      return
        sec;
    }

Hand-written triple loop:

    PROGRAM test_dgemm
      IMPLICIT NONE
      INTEGER, PARAMETER :: dim = 1000
      REAL*8, ALLOCATABLE :: x(:,:), y(:,:), z(:,:)
      INTEGER :: i,j,k
      REAL*8 :: t1, t2
      REAL*8 :: cclock
      EXTERNAL :: cclock
      ALLOCATE( x( dim, dim ), y( dim, dim ) )
      ALLOCATE( z( dim, dim ) )
      y = 1.0d0
      z = 1.0d0 / DBLE( dim )
      x = 0.0d0
      t1 = cclock( )
      do j = 1, dim
        do i = 1, dim
          do k = 1, dim
            x(i,j) = x(i,j) + y(i,k) * z(k,j)
          end do
        end do
      end do
      t2 = cclock()
      write(*,*) ' Matrix sum = ', sum(x)
      write(*,*) ' time (seconds) ', t2-t1
      DEALLOCATE( x, y, z )
    END PROGRAM

Same product through the library:

    PROGRAM test_dgemm
      IMPLICIT NONE
      INTEGER, PARAMETER :: dim = 1000
      REAL*8, ALLOCATABLE :: x(:,:), y(:,:), z(:,:)
      INTEGER :: i,j,k
      REAL*8 :: t1, t2
      REAL*8 :: cclock
      EXTERNAL :: cclock
      ALLOCATE( x( dim, dim ), y( dim, dim ), z( dim, dim ) )
      y = 1.0d0
      z = 1.0d0 / DBLE( dim )
      x = 0.0d0
      t1 = cclock()
      ! x = matmul( y, z )
      call dgemm('N', 'N', dim, dim, dim, 1.0d0, y, &
                 dim, z, dim, 0.0d0, x, dim)
      t2 = cclock()
      write(*,*) ' Matrix sum = ', sum(x)
      write(*,*) ' time (seconds) ', t2-t1
      DEALLOCATE( x, y, z )
    END PROGRAM

ATLAS

Automatically Tuned Linear Algebra Software
http://sourceforge.net/
http://math-atlas.sourceforge.net/devel/
BLAS compatible

FFTW

http://www.fftw.org

Fast Fourier Transform:
• FFT complex to complex
• FFT complex to real
• Parallel FFT
• Multi-thread FFT

Advanced techniques

Case Study: matrix transposition

    do i=1,n
      do j=1,m
        y(j,i) = x(i,j)
      enddo
    enddo

Think Fortran: elements that are consecutive along a column are consecutive in memory.

What happens with the cache

Suppose a 2-way set-associative cache.

• y data are mapped in cache: y "allocates" the 1st way.
• For each value of x, a whole cache line must be loaded into cache.
  1) x "allocates" the 2nd way.
  2) Risk of thrashing.
  3) When the cache is full, the processor starts to overwrite cache lines.

What happens with the cache, cont.

Suppose a 2-way set-associative cache. As before, for each value of x a whole cache line must be loaded into cache.
y data are mapped in cache: y "allocates" the 1st way. We can see that a lot of data are loaded into the cache but are never used!

Block Algorithm

Suppose a 2-way set-associative cache.

• Load a block of data into cache
• Swap data in cache
• Write data back to memory

Solution: Block algorithm

The straightforward loop

    do i=1,n
      do j=1,m
        y(j,i) = x(i,j)
      enddo
    enddo

is replaced by a loop over blocks, with

    bsiz = block size
    nb = n / bsiz
    mb = m / bsiz

You also need to handle the cases MOD(n, bsiz) /= 0 or MOD(m, bsiz) /= 0.

    do ib = 1, nb
      ioff = (ib-1) * bsiz
      do jb = 1, mb
        joff = (jb-1) * bsiz
        ! load one block of x into the buffer
        do j = 1, bsiz
          do i = 1, bsiz
            buf(i,j) = x(i+ioff, j+joff)
          enddo
        enddo
        ! transpose the buffer in place
        do j = 1, bsiz
          do i = 1, j-1
            bswp = buf(i,j)
            buf(i,j) = buf(j,i)
            buf(j,i) = bswp
          enddo
        enddo
        ! write the transposed block to y
        do i=1,bsiz
          do j=1,bsiz
            y(j+joff, i+ioff) = buf(j,i)
          enddo
        enddo
      enddo
    enddo

Whole block transpose

The loop nest above transposes the nb x mb whole blocks. The leftover rows (MOD(n,bsiz) /= 0) are handled by:

    IF( min( 1, MOD(n,bsiz) ) .GT. 0 ) THEN
      ioff = nb * bsiz
      do jb = 1, mb
        joff = (jb-1) * bsiz
        do j = 1, bsiz
          do i = 1, MIN(bsiz, n-ioff)
            buf(i,j) = x(i+ioff, j+joff)
          enddo
        enddo
        do i = 1, MIN(bsiz, n-ioff)
          do j = 1, bsiz
            y(j+joff,i+ioff) = buf(i,j)
          enddo
        enddo
      enddo
    END IF

The leftover columns (MOD(m,bsiz) /= 0) are handled by:

    IF( MIN(1, MOD(m, bsiz)) .GT. 0 ) THEN
      joff = mb * bsiz
      do ib = 1, nb
        ioff = (ib-1) * bsiz
        do j = 1, MIN(bsiz, m-joff)
          do i = 1, bsiz
            buf(i,j) = x(i+ioff, j+joff)
          enddo
        enddo
        do i = 1, bsiz
          do j = 1, MIN(bsiz, m-joff)
            y(j+joff,i+ioff) = buf(i,j)
          enddo
        enddo
      enddo
    END IF

The corner block, present when both remainders are nonzero, is handled by:

    IF( MIN(1,MOD(n,bsiz)).GT.0 .AND. &
      & MIN(1,MOD(m,bsiz)).GT.0 ) THEN
      joff = mb * bsiz
      ioff = nb * bsiz
      do j = 1, MIN(bsiz, m-joff)
        do i = 1, MIN(bsiz, n-ioff)
          buf(i,j) = x(i+ioff, j+joff)
        enddo
      enddo
      do i = 1, MIN(bsiz, n-ioff)
        do j = 1, MIN(bsiz, m-joff)
          y(j+joff,i+ioff) = buf(i,j)
        enddo
      enddo
    END IF

Performance tuning and analysis: user codes

[Figure: Matrix transposition, matrix size 2048x2048. Execution time (0.00 to 0.50) versus block size (0 to 120), comparing the straightforward implementation with the block implementation.]

Parameter Dependent Code & Unrolling

    DO l=1,nphase
      IF(au1(l,l) /= 0.D0) THEN
        lp1=l+1
        div=1.D0/au1(l,l)
        DO lj=lp1,nphase
          au1(l,lj)=au1(l,lj)*div
        END DO
        bu1(l)=bu1(l)*div
        au1(l,l)=0.D0
        DO li=1,nphase
          amul=au1(li,l)
          DO lj=lp1,nphase
            au1(li,lj)=au1(li,lj)-amul*au1(l,lj)
          END DO
          bu1(li)=bu1(li)-amul*bu1(l)
        END DO
      END IF
    END DO

For a given set of (input) parameters, every loop can be eliminated, which optimizes both cache usage and the execution pipeline. Unrolled for nphase = 3:

    IF( a(1,1) /= 0.D0 ) THEN
      div = 1.D0 / a(1,1)
      a(1,2) = a(1,2) * div
      a(1,3) = a(1,3) * div
      b(1) = b(1) * div
      a(1,1) = 0.D0
      !li=2
      amul = a(2,1)
      a(2,2) = a(2,2) - amul * a(1,2)
      a(2,3) = a(2,3) - amul * a(1,3)
      b(2) = b(2) - amul * b(1)
      !li=3
      amul = a(3,1)
      a(3,2) = a(3,2) - amul * a(1,2)
      a(3,3) = a(3,3) - amul * a(1,3)
      b(3) = b(3) - amul * b(1)
    END IF

    IF( a(2,2) /= 0.D0 ) THEN
      div=1.D0/a(2,2)
      a(2,3)=a(2,3)*div
      b(2)=b(2)*div
      a(2,2)=0.D0
      !li=1
      amul=a(1,2)
      a(1,3)=a(1,3)-amul*a(2,3)
      b(1)=b(1)-amul*b(2)
      !li=3
      amul=a(3,2)
      a(3,3)=a(3,3)-amul*a(2,3)
      b(3)=b(3)-amul*b(2)
    END IF

    IF( a(3,3) /= 0.D0 ) THEN
      div=1.D0/a(3,3)
      b(3)=b(3)*div
      a(3,3)=0.D0
      !li=1
      amul=a(1,3)
      b(1)=b(1)-amul*b(3)
      !li=2
      amul=a(2,3)
      b(2)=b(2)-amul*b(3)
    END IF

Debugging (post mortem)

    gfortran -g hello_bug.f90

    program hello_bug
    real(kind=8) :: a( 10 )
    call clearv( a, 10000 )
    print *, SUM( a )
    end program

    subroutine clearv( a, n)
    real(kind=8) :: a( * )
    integer :: n
    integer :: i
    do i = 1, n
    a( n ) = 0.0          ! deliberate bug: writes a(10000), the array has 10 elements
    end do
    end subroutine

Remove the core size
limit:

    ulimit -c unlimited
    ./a.out
    Segmentation fault (core dumped)
    gdb ./a.out core

Debugging (in vivo)

    gfortran -g hello_bug.f90
    gdb ./a.out
    GNU gdb (GDB) Red Hat Enterprise Linux (7.0.1-32.el5)
    Copyright (C) 2009 Free Software Foundation, Inc.
    License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
    This is free software: you are free to change and redistribute it.
    There is NO WARRANTY, to the extent permitted by law. Type "show copying"
    and "show warranty" for details.
    This GDB was configured as "x86_64-redhat-linux-gnu".
    For bug reporting instructions, please see:
    <http://www.gnu.org/software/gdb/bugs/>...
    Reading symbols from /plx/userinternal/acv0/a.out...done.
    (gdb) run
    Starting program: /plx/userinternal/acv0/a.out
    warning: no loadable sections found in added symbol-file system-supplied DSO at 0x2aaaaaaab000

    Program received signal SIGSEGV, Segmentation fault.
    0x0000000000400833 in clearv (a=0x7fffffffe2f0, n=@0x400968) at hello_bug.f90:12
    12      a( n ) = 0.0

Link Fortran and C

cclock.c

    #include<sys/types.h>
    #include<sys/time.h>

    double cclock_()
    {
      /* Returns the value of the system clock in seconds */
      struct timeval tmp;
      double sec;
      gettimeofday( &tmp, (struct timezone *)0 );
      sec = tmp.tv_sec + ((double)tmp.tv_usec)/1000000.0;
      return sec;
    }

matmul_prof.f90

    program mat_mul
      integer, parameter :: n = 100
      real*8 :: a(n,n), b(n,n), c(n,n)
      real*8 :: t1, t2
      real*8 :: cclock
      external cclock
      a = 1.0d0
      b = 1.0d0
      t1 = cclock()
      call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n )
      t2 = cclock()
      write(*,*) SUM(c), t2-t1
    end program

Build steps:

    1) gcc -c cclock.c
    2) f95 matmul_prof.f90 cclock.o -L. \
           -lblas

Link Fortran and C

Link a C subroutine with a Fortran program.

rand.f90

    program rand
      real(kind=8) :: a
      external crand
      call crand( a )
      print *,'this is random ', a
    end program

crand.c

    #include<stdlib.h>
    #include<time.h>

    void crand_( double * x )
    {
      /* rand() is the generator whose range is given by RAND_MAX */
      (*x) = ( (double)rand()/(double)RAND_MAX );
    }

Fortran passes arguments by reference, C passes them by value, so the C routine receives pointers. Many Fortran compilers (gfortran, for example) also append an underscore to external names, hence crand_ here and setv_ below.

Link Fortran and C

Link a Fortran subroutine with a C program.

cvec.c

    #include<stdio.h>

    /* Fortran routine: every argument is passed by address */
    void setv_( double * a, double * d, int * n );

    int main()
    {
      int n;
      double a[10], d;
      n = 10;
      d = 1.0;
      setv_( a, &d, &n );
      printf("%lf\n", a[0]);
      return 0;
    }

vset.f90

    subroutine setv( a, d, n )
      real(kind=8) :: a( * )
      real(kind=8) :: d
      integer :: n
      integer :: i
      do i = 1, n
        a( i ) = d
      end do
    end subroutine

Make Command

If a code is large, or shares subroutines with other codes, it is useful to split the source into many files, possibly placed in different directories.

In F90 there are dependencies among program units: modules must be compiled before any program unit that uses them. There is therefore a well-defined order in which the source files must be compiled.

The make command avoids having to compile the sources by hand in the proper order. It can be programmed to do the job for you through a file containing instructions and directives.
By default the make command looks in the current directory for a file called Makefile or makefile.

A simple makefile

    # this is a comment within the makefile
    myprog.x : modules.o main.o
            f90 -o myprog.x modules.o main.o
    modules.o : modules.f90
            f90 -c modules.f90
    main.o : modules.o main.f90
            f90 -c main.f90

In the actual file, each command line must start with a tab character.

• The first line of each rule tells make what the target depends on: here, myprog.x depends on modules.o and main.o.
• make executes the link command only when modules.o and main.o have been built.
• To compile the code, the programmer issues from the console the command:

    > make

A less simple makefile

    # this is a comment within the makefile
    .SUFFIXES : .f90 .o     # needed where make does not already know the .f90 suffix
    myprog.x : modules.o main.o
            f90 -o myprog.x modules.o main.o
    main.o : modules.o
    .f90.o :
            f90 -c $<

• ".f90.o" is an implicit dependency: it states that every ".o" file depends on, and should be generated from, the corresponding ".f90" file.
• "$<" is a make macro; it is expanded to the proper ".f90" filename.

In the above example, make tries to build myprog.x but realizes that main.o and modules.o must be generated first. It then looks for a rule to make the ".o" files and finds that main.o depends on modules.o, so it builds an internal compilation hierarchy in which modules.o comes before main.o. Finally make finds the implicit rule and starts compiling the sources.
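
As a further illustration, the two build steps of the "Link Fortran and C" timing example (gcc -c cclock.c, then f95 matmul_prof.f90 cclock.o -L. -lblas) could be captured in a makefile along these lines. This is only a sketch: the target name mat_mul.x, the clean target and the FC and CC variables are illustrative choices that do not come from the slides, and each command line must begin with a tab character.

```make
# Sketch only: file names follow the "Link Fortran and C" example;
# mat_mul.x, clean, FC and CC are illustrative names, not from the slides.
FC = f95
CC = gcc

# Link step: needs both object files and the BLAS library
mat_mul.x : matmul_prof.o cclock.o
	$(FC) -o mat_mul.x matmul_prof.o cclock.o -L. -lblas

# Fortran main program
matmul_prof.o : matmul_prof.f90
	$(FC) -c matmul_prof.f90

# C timing routine
cclock.o : cclock.c
	$(CC) -c cclock.c

# Remove the generated files
clean :
	rm -f mat_mul.x matmul_prof.o cclock.o
```

With this file in place, typing make rebuilds only what is out of date: touching cclock.c, for instance, recompiles cclock.o and relinks mat_mul.x without recompiling the Fortran source.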