Optimization techniques
Carlo Cavazzoni, HPC department, CINECA
www.cineca.it
Modern node architecture
CPU
cache
I, D
Small & fast
Disk
www.cineca.it
RAM
Cache
Hierarchy register  L1  L2  L3  RAM
L1: Instruction and data
Size: L1  …  Ln
Speed: L1  …  Ln
CPU looks for data in L1, if it is there (L1 cache hit), if not (L1 cache miss)
and looks in L2 …
cache miss  penaly in terms of clock cycle
www.cineca.it
CACHE Direct Mapped
0
32 Kbyte
32 K
32 Kbyte
32 Kbyte
64 K
32 Kbyte
cache
128 K
32 Kbyte
www.cineca.it
Cache set associative
0
16 K
32 Kbyte
32 K
16 Kbyte
32 Kbyte
LastRecentlyUsed
64 K
Round Robin
Random
32 Kbyte
16 Kbyte
128 K
Es. 2-ways
cache
www.cineca.it
48 K
32 Kbyte
Loop optimization
www.cineca.it
Loop fusion
Locality in time
do i=1, n
a(i) = b(i) + 1.0
enddo
do i=2, n
c(i) = sqrt(a(i-1))
enddo
if n is big enough, a is
loaded,
offloaded
and
loaded again into cache
www.cineca.it
do i=2, n
a(i) = b(i) + 1.0
c(i) = sqrt(a(i-1))
enddo
a(1) = b(1) + 1.0
Reuse the a(i) loaded into
cache
Loop interchange
Locality in space
do i=1, n
do j=1, n
do j=1, n
a(i,j) = b(i,j) + 1.0
enddo
enddo
do i=1, n
a(i,j) = b(i,j) + 1.0
enddo
enddo
a
0x00
0x01
0x02
0x03
b
j
i
j
i
Load elements into cache lines and
use only one before replacing them
with new elements
www.cineca.it
Load elements into cache and use all
of them before replacing them with
new elements
Cache thrashing
real, dimension (1024) :: a,b
COMMON /my_com/ a, b
COMMON /my_com/ a, b
do i=1, 1024
a(i) = b(i) + 1.0
enddo
size cache = 4*1024, direct mapped,
a, b contiguous  cache thrashing
array size = multiple of cache size 
possible source of cache thrashing
www.cineca.it
integer offset =
(linea_cache)/SIZE(REAL)
real, dimension (1024+offset) :: a,b
do i=1, 1024
a(i) = b(i) + 1.0
enddo
Set Associative 
thrashing problems
Padding
help
reducing
offset  shift matrixes
w.r.t. cache  no more
problems
 Avoid power of 2 for
matrix dimensions
Loop unrolling
do j=1, n
do j=1, n
do i=1, (n-1)
a(i,j)= b(i,j)+b(i+1,j)+1.0
enddo
enddo
do i=1, (n-1), 2
a(i,j)
= b(i,j) +b(i+1,j)+1.0
a(i+1,j) = b(i+1,j)+b(i+2,j)+1.0
enddo
enddo
Equivalent Loops.
Fewer jump.
Fewer dependencies.
Fill pipelines and vector units.
www.cineca.it
Optimize with
numerical libraries
Less coding
Tested and (almost) bug free
Standard
Efficient implementation
Optimized
www.cineca.it
BLAS
Basic Linear Algebra Subprogram, Parallel BLAS and Basic
Linear Algebra Communication Subsystem (www.netlib.org)
• Level 1 BLAS: Vector-Vector operations
(scalar only).
• Level 2 BLAS, PBLAS: Vector-Matrix
operations (scalar and parallel).
• Level 3 BLAS, PBLAS: Matrix-Matrix
operation (scalar and parallel).
• Level 1 and 2 BLACS: vector reduction,
vector and matrics communications.
www.cineca.it
Lapack and Scalapack
Linear Algebra Package and Scalable Lapack
(www.netlib.org)




Matrix decomposition.
Solution of Linear Systems.
Eigenvalues and Eigenvetors
Linear Least Square solutions
www.cineca.it
MKL
ESSL
ACML
CUBLAS
MAGMA
PLASMA
www.cineca.it
MASS (IBM)
• Accelerated version of SQRT, SIN, COS,
EXP, LOG, ecc…
 Scalar and vector
www.cineca.it
VML
Equivalent to MASS (vector version only)
For Intel processors
Accelerated version of:
sqrt, rsqrt, exp, log, sin, cos, tan, atan, atan2,
sinh, cosh, tanh, dnint, x**y
www.cineca.it
VML
do i = 1, n
r = r + sin( a(i) )
end do
call vdsin( n, a, y )
do i = 1, n
r = r + y( i )
end do
CALL vml_subroutine( n, a, y )
www.cineca.it
BLAS
Matrix multiplication
DGEMM (transa, transb, l, n, m, alpha, a, lda, b, ldb, beta, c, ldc)
c = alpha op( a ) * op( b ) + beta c
real*8 a(lda,*), b(ldb,*), c(ldc,*)
Clm = n Aln Bnm +  Clm
Clm = n ATln Bnm +  Clm
Clm = n Aln BTnm +  Clm
Clm = n ATln BTnm +  Clm
www.cineca.it
Profileing with gprof
Compiler flag “-pg” or “-p” (depend on the compiler)
gcc -pg –c mio.c
./a.out
gmon.out
www.cineca.it
gprof
gcc -pg -funroll-loops –O2 dotprod.c -static
[cineca@rfxoff1 Carlo]$ ./a.out
d = 1000000.000000
gprof





%
cumulative
time
seconds
68.57
0.05
31.43
0.07
0.00
0.07
www.cineca.it
self
seconds
0.05
0.02
0.00
calls
2
1
1
self
total
us/call us/call
23437.50 23437.50
21484.38 21484.38
0.00 68359.38
name
set_vector
dot_product
main
Profileing “by hand”
CALL CPU_TIME( t3 )
Find “hot spot” in your application
CALL critical_subroutine( …… )
CALL CPU_TIME( t4 )
Use temporization functions
PRINT *, (t4-t3)
CALL SYSTEM_CLOCK(iclk1, count_rate=nclk)
t1 = cclock()
CALL critical_subroutine( …… )
CALL critical_subroutine( …… )
CALL SYSTEM_CLOCK(iclk2)
t2 = cclock()
PRINT *,REAL(iclk2-iclk1)/nclk
PRINT *, (t2-t1)
www.cineca.it
Mesure performances
#include<stdio.h>
#include<time.h>
#include<ctype.h>
#include<sys/types.h>
#include<sys/time.h>
double cclock_()
{
/* Restituisce il valore del CLOCK di sistema in secondi */
struct timeval tmp;
double sec;
gettimeofday( &tmp, (struct timezone *)0 );
sec = tmp.tv_sec + ((double)tmp.tv_usec)/1000000.0;
return sec;
}
www.cineca.it
PROGRAM test_dgemm
PROGRAM test_dgemm
IMPLICIT NONE
IMPLICIT NONE
INTEGER, PARAMETER :: dim = 1000
REAL*8, ALLOCATABLE :: x(:,:), y(:,:), z(:,:)
INTEGER :: i,j,k
REAL*8 :: t1, t2
REAL*8 :: cclock
EXTERNAL :: cclock
ALLOCATE( x( dim, dim ), y( dim, dim ) )
ALLOCATE( z( dim, dim ) )
y = 1.0d0
z = 1.0d0 / DBLE( dim )
x = 0.0d0
t1 = cclock( )
do j = 1, dim
do i = 1, dim
do k = 1, dim
x(i,j) = x(i,j) + y(i,k) * z(k,j)
end do
end do
end do
t2 = cclock()
write(*,*) ' Matrix sum = ', sum(x)
write(*,*) ' tempo (secondi) ', t2-t1
DEALLOCATE( x, y, z )
INTEGER, PARAMETER :: dim = 1000
REAL*8, ALLOCATABLE :: x(:,:), y(:,:), z(:,:)
INTEGER :: i,j,k
REAL*8 :: t1, t2
REAL*8 :: cclock
EXTERNAL :: cclock
ALLOCATE( x( dim, dim ), y( dim, dim ), z( dim, dim ) )
END PROGRAM
www.cineca.it
y = 1.0d0
z = 1.0d0 / DBLE( dim )
x = 0.0d0
t1 = cclock()
! x = matmul( y, z )
call dgemm('N', 'N', dim, dim, dim, 1.0d0, y,
c
dim, z, dim,0.0d0, x, dim)
t2 = cclock()
write(*,*) ' Matrix sum = ', sum(x)
write(*,*) ' tempo (secondi) ', t2-t1
DEALLOCATE( x, y, z )
END PROGRAM
ATLAS
Automatically Tuned Linear Algebra Software
http://sourceforge.net/
http://math-atlas.sourceforge.net/devel/
BLAS compatible
www.cineca.it
http://www.fftw.org
Fast Fourier Trasform
FFT complex to complex
FFT complex to real
Parallel FFT
Moulti-thread FFT
www.cineca.it
Advanced techniques
www.cineca.it
Case Study: matrix transposition
do i=1,n
do j=1,m
y(j,i) = x(i,j)
enddo
enddo
x
y
Think Fortran: Consecutive elements in memory
www.cineca.it
What happens with the cache
Suppose 2-way set associative
y
data mapped in cache
y “allocate” the 1st way
www.cineca.it
x
For each value of x I need to
load into cache a whole line.
1) X “allocate” the 2nd way.
2) Risk of thrashing
3) When the cache is full, the
proc. Start to overwrite cache
lines
What happens with the cache, cont.
Suppose 2-way set associative
As before for each value of
x I need to load into cache
a whole cache line.
x
y
data mapped in cache
y “allocate” the 1st way
We can see that a lot of data are loaded
into the cache but they are not used!
www.cineca.it
Block Algorithm
Suppose 2-way set associative
y
Load a block of data into
cache
Write data back to memory
Swap data in cache
www.cineca.it
x
Solution: Block algorithm
do i=1,n
do j=1,m
y(j,i) = x(i,j)
enddo
enddo
bsiz = block size
nb
= n / bsiz
mb
= m / bsiz
You need to handle:
MOD(n / bsiz) /= 0 OR
MOD(m / bsiz) /= 0
www.cineca.it
do ib = 1, nb
ioff = (ib-1) * bsiz
do jb = 1, mb
joff = (jb-1) * bsiz
do j = 1, bsiz
do i = 1, bsiz
buf(i,j) = x(i+ioff, j+joff)
enddo
enddo
do j = 1, bsiz
do i = 1, j-1
bswp = buf(i,j)
buf(i,j) = buf(j,i)
buf(j,i) = bswp
enddo
enddo
do i=1,bsiz
do j=1,bsiz
y(j+joff, i+ioff) = buf(j,i)
enddo
enddo
enddo
enddo
Whole block transpose
do ib = 1, nb
ioff = (ib-1) * bsiz
do jb = 1, mb
joff = (jb-1) * bsiz
IF( min( 1, MOD(n,bsiz) ) .GT. 0 ) THEN
do j = 1, bsiz
ioff = nb * bsiz
do i = 1, bsiz
do jb = 1, mb
buf(i,j) = x(i+ioff,j+joff)
joff = (jb-1) * bsiz
enddo
do j = 1, bsiz
enddo
do i = 1, MIN(bsiz, n-ioff)
do j = 1, bsiz
buf(i,j) = x(i+ioff, j+joff)
do i = 1, j-1
enddo
bswp = buf(i,j)
enddo
buf(i,j) = buf(j,i)
do i = 1, MIN(bsiz, n-ioff)
buf(j,i) = bswp
do j = 1, bsiz
enddo
y(j+joff,i+ioff) = buf(i,j)
enddo
enddo
do i=1,bsiz
enddo
do j=1,bsiz
enddo
y(j+joff,i+ioff) = buf(j,i)
END IF
enddo
enddo
enddo
enddo
2
1
www.cineca.it
IF( MIN(1, MOD(m, bsiz)) .GT. 0 ) THEN
joff = mb * bsiz
do ib = 1, nb
ioff = (ib-1) * bsiz
do j = 1, MIN(bsiz, m-joff)
do i = 1, bsiz
buf(i,j) = x(i+ioff, j+joff)
enddo
enddo
do i = 1, bsiz
do j = 1, MIN(bsiz, m-joff)
y(j+joff,i+ioff) = buf(i,j)
enddo
enddo
enddo
END IF
IF( MIN(1,MOD(n,bsiz)).GT.0 .AND. &
& MIN(1,MOD(m,bsiz)).GT.0 ) THEN
joff = mb * bsiz
ioff = nb * bsiz
do j = 1, MIN(bsiz, m-joff)
do i = 1, MIN(bsiz, n-ioff)
buf(i,j) = x(i+ioff, j+joff)
enddo
enddo
do i = 1, MIN(bsiz, n-ioff)
do j = 1, MIN(bsiz, m-joff)
y(j+joff,i+ioff) = buf(i,j)
enddo
enddo
END IF
3
Performance tuning and analysis: user codes
Matrix Trasposition
Matrix size: 2048x2048
Straightforward
implementation
0.50
0.45
Block
implementation
execution time
0.40
0.35
0.30
0.25
0.20
0.15
0.10
0.05
0.00
0
20
40
60
block size
www.cineca.it
80
100
120
Parameter Dependent Code & Unrolling
DO l=1,nphase
IF(au1(l,l) /= 0.D0) THEN
lp1=l+1
div=1.D0/au1(l,l)
DO lj=lp1,nphase
au1(l,lj)=au1(l,lj)*div
END DO
bu1(l)=bu1(l)*div
au1(l,l)=0.D0
DO li=1,nphase
amul=au1(li,l)
DO lj=lp1,nphase
au1(li,lj)=au1(li,lj)-amul*au1(l,lj)
END DO
bu1(li)=bu1(li)-amul*bu1(l)
END DO
END IF
END DO
per un dato set di parametri
(di input), riesco ad eliminare
ogni loop, ottimizzando cache
e pipe di esecuzione
www.cineca.it
IF( a(1,1) /= 0.D0 ) THEN
div = 1.D0 / a(1,1)
a(1,2) = a(1,2) * div
a(1,3) = a(1,3) * div
b(1)
= b(1)
* div
a(1,1) = 0.D0
!li=2
amul = a(2,1)
a(2,2) = a(2,2) - amul
a(2,3) = a(2,3) - amul
b(2)
= b(2)
- amul
!li=3
amul = a(3,1)
a(3,2) = a(3,2) - amul
a(3,3) = a(3,3) - amul
b(3)
= b(3)
- amul
END IF
* a(1,2)
* a(1,3)
* b(1)
* a(1,2)
* a(1,3)
* b(1)
IF( a(2,2) /= 0.D0 ) THEN
div=1.D0/a(2,2)
a(2,3)=a(2,3)*div
b(2)=b(2)*div
a(2,2)=0.D0
!li=1
amul=a(1,2)
a(1,3)=a(1,3)-amul*a(2,3)
b(1)=b(1)-amul*b(2)
!li=3
amul=a(3,2)
a(3,3)=a(3,3)-amul*a(2,3)
b(3)=b(3)-amul*b(2)
END IF
IF( a(3,3) /= 0.D0 ) THEN
div=1.D0/a(3,3)
b(3)=b(3)*div
a(3,3)=0.D0
!li=1
amul=a(1,3)
b(1)=b(1)-amul*b(3)
!li=2
amul=a(2,3)
b(2)=b(2)-amul*b(3)
END IF
Debugging (post mortem)
gfortran –g hello_bug.f90
program hello_bug
real(kind=8) :: a( 10 )
call clearv( a, 10000 )
print *, SUM( a )
end program
subroutine clearv( a, n)
real(kind=8) :: a( * )
integer :: n
integer :: i
do i = 1, n
a( n ) = 0.0
end do
end subroutine
www.cineca.it
Remove core size limit
ulimit –c unlimited
./a.out
Segmentation fault (core dumped)
gdb ./a.out core
Debugging (in vivo)
gfortran -g hello_bug.f90
gdb ./a.out
GNU gdb (GDB) Red Hat Enterprise Linux (7.0.1-32.el5)
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /plx/userinternal/acv0/a.out...done.
(gdb) run
Starting program: /plx/userinternal/acv0/a.out
warning: no loadable sections found in added symbol-file system-supplied DSO at 0x2aaaaaaab000
Program received signal SIGSEGV, Segmentation fault.
0x0000000000400833 in clearv (a=0x7fffffffe2f0, n=@0x400968) at hello_bug.f90:12
12
a( n ) = 0.0
www.cineca.it
Link
Fortran and C
#include<sys/types.h>
#include<sys/time.h>
double cclock_()
{
/* Restituisce il valore del CLOCK di sistema in secondi */
struct timeval tmp;
double sec;
gettimeofday( &tmp, (struct timezone *)0 );
sec = tmp.tv_sec + ((double)tmp.tv_usec)/1000000.0;
return sec;
}
1) gcc –c cclock.c
program mat_mul
integer, parameter :: n = 100
real*8 :: a(n,n), b(n,n), c(n,n)
real*8 :: t1, t2
real*8 :: cclock
external cclock
a = 1.0d0
b = 1.0d0
t1 = cclock()
call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n )
t2 = cclock()
write(*,*) SUM(c), t2-t1
end program
2) f95 matmul_prof.f90 cclock.o -L. -lblas
www.cineca.it
Link Fortran and C
Link a C subroutine with a Fortran program
rand.f90
program rand
real(kind=8) :: a
external crand
call crand( a )
print *,'this is random ', a
end program
crand.c
#include<stdlib.h>
#include<time.h>
void crand_( double * x )
{
(*x) = ( (double)random()/(double)RAND_MAX );
}
Fortran passes arguments by reference, C passes them by value
www.cineca.it
Link Fortran and C
Link a Fortran subroutine with a C program
cvec.c
#include<stdio.h>
int main()
{
int n;
double a[10], d;
n = 10;
d = 1.0;
setv_( a, &d, &n );
printf("%lf\n", a[0]);
}
www.cineca.it
vset.f90
subroutine setv( a, d, n )
real(kind=8) :: a( * )
real(kind=8) :: d
integer :: n
integer :: i
do i = 1, n
a( i ) = d
end do
end subroutine
Make Command
If a code is large and/or it shares subroutines with
other codes, it is useful to split the source in many
files that could be placed in different directories.
In F90 there are dependencies among program units,
i.e. modules must be compiled before than any other program units.
Therefore there is a well defined order for compiling source files
To avoid compiling by hands the sources in the proper order,
the make command could be used
www.cineca.it
Make Command
The make command can be programmed to do the job for you
using a file containing instruction and dircetive.
By default the make command looks in the present directory
for a file colled Makefile or makefile
www.cineca.it
A simple makefile
# this is a comment within the makefile
myprog.x : modules.o main.o
f90 –o myprog.x modules.o main.o
this tell to the make command
that myprog.x depend from
modules.o and main.o
modules.o : modules.f90
f90 –c modules.f90
main.o : modules.o main.f90
f90 –c main.f90
make execute the command only
when modules.o and main.o
have been built
to compile the code, from the console the programmer issue
the command:
> make
www.cineca.it
A less simple makefile
# this is a comment within the makefile
myprog.x : modules.o main.o
f90 –o myprog.x modules.o main.o
main.o : modules.o
.f90.o
f90 –c $<
this is an implicit dependency, it state that all
files “.o” depend and should be generated from
the corresponding “.f90” files
this is a make macro, and it is expandend with the proper
“.f90” filename
In the above example, make try to built myprog.x but it realizes that main.o and modules.o should be
generated first. Then it starts looking for a rule to make the “.o”, and it finds that main.o depend on
modules.o, and thern make build an internal hierarchy for compilation in which modules.o come before
main.o . Finally make finds the implicit rule and starts compiling the sources.
www.cineca.it