MLD2P4: a package of parallel algebraic multilevel preconditioners

Bologna, March 2008
Pasqua D’Ambra, Institute for High-Performance Computing
and Networking (ICAR-CNR), Naples Branch, Italy
joint work with
Daniela di Serafino, Second University of Naples
Salvatore Filippone, University of Rome “Tor-Vergata”
Overview

Motivations
  Background
  Objectives

MLD2P4: Multi-Level Domain Decomposition Parallel Preconditioners Package based on PSBLAS
  Algorithms and computational kernels
  Software architecture

Some Results & Applications
Pasqua D'Ambra - Bologna
March 2008
Background
Large-scale applications have to solve $Ax = b$
The linear system matrix is:
  real or complex, and square
  large and sparse
  distributed among parallel processors
Matrix dimensions and entries, conditioning, sparsity pattern and coupling among variables vary over the course of a simulation
Background (cont’d)
What is the best method/preconditioner?
  No absolute winner; experimentation is needed
  Reliable preconditioners require access to the complete matrix
  Parallel implementation is not trivial
Interfacing with application software is required
  Custom-made interfaces to parallel legacy codes
  Different interfaces for different preconditioners/solvers
Objectives
Designing and implementing a suite of algebraic preconditioners based on Linear Algebra kernels for parallel sparse matrix computations

Flexibility
  Different preconditioners through a single API
Portability & Efficiency
  Standard base software for serial kernels and data communications
Simplicity of usage
  Modern (OO) Fortran 95 features and auxiliary routines for smooth legacy code integration
MLD2P4
Multi-Level Domain Decomposition
Parallel Preconditioners Package based on PSBLAS
mld_prec_build(A,M,…)
  A, distributed sparse matrix (input)
  M, distributed sparse preconditioner (output)
mld_prec_apply(M,x,y,…)
  M, distributed sparse preconditioner (input)
  x,y, distributed vectors (input/output)
Available preconditioners
  Diagonal
  Block-Jacobi
  Additive Schwarz with arbitrary overlap
  Algebraic multi-level Schwarz
Base software
  PSBLAS: Parallel Sparse Basic Linear Algebra Subprograms
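The build/apply split can be illustrated with a minimal Python sketch. It is not the MLD2P4 Fortran interface: the names `prec_build` and `prec_apply`, the dense list-of-lists matrix and the choice of a diagonal preconditioner are assumptions made only to show why the two phases are kept separate.

```python
# Hypothetical sketch of the build/apply pattern with a diagonal (Jacobi)
# preconditioner. Names and data layout are illustrative, not MLD2P4's API.

def prec_build(A):
    """Build step: inspect A once and store what the apply step needs (1/a_ii)."""
    inv_diag = []
    for i, row in enumerate(A):
        if row[i] == 0:
            raise ValueError("zero diagonal entry in row %d" % i)
        inv_diag.append(1.0 / row[i])
    return {"kind": "diagonal", "inv_diag": inv_diag}


def prec_apply(M, x):
    """Apply step: return y = M^{-1} x using only the data stored in M."""
    return [d * xi for d, xi in zip(M["inv_diag"], x)]


A = [[4.0, 1.0, 0.0],
     [1.0, 2.0, 1.0],
     [0.0, 1.0, 5.0]]
M = prec_build(A)                      # done once per matrix
y = prec_apply(M, [8.0, 4.0, 10.0])    # done at every solver iteration
```

The build cost is paid once, while the apply step is what an iterative solver calls at each iteration, which is why the two are separate entry points.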
PSBLAS (Filippone et al., http://www.ce.uniroma2.it/psblas/)
Basic Linear Algebra Operations with Sparse Matrices on MIMD Architectures
Appl.
  Iterative Sparse Linear Solvers: CG, BiCG, CGS, BiCGSTAB, RGMRES, …
Kernels
  Parallel Sparse Matrix Operations: matrix-matrix products, matrix-vector products, …
  Parallel Sparse Matrix Management: allocate, build, update, … (F95)
Base sw
  BLACS: Basic Linear Algebra Communication Subprograms
  SBLAS (Duff et al.) (F77)
  MPI
MLD2P4 Design
Algorithms
Algebraic multi-level Schwarz preconditioners based on smoothed aggregation
  good trade-off between parallelism and convergence
  optimal scalability for symmetric positive-definite matrices
  algebraic framework allows general-purpose application
(1-lev) Schwarz: basic ingredients
Adjacency graph of A ($A$ is $n \times n$, with symmetric sparsity pattern)
  $G = (W, E)$, $W = \{1, 2, 3, \ldots, n\}$, $E = \{(i,j) : a_{ij} \neq 0\}$
0-overlap partition of W
  $W_i^0$, $i = 1, \ldots, m$, partition of $W$
δ-overlap partition of W
  $W_i^{\delta} \supseteq W_i^{\delta-1}$, $\quad W_i^{\delta} = W_i^{\delta-1} \cup \{\, j : \exists\, k \in W_i^{\delta-1} : (j,k) \in E \,\}$
[Figure: example with 9 vertices, showing the subsets $W_1^0$, $W_2^0$ and $W_1^1$]
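The recursive definition of the δ-overlap subsets can be sketched in a few lines of Python. The chain graph and the two starting subsets below are hypothetical and are not the example drawn on the slide; the function names are illustrative.

```python
# Sketch: grow a 0-overlap subset into a delta-overlap one by adding,
# delta times, every vertex adjacent to the current subset.

def adjacency(edges, n):
    """Adjacency lists of G = (W, E) for vertices 1..n (symmetric pattern)."""
    adj = {v: set() for v in range(1, n + 1)}
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)
    return adj


def overlap_subset(adj, w0, delta):
    """Return W_i^delta starting from the 0-overlap subset w0."""
    w = set(w0)
    for _ in range(delta):
        # Add the neighbours of the current subset; the union keeps
        # W^(d-1) contained in W^d.
        w |= {j for k in w for j in adj[k]}
    return w


# Hypothetical example: a chain 1-2-...-9 split into two subsets.
adj = adjacency([(v, v + 1) for v in range(1, 9)], 9)
W1_0, W2_0 = {1, 2, 3, 4}, {5, 6, 7, 8, 9}
W1_1 = overlap_subset(adj, W1_0, 1)
W2_1 = overlap_subset(adj, W2_0, 1)
```

With one layer of overlap each subset picks up exactly the vertices on the other side of the cut edge (4, 5), so the two subsets share vertices 4 and 5.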
AS: basic ingredients (cont’d)
Restriction/prolongation operators
  $R_i^{\delta} = [e_{j_1}, e_{j_2}, \ldots, e_{j_{n_i^{\delta}}}]^T$, $\quad j_k \in W_i^{\delta}$
  $P_i^{\delta} = (R_i^{\delta})^T$
Restriction of A
  $A_i^{\delta} = R_i^{\delta} A (R_i^{\delta})^T$
[Figure: sparsity pattern of the 9-vertex example matrix, with the blocks labelled A11 and A21 highlighted]
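Because the rows of the restriction operator are unit vectors, applying it amounts to index selection and the restricted matrix is a principal submatrix of A. A minimal Python sketch with dense lists follows; the 4 × 4 matrix, the index set and the function names are hypothetical.

```python
# Sketch: R selects the entries indexed by a subset W_i, R^T scatters them
# back, and R A R^T is the submatrix of A on those rows and columns.

def restrict_vector(x, idx):
    """R x: keep the entries of x listed in idx (0-based, sorted)."""
    return [x[j] for j in idx]


def prolong_vector(xi, idx, n):
    """R^T x_i: scatter the local entries back into a length-n vector."""
    x = [0.0] * n
    for local, j in enumerate(idx):
        x[j] = xi[local]
    return x


def restrict_matrix(A, idx):
    """R A R^T: principal submatrix of A on the index set idx."""
    return [[A[r][c] for c in idx] for r in idx]


# Hypothetical 4 x 4 tridiagonal matrix and subset {2, 3, 4} (0-based 1..3).
A = [[2.0, -1.0, 0.0, 0.0],
     [-1.0, 2.0, -1.0, 0.0],
     [0.0, -1.0, 2.0, -1.0],
     [0.0, 0.0, -1.0, 2.0]]
idx = [1, 2, 3]
A_i = restrict_matrix(A, idx)
```

No matrix product is needed to form the local matrix, which is why building it is cheap once the index sets are known.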
Coarse level correction: basic ingredients
Algebraic coarsening
  uncoupled aggregation
Smoothed prol./restr. operators
  $P_C = (I - D^{-1} A) P$, $\quad R_C = P_C^T$
  where $P : W_C \to W$, $\quad P_{ij} = 1$ if vertex $i \in$ aggregate $j$, $\; 0$ otherwise
Coarse-level matrix
  $A_C = P_C^T A P_C = R_C A R_C^T$
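The three ingredients above (tentative prolongator, smoothing, coarse matrix) can be sketched with dense lists in Python. The sketch assumes that D is the diagonal of A, which the slide does not state, and uses a hypothetical 4-vertex matrix with two hand-picked aggregates; it does not implement the uncoupled aggregation algorithm itself.

```python
# Sketch of smoothed aggregation with dense lists; D = diag(A) is assumed.

def matmul(X, Y):
    """Dense matrix-matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]


def transpose(X):
    return [list(col) for col in zip(*X)]


def tentative_prolongator(aggr, n_coarse):
    """P_ij = 1 if vertex i belongs to aggregate j, 0 otherwise."""
    return [[1.0 if aggr[i] == j else 0.0 for j in range(n_coarse)]
            for i in range(len(aggr))]


def smoothed_prolongator(A, P):
    """P_C = (I - D^{-1} A) P, with D the diagonal of A (assumption)."""
    n = len(A)
    S = [[(1.0 if i == j else 0.0) - A[i][j] / A[i][i] for j in range(n)]
         for i in range(n)]
    return matmul(S, P)


def coarse_matrix(A, P_C):
    """A_C = P_C^T A P_C, i.e. R_C A R_C^T with R_C = P_C^T."""
    return matmul(transpose(P_C), matmul(A, P_C))


# Hypothetical example: 1D Laplacian on 4 vertices, aggregates {0,1} and {2,3}.
A = [[2.0, -1.0, 0.0, 0.0],
     [-1.0, 2.0, -1.0, 0.0],
     [0.0, -1.0, 2.0, -1.0],
     [0.0, 0.0, -1.0, 2.0]]
aggr = [0, 0, 1, 1]          # aggr[i] = index of the aggregate of vertex i
P = tentative_prolongator(aggr, 2)
P_C = smoothed_prolongator(A, P)
A_C = coarse_matrix(A, P_C)
```

Smoothing spreads each column of P to the neighbouring vertices, so the columns of P_C overlap even though the aggregates do not.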
Multilevel-Schwarz
preconditioners & computational kernels
Example: 2-lev hybrid-post
  $M_{2L\,H2}^{-1} = M_{1L}^{-1} + (I - M_{1L}^{-1} A)\, M_C^{-1}$
Build
  build $A_i^{\delta}$
  build $A_C$
    aggregate: $P : W_C \to W$
    mat x mat: $R_C^T = P_C = (I - D^{-1} A) P$
    mat x mat: $A_C = R_C A R_C^T$
Apply
  restrict: $z = R_C v$
  solve: $A_C y = z$
  prol: $w_C = R_C^T y$
  mat x vec: $x = v - A w_C$
  AS prec: $w_{2L} = M_{1L}^{-1} x$
P. D’Ambra, D. di Serafino, S. Filippone, On the Development of PSBLAS-based Parallel Two-level Schwarz Preconditioners,
Applied Numerical Mathematics, 57, 2007.
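The application steps of the two-level hybrid-post preconditioner can be traced in a small Python sketch. Several choices here are assumptions and not part of the slides: a Jacobi sweep stands in for the one-level Schwarz preconditioner, the restriction is a plain (unsmoothed) aggregation operator, the coarse system is solved by dense Gaussian elimination, and the final sum of the coarse correction and the one-level term is read off the operator formula, where the coarse inverse is taken as $R_C^T A_C^{-1} R_C$.

```python
# Sketch of applying M^{-1} = M1L^{-1} + (I - M1L^{-1} A) MC^{-1} to a vector.

def matvec(X, v):
    return [sum(a * b for a, b in zip(row, v)) for row in X]


def transpose(X):
    return [list(col) for col in zip(*X)]


def matmul(X, Y):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)]
            for row in X]


def solve(M, b):
    """Dense Gaussian elimination with partial pivoting (small systems only)."""
    n = len(M)
    T = [row[:] + [bi] for row, bi in zip(M, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(T[r][c]))
        T[c], T[p] = T[p], T[c]
        for r in range(c + 1, n):
            f = T[r][c] / T[c][c]
            for k in range(c, n + 1):
                T[r][k] -= f * T[c][k]
    y = [0.0] * n
    for r in range(n - 1, -1, -1):
        y[r] = (T[r][n] - sum(T[r][k] * y[k] for k in range(r + 1, n))) / T[r][r]
    return y


def apply_hybrid_post(A, R_C, A_C, one_level, v):
    z = matvec(R_C, v)                         # restrict:  z = R_C v
    y = solve(A_C, z)                          # solve:     A_C y = z
    w_C = matvec(transpose(R_C), y)            # prol:      w_C = R_C^T y
    Aw = matvec(A, w_C)
    x = [vi - ai for vi, ai in zip(v, Aw)]     # mat x vec: x = v - A w_C
    w_1L = one_level(x)                        # one-level prec. applied to x
    return [c + s for c, s in zip(w_C, w_1L)]  # sum of the two terms


A = [[2.0, -1.0, 0.0, 0.0],
     [-1.0, 2.0, -1.0, 0.0],
     [0.0, -1.0, 2.0, -1.0],
     [0.0, 0.0, -1.0, 2.0]]
R_C = [[1.0, 1.0, 0.0, 0.0],
       [0.0, 0.0, 1.0, 1.0]]                   # hypothetical restriction
A_C = matmul(R_C, matmul(A, transpose(R_C)))   # A_C = R_C A R_C^T


def jacobi(x):
    """Stand-in for the one-level preconditioner: x_i / a_ii."""
    return [xi / A[i][i] for i, xi in enumerate(x)]


w = apply_hybrid_post(A, R_C, A_C, jacobi, [1.0, 0.0, 0.0, 1.0])
```

In this small case the input vector is A times the vector of ones, which lies in the range of the prolongation, so the coarse correction alone already recovers it and the one-level term receives a zero residual.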
MLD2P4 Design
Software Architecture
Appl.
  Parallel Preconditioners: BJA, ASM, RAS, ASH, ml-additive, ml-hybridpre, ml-hybridpost, ml-symmhybrid
Kernels
  Preconditioner Build: prolongation, restriction, coarse matrix, local sparse ILU and LU
  Preconditioner Application: distributed & serial coarse matrix solvers
Base sw
  PSBLAS 2.0: extended version of PSBLAS 1.0
Performance Results & Comparisons



Different test matrices from various sources
  thm matrices: thermal diffusion in solids
  kivap matrices: automotive engine design
  shipsec matrices: from UF sparse matrix collection
Experiments carried out on different Linux clusters
  64 Intel Itanium dual-processor nodes connected by Quadrics QSNetII Elan 4
  32 AMD Opteron dual-processor nodes connected by Myrinet
  8 AMD Opteron dual-processor nodes connected by InfiniBand
  8 Intel Itanium dual-processor nodes connected by Myrinet
  16 Intel Pentium IV nodes connected by Fast Ethernet
Comparison with up-to-date related work
  Trilinos-ML
A. Buttari, P. D’Ambra, D. di Serafino, S. Filippone, 2LEV-D2P4: a package of high-performance preconditioners for scientific and engineering applications, Applicable Algebra in Engineering, Communication and Computing, Vol. 18, 2007.
Experimental Setting
MLD2P4: right-preconditioned BiCGSTAB
  1-lev Restricted Additive Schwarz preconditioner with ILU(0) (RAS)
  2-lev hybrid Schwarz preconditioner, with RAS/ILU(0) as 1-lev prec.
    Distributed coarsest matrix: 4 sweeps of block Jacobi with ILU(0) (2LDI) or with UMFPACK (2LDU) on diagonal blocks
  3-lev hybrid Schwarz preconditioner, with RAS/ILU(0) as 1-lev prec.
    Distributed coarsest matrix: 4 sweeps of block Jacobi with ILU(0) (3LDI) or with UMFPACK (3LDU) on diagonal blocks
Stopping criterion: $\|r_k\| / \|r_0\| \le 10^{-6}$ or maxit
Unit right-hand side and null starting guess
Row-block distribution of matrices: # submatrices = # procs
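The stopping rule can be written as a short Python check, reading the criterion as a relative residual reduction to $10^{-6}$ or an iteration cap. The 2-norm, the non-strict inequality and the function names are assumptions; the slides do not specify them.

```python
import math


def norm2(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def converged(r_k, r_0, tol=1e-6):
    """True when ||r_k|| / ||r_0|| <= tol (2-norm assumed)."""
    return norm2(r_k) <= tol * norm2(r_0)


def should_stop(r_k, r_0, it, maxit, tol=1e-6):
    """Stop on convergence or when the iteration count reaches maxit."""
    return converged(r_k, r_0, tol) or it >= maxit


# With a unit right-hand side b and a null starting guess, r_0 = b - A*0 = b.
r_0 = [1.0, 1.0, 1.0, 1.0]
stop_early = should_stop([1e-3, 0.0, 0.0, 0.0], r_0, it=5, maxit=1000)
stop_conv = should_stop([1e-7, 0.0, 0.0, 0.0], r_0, it=5, maxit=1000)
stop_cap = should_stop([1e-3, 0.0, 0.0, 0.0], r_0, it=1000, maxit=1000)
```

Comparing `norm2(r_k)` against `tol * norm2(r_0)` avoids a division and therefore also behaves sensibly when the initial residual is zero.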
thm matrices: number of iterations
thm1: n = 600000, nnz = 2996800

OV=0
np    RAS   2LDI  2LDU  3LDI  3LDU
1     613   190   -     70    -
2     705   184   -     72    -
4     761   206   -     74    -
8     688   202   44    67    28
16    748   211   61    70    36
32    766   186   81    69    51
64    809   196   113   86    68

OV=1
np    RAS   2LDI  2LDU  3LDI  3LDU
1     613   190   -     70    -
2     923   183   -     76    -
4     684   178   -     63    -
8     937   191   34    62    27
16    688   172   57    68    33
32    714   181   74    65    45
64    720   180   107   77    62

64 Intel Itanium dual-processor nodes connected by QSNetII
thm matrices: execution times and speed-ups
(OV=1; best execution times: 3LDU)
64 Intel Itanium dual-processor nodes connected by QSNetII
Application test case
Large eddy simulation of incompressible turbulent flows in a bi-periodic channel
Main computational kernel: nonsymmetric and singular linear systems arising from elliptic PDE with Neumann b.c.
A. Aprovitola, P. D’Ambra, F. M. Denaro, D. di Serafino, S. Filippone, Application of Parallel Algebraic
Multilevel Domain Decomposition Preconditioners in Large-Eddy Simulations of Wall-bounded Turbulent
Flows: First Experiments, RT-ICAR-NA-2007-02, July 2007.
Experimental Setting
Reynolds number: 180
Computational grid: 140x32x45, non-uniform in the y direction; time step $10^{-4}$
Pressure linear system
n=201600
nnz=1398600
MLD2P4: right-preconditioned RGMRES(30)
  1-lev Restricted Additive Schwarz preconditioner with ILU(0) (RAS)
  2-lev/3-lev hybrid Schwarz preconditioner, with RAS/ILU(0) as 1-lev prec.
    Distributed coarse matrix: 4 sweeps of block Jacobi with ILU(0) (2LDI/3LDI) on diagonal blocks
Stopping criterion: $\|r_k\| / \|r_0\| \le 10^{-7}$ or maxit
General row-block distribution
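A row-block distribution assigns each process a contiguous range of matrix rows. The Python sketch below shows one simple choice, nearly equal block sizes; the slides do not say how the block boundaries were actually chosen, and the function name and the process count of 16 are hypothetical. The row count is the pressure system size n = 201600 given above.

```python
def row_blocks(n, nprocs):
    """Split rows 0..n-1 into nprocs contiguous blocks of nearly equal size."""
    base, extra = divmod(n, nprocs)
    blocks, start = [], 0
    for p in range(nprocs):
        # The first `extra` processes get one additional row each.
        size = base + (1 if p < extra else 0)
        blocks.append(range(start, start + size))
        start += size
    return blocks


blocks = row_blocks(201600, 16)      # hypothetical process count
sizes = [len(b) for b in blocks]
uneven = [len(b) for b in row_blocks(10, 4)]
```

A general row-block distribution may use unequal block sizes, for example to balance the number of nonzeros per process instead of the number of rows.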
LES of incompressible wall-bounded flow
SOR on 1 proc. = 9 sec.
SOR on 1 proc. = 8580 sec.
16 Intel Itanium dual-processor nodes connected by QSNetII
Work in progress

Package available on the web very soon

More sophisticated aggregation algorithms

Integration of preconditioners and solvers in large-scale applications