Academic year: 2021
(1)

Analysis and modeling of Computational Performance

(2)

Krzysztof Banaś Computational Performance 2

Numerical linear algebra

A set of algorithms for basic operations on vectors and matrices

the de facto standard BLAS (Basic Linear Algebra Subprograms) for low-level procedures

• unified interface for many operations

linear combinations, summations, scalar products, norms, matrix-vector products, matrix-matrix products, etc.

• four data types: single precision (s), double precision (d), and their complex counterparts (c, z)

• three levels

BLAS level 1 – vector operations, O(n) complexity

BLAS level 2 – matrix-vector operations, O(n²) complexity

BLAS level 3 – matrix-matrix operations, O(n³) complexity

• many implementations, usually provided by hardware vendors

the second de facto standard LAPACK (Linear Algebra PACKage)

• extension of BLAS for more advanced algorithms

solution of systems of linear equations, eigenvalue problems, and various matrix decompositions

(3)

The importance of linear algebra

(4)


Numerical linear algebra

the performance of BLAS level 1 and level 2 procedures is usually memory bound

arithmetic intensity O(1)

important example procedures:

• dot: α = xᵀx – scalar product (the general BLAS routine computes xᵀy)

• axpy: y = α x + y – covers multiplication of a vector by a scalar, vector addition, and linear combinations

• gemv: y = α A x + β y – general matrix-vector multiplication

the matrix-vector product is a basic building block of iterative solvers

BLAS level 3 procedures

some procedures with arithmetic intensity O(n)

important example procedure:

• gemm: C = α A B + β C – general matrix(-matrix) multiplication

matrix multiplication is a basic building block of many other algorithms, including direct solvers for dense systems of linear equations

(5)

Matrix-matrix multiplication

Matrix multiplication C = A B

2n³ operations for square matrices of size n

with an infinite number of registers (or an infinite cache):

• 3n² memory accesses (or 4n², if C is assumed to be read as well)

optimal arithmetic intensity

• s_pm = (2n³)/(3n²) ≈ 2n/3

naive implementation:

• s_pm = (2n³)/(l·n³ + ...) ≈ 2/l, with l memory accesses per innermost loop iteration

• not only a large number of memory accesses, but also accesses to b in the inner loop with stride n

very low performance

naive implementation that follows the mathematical notation:

c_ij = Σ_k a_ik b_kj

row-major storage: c(row, col) = c[row*n + col]

for(i=0;i<n;i++){
  for(j=0;j<n;j++){
    c[i*n+j]=0.0;
    for(k=0;k<n;k++){
      c[i*n+j] += a[i*n+k]*b[k*n+j];
    }
  }
}

(6)


Matrix-matrix multiplication

Common sense parallel implementation

no apparent errors

• initialization of c moved to a separate loop nest

• loop interchange (j and k) in the second loop nest

optimizations left to the compiler

still low arithmetic intensity

• optimizations required for the three main loops

#pragma omp parallel default(none) \
    shared(a,b,c,n) private(i,j,k)
{
  #pragma omp for
  for(i=0;i<n;i++){
    for(j=0;j<n;j++){
      c[i*n+j]=0.0;
    }
  }
  #pragma omp for
  for(i=0;i<n;i++){
    for(k=0;k<n;k++){
      for(j=0;j<n;j++){
        c[i*n+j] += a[i*n+k]*b[k*n+j];
      }
    }
  }
}

(7)

Matrix-matrix multiplication

Register blocking (with factors Bi, Bj, Bk)

the small inner loops can be manually unrolled

vector registers and vector operations can be used to implement inner loops – producing vector register reuse

for(i=0; i<n; i+=B){
  for(k=0; k<n; k+=B){
    for(j=0; j<n; j+=B){
      c[i*n+j] += a[i*n+k]*b[k*n+j] +
                  a[i*n+k+1]*b[(k+1)*n+j] +
                  a[i*n+k+2]*b[(k+2)*n+j] + ... ;
      c[i*n+j+1] += a[i*n+k]*b[k*n+j+1] + ... ;
      // etc.
    }
  }
}

for(ii=0; ii<n; ii+=Bi){
  for(kk=0; kk<n; kk+=Bk){
    for(jj=0; jj<n; jj+=Bj){
      register int i, j, k;
      for(i=ii; i<ii+Bi; i++){
        for(k=kk; k<kk+Bk; k++){
          for(j=jj; j<jj+Bj; j++){
            c[i*n+j] += a[i*n+k]*b[k*n+j];
          }
        }
      }
    }
  }
}

(8)


Matrix-matrix multiplication

Cache blocking

loops over matrix entries are split into several levels

• outer levels become loops over blocks

• the innermost loops, over entries inside blocks, can be further optimized, using e.g. register blocking etc.

const int bls=108; // to fit L2 cache
#pragma omp parallel for default(none) shared(a,b,c,n,bls) private(i,j,k)
for(i=0;i<n;i+=bls){
  for(j=0;j<n;j+=bls){
    for(k=0;k<n;k+=bls){
      register int ii,jj,kk;
      for(ii=i;ii<i+bls;ii+=...){
        for(kk=k;kk<k+bls;kk+=...){
          for(jj=j;jj<j+bls;jj+=...){
            c[ii*n+jj] += a[ii*n+kk]*b[kk*n+jj] + ...
          }
        }
      }
    }
  }
}

(9)

Matrix-matrix multiplication

One of many variants of cache blocking

each element of a block of C is reused n times

each element (row) of a block of A is reused BLS times

each element (column) of a block of B is reused BLS times

arithmetic intensity: s_pm = O((2n³)/(2n³/BLS)) = O(BLS)
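The memory-traffic count behind this estimate can be spelled out (a back-of-the-envelope argument, assuming square n×n matrices and BLS×BLS blocks that fit in cache):

```latex
% (n/BLS)^3 block-multiplications, each touching three BLS x BLS
% blocks, i.e. O(BLS^2) words of memory traffic:
%   memory accesses ~ (n/BLS)^3 * O(BLS^2) = O(n^3/BLS)
%   flops           = 2 n^3
\[
  s_{pm} \;=\; \frac{2n^3}{O\!\left(2n^3/\mathrm{BLS}\right)}
         \;=\; O(\mathrm{BLS})
\]
```

Larger cache blocks thus directly raise arithmetic intensity, up to the point where three blocks no longer fit in the targeted cache level.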

(10)


Matrix-matrix multiplication

Matrix multiplication is probably the most intensively optimized function in the world

classical optimizations are usually done by compilers

register blocking is used to reuse registers

vectorization allows for using vector operations and reusing vector registers

cache blocking increases data locality and allows data to be reused from cache memory

• there can be several levels of cache blocking, for the different levels of cache

more sophisticated optimizations aim at optimal usage of memory pages and TLB caches

the overall effect can be checked by

• inspection of the generated assembly

• measurements with profilers, possibly using hardware counters

after all, the only thing that matters is execution time
