(1)

of space-time problems

Day 3

Department of Computer Science

AGH University of Science and Technology, Kraków, Poland
home.agh.edu.pl/paszynsk

1 / 37

(2)

Isogeometric finite element method
Alternating Directions Implicit (ADI) method
Isogeometric L2 projections
Explicit dynamics
Example 1: Heat transfer
Installation of IGA-ADS solver
Parallel shared memory explicit dynamics
Example 2: Non-linear flow in heterogeneous media
Parallel distributed memory explicit dynamics
Example 3: Linear elasticity
Implicit dynamics
Example 4: Implicit heat transfer
Example 5: Pollution problem
Labs with implicit dynamics

2 / 37

(3)

Parallel version for distributed-memory machines
Message Passing Interface (MPI)

Maciej Woźniak, Marcin Łoś, Maciej Paszyński, Lisandro Dalcin, Victor Calo, Parallel Fast Isogeometric Solvers for Explicit Dynamics, Computing and Informatics 36(2) (2017)

Figure: Cube of processors. The right-hand sides are gathered and scattered into three faces to perform three sets of 1D solves with multiple RHS.

3 / 37

(4)

Message Passing Interface (MPI)

subroutine Gather(F, F_out, n, elems, stride, dims, shifts, comm, ierr)
  call Linearize(F_lin, F, n+1, stride)
  call mpi_gatherv(F_lin,            ! data to send by each processor 1, ..., Nx
    elems * stride,                  ! size of data assigned to this processor
    MPI_DOUBLE_PRECISION,            ! type of data to send
    F_out_lin,                       ! data received by processor 0 in a row
    dims,                            ! sizes of data from the entire row
    shifts,                          ! offsets of the slices from all processors in a row
    MPI_DOUBLE_PRECISION,            ! type of data to receive
    0, comm, ierr)                   ! root (0) and communicator along a row
  call Delinearize(F_out_lin, F_out, n+1, stride)
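The root of each row additionally needs the receive counts (dims) and displacements (shifts) passed to mpi_gatherv. Below is a minimal sketch of how such arrays could be filled when the slices may have different sizes; the routine name and argument layout are illustrative assumptions, not the actual IGA-ADS interface:

subroutine build_gatherv_layout(n_loc, stride, row_comm, dims, shifts, ierr)
  implicit none
  include 'mpif.h'
  integer, intent(in)  :: n_loc      ! local number of coefficients in this direction
  integer, intent(in)  :: stride     ! doubles stored per coefficient (RHS slices)
  integer, intent(in)  :: row_comm   ! communicator grouping one row of the cube
  integer, intent(out) :: dims(*)    ! receive counts for mpi_gatherv
  integer, intent(out) :: shifts(*)  ! receive displacements for mpi_gatherv
  integer, intent(out) :: ierr
  integer :: nproc, i, my_count

  call mpi_comm_size(row_comm, nproc, ierr)
  my_count = n_loc * stride
  ! every processor in the row reports how many doubles it will send
  call mpi_allgather(my_count, 1, MPI_INTEGER, dims, 1, MPI_INTEGER, row_comm, ierr)
  ! displacements are exclusive prefix sums of the counts
  shifts(1) = 0
  do i = 2, nproc
    shifts(i) = shifts(i-1) + dims(i-1)
  end do
end subroutine build_gatherv_layout

Here mpi_allgather lets every processor learn all slice sizes, and the displacements are the running sums of those sizes.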

4 / 37

(5)

Parallel version for distributed-memory machines (Fortran, MPI)
Maciej Woźniak, Marcin Łoś, Maciej Paszyński, Lisandro Dalcin, Victor Calo, Parallel Fast Isogeometric Solvers for Explicit Dynamics, Computing and Informatics 36(2) (2017)

Figure: Gather and scatter of data into the three faces of the cube of processors

5 / 37

(6)

Parallel version for distributed-memory machines (Fortran, MPI)
Maciej Woźniak, Marcin Łoś, Maciej Paszyński, Lisandro Dalcin, Victor Calo, Parallel Fast Isogeometric Solvers for Explicit Dynamics, Computing and Informatics 36(2) (2017)

Figure: Gather into the OYZ face

0. Integration
1a. Gather in every row of processors into the OYZ face
1b. Solve Ny × Nz 1D problems with multiple RHS
1c. Scatter the results onto the cube of processors
1d. Reorder the right-hand sides
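Step 1b above amounts to many one-dimensional banded solves that share one matrix. As a self-contained illustration of a banded solve with multiple right-hand sides, here is a small Fortran program using LAPACK's dgbsv on a tridiagonal stand-in matrix (the actual 1D B-spline mass matrix and the solver's own routines differ; this only shows the multiple-RHS pattern):

program banded_multi_rhs_demo
  implicit none
  integer, parameter :: n = 8, kl = 1, ku = 1, nrhs = 4, ldab = 2*kl + ku + 1
  double precision :: ab(ldab, n), b(n, nrhs)
  integer :: ipiv(n), info, i

  ! LAPACK band storage of a tridiagonal stand-in matrix (diagonal 4, off-diagonals 1)
  ab = 0.d0
  do i = 1, n
    ab(kl+ku+1, i) = 4.d0                 ! diagonal entries A(i,i)
    if (i < n) ab(kl+ku,   i+1) = 1.d0    ! superdiagonal entries A(i,i+1)
    if (i > 1) ab(kl+ku+2, i-1) = 1.d0    ! subdiagonal entries A(i,i-1)
  end do

  b = 1.d0                                ! nrhs right-hand sides solved at once

  ! one banded factorization reused for all right-hand sides
  call dgbsv(n, kl, ku, nrhs, ab, ldab, ipiv, b, n, info)
  print *, 'dgbsv info = ', info
  print *, 'first solution column: ', b(:, 1)
end program banded_multi_rhs_demo

Compile and link against LAPACK, e.g. gfortran banded_multi_rhs_demo.f90 -llapack.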

6 / 37

(7)

Parallel version for distributed-memory machines (Fortran, MPI)
Maciej Woźniak, Marcin Łoś, Maciej Paszyński, Lisandro Dalcin, Victor Calo, Parallel Fast Isogeometric Solvers for Explicit Dynamics, Computing and Informatics 36(2) (2017)

Figure: Gather into the OXZ face

7 / 37

(8)

Parallel version for distributed-memory machines (Fortran, MPI)
Maciej Woźniak, Marcin Łoś, Maciej Paszyński, Lisandro Dalcin, Victor Calo, Parallel Fast Isogeometric Solvers for Explicit Dynamics, Computing and Informatics 36(2) (2017)

Figure: Gather and scatter of data into the three faces of the cube of processors

2a. Gather in every row of processors into the OXZ face
2b. Solve Nx × Nz 1D problems with multiple RHS
2c. Scatter the results onto the cube of processors
2d. Reorder the right-hand sides

8 / 37

(9)

Parallel version for distributed-memory machines (Fortran, MPI)
Maciej Woźniak, Marcin Łoś, Maciej Paszyński, Lisandro Dalcin, Victor Calo, Parallel Fast Isogeometric Solvers for Explicit Dynamics, Computing and Informatics 36(2) (2017)

Figure: Gather into the OXY face

9 / 37

(10)

Parallel version for distributed-memory machines (Fortran, MPI)
Maciej Woźniak, Marcin Łoś, Maciej Paszyński, Lisandro Dalcin, Victor Calo, Parallel Fast Isogeometric Solvers for Explicit Dynamics, Computing and Informatics 36(2) (2017)

Figure: Gather and scatter of data into the three faces of the cube of processors

3a. Gather in every row of processors into the OXY face
3b. Solve Nx × Ny 1D problems with multiple RHS
3c. Scatter the results onto the cube of processors
3d. Reorder the right-hand sides

10 / 37

(11)

We have a mesh of $N_x \times N_y \times N_z$ elements, distributed over a cube of $c_x \times c_y \times c_z$ processors.
There are $(p_x+1)(p_y+1)(p_z+1)$ basis functions over each element.

Integration over all elements: $O\left(\frac{p_x^2 p_y^2 p_z^2 \, N_x N_y N_z}{c_x c_y c_z}\right)$

Solution (1D problems with multiple RHS): $O\left(\frac{(p_x^2 c_x + p_y^2 c_y + p_z^2 c_z)\, N_x N_y N_z}{c_x c_y c_z}\right)$

Gather: $O\left(\frac{(c_x + c_y + c_z)\, N_x N_y N_z}{c_x c_y c_z}\right)$

Reorder: $O\left(\frac{N_x N_y N_z \, p_x p_y p_z}{c_x c_y c_z}\right)$

Scatter: $O\left(\frac{(c_x + c_y + c_z)\, N_x N_y N_z}{c_x c_y c_z}\right)$

Assuming $N_x = N_y = N_z = N^{1/3}$, $p_x = p_y = p_z = p$ and $c_x = c_y = c_z = c^{1/3}$, we have the total cost
$$\left(\frac{p^6 N}{c} + \frac{p^2 N}{c^{2/3}} + \frac{p^3 N}{c}\right) t_{comp} + \frac{N}{c^{2/3}}\, t_{comm},$$
which implies the computational complexity $O\left(\frac{p^6 N}{c}\right)$ and the communication complexity $O\left(\frac{N}{c^{2/3}}\right)$.
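For example, the gather/scatter term simplifies under these assumptions as
$$\frac{(c_x + c_y + c_z)\, N_x N_y N_z}{c_x c_y c_z} = \frac{3\, c^{1/3}\, N}{c} = \frac{3N}{c^{2/3}},$$
and the solution term analogously becomes $3 p^2 N / c^{2/3}$, which is the $p^2 N / c^{2/3}$ contribution to $t_{comp}$ above.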



11 / 37

(12)

Numerical experiments: LONESTAR Linux cluster

Figure: Comparison of total experimental and theoretical execution times for N = 512, p = 3, and different numbers of processors $c^3 = 2^3, \ldots, 10^3 = 8, \ldots, 1000$. (Plot of time [s] versus $c^3$; series: theoretical total, experimental total.)

12 / 37

(13)

The integration time significantly dominates the solution time.

Figure: Comparison of experimental and theoretical integration times for N = 512, p = 3, and different numbers of processors $c^3 = 2^3, \ldots, 10^3 = 8, \ldots, 1000$. (Plot of time [s] versus $c^3$; series: theoretical integration, experimental integration.)

13 / 37

(14)

The solution time accounts for around 1 percent of the total solver time.

Figure: Comparison of experimental and theoretical solution times for N = 512, p = 3, and different numbers of processors $c^3 = 2^3, \ldots, 10^3 = 8, \ldots, 1000$. (Plot of time [s] versus $c^3$; series: theoretical solve, experimental solve 1-3.)

14 / 37

(15)

Figure: Comparison of experimental and theoretical gather times for N = 512, p = 3, and different numbers of processors $c^3 = 2^3, \ldots, 10^3 = 8, \ldots, 1000$. (Plot of time [s] versus $c^3$; series: theoretical gather 1, theoretical gather 2-3, experimental gather 1-3.)

15 / 37

(16)

Figure: Comparison of experimental and theoretical scatter times for N = 512, p = 3, and different numbers of processors $c^3 = 2^3, \ldots, 10^3 = 8, \ldots, 1000$. (Plot of time [s] versus $c^3$; series: theoretical scatter, experimental scatter 1-3.)

16 / 37

(17)

Numerical experiments: LONESTAR Linux cluster

Figure: Comparison of total experimental and theoretical execution times for N = 1024, p = 3, and different numbers of processors $c^3 = 2^3, \ldots, 10^3 = 8, \ldots, 1000$. (Plot of time [s] versus $c^3$; series: theoretical total, experimental total.)

17 / 37

(18)

The integration time significantly dominates the solution time.

Figure: Comparison of experimental and estimated integration times for N = 1024, p = 3, and different numbers of processors $c^3 = 2^3, \ldots, 10^3 = 8, \ldots, 1000$. (Plot of time [s] versus $c^3$; series: theoretical integration, experimental integration.)

18 / 37

(19)

The solution time accounts for around 1 percent of the total solver time.

Figure: Comparison of experimental and estimated solution times for N = 1024, p = 3, and different numbers of processors $c^3 = 2^3, \ldots, 10^3 = 8, \ldots, 1000$. (Plot of time [s] versus $c^3$; series: theoretical solve, experimental solve 1-3.)

19 / 37

(20)

Figure: Comparison of experimental and estimated gather times for N = 1024, p = 3, and different numbers of processors $c^3 = 2^3, \ldots, 10^3 = 8, \ldots, 1000$. (Plot of time [s] versus $c^3$; series: theoretical gather 1, theoretical gather 2-3, experimental gather 1-3.)

20 / 37

(21)

Figure: Comparison of experimental and estimated scatter times for N = 1024, p = 3, and different numbers of processors $c^3 = 2^3, \ldots, 10^3 = 8, \ldots, 1000$. (Plot of time [s] versus $c^3$; series: model scatter, experimental scatter 1-3.)

21 / 37

(22)

Time step size is limited by the Courant-Friedrichs-Lewy (CFL) condition
https://en.wikipedia.org/wiki/Courant-Friedrichs-Lewy_condition

$$\frac{u_x \Delta t}{\Delta x} + \frac{u_y \Delta t}{\Delta y} \le C_{max}$$

where $\Delta t$ is the time step size, $\Delta x = \Delta y = h$ is the element size, $u_x, u_y$ are the magnitudes of the velocity field, and $C_{max} = 1$ for an explicit method, so $(u_x + u_y)\, \Delta t \le h$.
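As a quick illustration (the numbers are hypothetical and not those of the simulation shown below): for a unit square discretized with 32 elements per direction and a field with $u_x = u_y = 1$,
$$h = \frac{1}{32}, \qquad \Delta t \le \frac{h}{u_x + u_y} = \frac{1}{64} \approx 1.6 \cdot 10^{-2},$$
and halving $h$ halves the admissible explicit time step.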

Figure: Lack of convergence for $\Delta t = 10^{-4}, 2 \cdot 10^{-4}, \ldots, 5 \cdot 10^{-4}$

Figure: Convergence for $\Delta t = 10^{-5}$ and smaller time steps

22 / 37

(23)

Click in the middle

23 / 37

(24)

Click in the middle

24 / 37

(25)

If we go for an implicit method, we need to integrate the matrix.
We have a mesh of $N_x \times N_y \times N_z$ elements.
nrdof $= (p_x+1)(p_y+1)(p_z+1)$ basis functions on each element
$(p_x, p_y, p_z)$ denote the B-spline orders in directions $x$, $y$ and $z$
ngx $= O(p_x)$, ngy $= O(p_y)$, ngz $= O(p_z)$ numbers of Gauss points

$$(B_{i,p_x} B_{j,p_y} B_{k,p_z},\, B_{l,p_x} B_{m,p_y} B_{n,p_z})_{L^2}
= \int B_{i,p_x}(x) B_{j,p_y}(y) B_{k,p_z}(z)\, B_{l,p_x}(x) B_{m,p_y}(y) B_{n,p_z}(z)\, dx\, dy\, dz$$
$$= \sum_{E} \int_{E} B_{i,p_x}(x) B_{j,p_y}(y) B_{k,p_z}(z)\, B_{l,p_x}(x) B_{m,p_y}(y) B_{n,p_z}(z)\, dx\, dy\, dz$$
$$= \sum_{E} \sum_{s=1}^{ngx} \sum_{t=1}^{ngy} \sum_{w=1}^{ngz} W_s W_t W_w\, B_{i,p_x}(x_s) B_{j,p_y}(y_t) B_{k,p_z}(z_w)\, B_{l,p_x}(x_s) B_{m,p_y}(y_t) B_{n,p_z}(z_w)$$

We construct the element matrices for each element, for all the basis functions that span over the element, namely
$B_{i,p_x}$, $i = 1, \ldots, p_x+1$, $B_{j,p_y}$, $j = 1, \ldots, p_y+1$, $B_{k,p_z}$, $k = 1, \ldots, p_z+1$ (trial functions) and
$B_{l,p_x}$, $l = 1, \ldots, p_x+1$, $B_{m,p_y}$, $m = 1, \ldots, p_y+1$, $B_{n,p_z}$, $n = 1, \ldots, p_z+1$ (test functions).
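Recall that, because the 3D basis functions are tensor products of 1D B-splines, each such entry factorizes into three one-dimensional integrals; this Kronecker-product structure is what the alternating-directions solver exploits:
$$(B_{i,p_x} B_{j,p_y} B_{k,p_z},\, B_{l,p_x} B_{m,p_y} B_{n,p_z})_{L^2}
= \left(\int B_{i,p_x}(x)\, B_{l,p_x}(x)\, dx\right)
  \left(\int B_{j,p_y}(y)\, B_{m,p_y}(y)\, dy\right)
  \left(\int B_{k,p_z}(z)\, B_{n,p_z}(z)\, dz\right)$$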

25 / 37

(26)

We have a mesh of $N_x \times N_y \times N_z$ elements
nrdof $= (p_x+1)(p_y+1)(p_z+1)$ basis functions on each element
$(p_x, p_y, p_z)$ denote the B-spline orders in directions $x$, $y$ and $z$
ngx $= O(p_x)$, ngy $= O(p_y)$, ngz $= O(p_z)$ numbers of Gauss points

for s = 1, ngx                 // Gauss integration points
  for t = 1, ngy
    for w = 1, ngz
      get Gauss point (xs, yt, zw) and weight W = Ws*Wt*Ww
      for l = 1, px + 1        // trial B-splines
        for m = 1, py + 1
          for n = 1, pz + 1
            for i = 1, px + 1  // test B-splines
              for j = 1, py + 1
                for k = 1, pz + 1
                  aggregate W * Bi,px(xs) * Bj,py(yt) * Bk,pz(zw) * Bl,px(xs) * Bm,py(yt) * Bn,pz(zw)

For $p_x = p_y = p_z = p$ the computational complexity is $O(p^9)$; for $p = 9$ this is of the order of $10^9$ operations per element.
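Counting the loop iterations per element confirms this estimate:
$$\underbrace{ngx \cdot ngy \cdot ngz}_{\approx (p+1)^3} \cdot \underbrace{(p_x+1)(p_y+1)(p_z+1)}_{\text{trial}} \cdot \underbrace{(p_x+1)(p_y+1)(p_z+1)}_{\text{test}} \approx (p+1)^9,$$
so for $p = 9$ the innermost aggregation executes $(9+1)^9 = 10^9$ times per element.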

26 / 37

(27)

F = 0.d0
do ex = 1,nelemx                       ! loop through elements
  do ey = 1,nelemy
    do ez = 1,nelemz
      J = Jx(ex)*Jy(ey)*Jz(ez)         ! element Jacobian
      do kx = 1,ngx                    ! loop through Gauss points
        do ky = 1,ngy
          do kz = 1,ngz
            W = Wx(kx)*Wy(ky)*Wz(kz)   ! Gauss weight
            value = fvalue(Xx(kx,ex),Xy(ky,ey),Xz(kz,ez))
            do ax = 0,px               ! trial B-splines
              do ay = 0,py
                do az = 0,pz
                  do bx = 0,px         ! test B-splines
                    do by = 0,py
                      do bz = 0,pz
                        call compute_index(ind,ax,ay,az,ex,ey,ez,nx,ny,nz)
                        call compute_index(ind1,bx,by,bz,ex,ey,ez,nx,ny,nz)
                        A(ind,ind1) = A(ind,ind1) + &
                          NNx(0,ax,kx,ex)*NNy(0,ay,ky,ey)*NNz(0,az,kz,ez)* &
                          NNx(0,bx,kx,ex)*NNy(0,by,ky,ey)*NNz(0,bz,kz,ez)*J*W*value

27 / 37

(28)

OpenMP = Open Multi-Processing

!$OMP PARALLEL DO
!$OMP& DEFAULT(SHARED)
!$OMP& FIRSTPRIVATE(iy,ex,ey,ez,J,kx,ky,kz,W,value,ax,ay,az,bx,by,bz,ind,ind1)
!$OMP& REDUCTION(+:nr_nonzeros)
do iy = 1,miy                          ! now a single loop over all elements
  call map_indexes(iy,ex,ey,ez)
  J = Jx(ex)*Jy(ey)*Jz(ez)             ! element Jacobian
  do kx = 1,ngx                        ! loop through Gauss points
    do ky = 1,ngy
      do kz = 1,ngz
        W = Wx(kx)*Wy(ky)*Wz(kz)       ! Gauss weight
        value = fvalue(Xx(kx,ex),Xy(ky,ey),Xz(kz,ez))
        do ax = 0,px                   ! trial B-splines along x,y,z
          do ay = 0,py
            do az = 0,pz
              do bx = 0,px             ! test B-splines along x,y,z
                do by = 0,py
                  do bz = 0,pz
                    call compute_index(ind,ax,ay,az,ex,ey,ez,nx,ny,nz)
                    call compute_index(ind1,bx,by,bz,ex,ey,ez,nx,ny,nz)
                    A(ind,ind1) = A(ind,ind1) + &
                      NNx(0,ax,kx,ex)*NNy(0,ay,ky,ey)*NNz(0,az,kz,ez)* &
                      NNx(0,bx,kx,ex)*NNy(0,by,ky,ey)*NNz(0,bz,kz,ez)*J*W*value
!$OMP END PARALLEL DO
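For a self-contained illustration of the clause pattern used above (PARALLEL DO with DEFAULT(SHARED), FIRSTPRIVATE work variables and a REDUCTION counter), here is a toy program; it only demonstrates the OpenMP clauses and is unrelated to the solver data structures:

program omp_clause_demo
  use omp_lib
  implicit none
  integer :: iy, nr_nonzeros
  double precision :: J, value
  double precision :: A(1000)

  A = 0.d0
  J = 0.d0
  value = 0.d0
  nr_nonzeros = 0
!$OMP PARALLEL DO DEFAULT(SHARED) &
!$OMP& FIRSTPRIVATE(J,value) REDUCTION(+:nr_nonzeros)
  do iy = 1, 1000
    J = 0.5d0                          ! stand-in for an element Jacobian
    value = J * dble(iy)               ! stand-in for an integrated contribution
    A(iy) = A(iy) + value              ! each iteration updates its own entry of A
    if (value /= 0.d0) nr_nonzeros = nr_nonzeros + 1
  end do
!$OMP END PARALLEL DO
  print *, 'threads available: ', omp_get_max_threads()
  print *, 'nonzero contributions: ', nr_nonzeros
end program omp_clause_demo

The parallel loop index iy is private automatically, J and value are per-thread firstprivate copies overwritten in every iteration, and nr_nonzeros is combined across threads by the reduction.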

28 / 37

(29)

Figure: Execution time of the parallel integration algorithm with increasing number of cores; 3D element with quadratic polynomials (p = 2, Ne = 41). (Plot of computation time versus cores, up to 40.)

29 / 37

(30)

Figure: Execution time of the parallel integration algorithm with increasing number of cores; 3D element with cubic polynomials (p = 3, Ne = 42). (Plot of computation time versus cores, up to 40.)

30 / 37

(31)

Figure: Execution time of the parallel integration algorithm with increasing number of cores; 3D element with quartic polynomials (p = 4, Ne = 43). (Plot of computation time versus cores, up to 40.)

31 / 37

(32)

Figure: Efficiency of the parallel integration algorithm with increasing number of cores; 3D element with quadratic polynomials (p = 2, Ne = 41). (Plot of efficiency [%] versus cores, up to 40.)

32 / 37

(33)

Figure: Efficiency of the parallel integration algorithm with increasing number of cores; 3D element with cubic polynomials (p = 3, Ne = 42). (Plot of efficiency [%] versus cores, up to 40.)

33 / 37

(34)

Figure: Efficiency of the parallel integration algorithm with increasing number of cores; 3D element with quartic polynomials (p = 4, Ne = 43). (Plot of efficiency [%] versus cores, up to 40.)

34 / 37

(35)

Figure: Speedup of the parallel integration algorithm with increasing number of cores; 3D element with quadratic polynomials (p = 2, Ne = 41). (Plot of speedup versus cores, up to 40.)

35 / 37

(36)

Figure: Speedup of the parallel integration algorithm with increasing number of cores; 3D element with cubic polynomials (p = 3, Ne = 42). (Plot of speedup versus cores, up to 40.)

36 / 37

(37)

Figure: Speedup of the parallel integration algorithm with increasing number of cores; 3D element with quartic polynomials (p = 4, Ne = 43). (Plot of speedup versus cores, up to 40.)

37 / 37
