
Faculty of Computer Science, Electronics and Telecommunications
Department of Computer Science

Doctoral Dissertation

Grzegorz Gurgul

Designing Distributed Alternating-Directions Solver Algorithms for Isogeometric Finite Element Method Computations

Supervisor: Professor Maciej Paszyński, Ph.D.
Co-supervisor: Bartosz Baliś, Ph.D., Associate Professor

AGH University of Science and Technology in Krakow
Faculty of Computer Science, Electronics and Telecommunications
Department of Computer Science

Dissertation for the Degree of Doctor of Philosophy

Grzegorz Gurgul

Designing Distributed Alternating-Directions Solvers for Isogeometric Finite Element Method Analysis

Supervisor: Professor Maciej Paszyński, Ph.D.
Co-supervisor: Bartosz Baliś, Ph.D., Associate Professor

I am sincerely grateful to Professor Maciej Paszyński for his dedication and patience displayed at every step of my research. I appreciate the support I have received from the Polish National Science Centre through grant no. DEC-2015/19/B/ST8/01064, Principal Investigator Professor Danuta Szeliga.

Dedicated to my parents, who continuously supported me throughout every milestone of my education.

Streszczenie

Isogeometric analysis, which aims to unify computer-aided design (CAD) and computer-aided engineering (CAE), has quickly become the subject of research of many scientists, who over recent years have developed many fast direct solvers. Particular attention should be paid to the alternating-directions solver algorithms, which decompose the matrix of the system of equations arising from the discretization of two- or three-dimensional computational problems into the Kronecker product of two or three multi-diagonal matrices. Thanks to the multi-diagonal structure of these matrices, alternating-directions solvers display, in the case of two-dimensional problems, linear computational complexity in the sequential version and logarithmic complexity in the parallel version, which naturally directed the attention of researchers towards implementations able to exploit the latter property. Accordingly, solvers were developed which exploit the banded structure of the one-dimensional matrices and a decomposition into basic computational tasks based on graph grammars, which facilitates the parallelization of computations. At the same time, the need was recognized for a framework which, even setting aside the obvious practical considerations, would enable other researchers, especially those not directly involved in designing high-performance solvers, to use isogeometric analysis to carry out practical simulations in their own area of scientific specialization. This seems justified also in light of the fact that isogeometric analysis requires knowledge from two historically distinct fields of science, which, as the creator of isogeometric analysis T. J. R. Hughes has observed, may constitute a significant obstacle to the wider popularization of the method. Existing solutions worked in shared memory environments or on Linux clusters with shared memory, which limited the possibility of solving larger problems, or ruled it out altogether when the problem was so large that it did not fit in the memory of the single computational node at the researcher's disposal. There are practical engineering problems which by their nature require non-stationary simulations on large computational meshes, resulting in the need to solve large systems of equations, sometimes counting billions of unknowns. These include, among others, simulations carried out on cartographic data, such as the recently popular simulations of air pollution propagation or weather prediction. To make matters worse, existing alternating-directions solver algorithms rely exclusively on explicit time discretization methods, which, while computationally cheaper than implicit methods, require denser time discretization as the computational mesh grows in order to remain stable. This dependency in the explicit version of the alternating-directions solver algorithm forces the computation of thousands of redundant time steps, which significantly lengthens the computation time of already demanding computational problems.

In this work, algorithms are proposed for decomposing non-stationary engineering problems into sequences of systems of equations with Kronecker product structure, which enabled the development of alternating-directions solver algorithms with linear computational complexity. The devised method of decomposing non-stationary engineering problems yields implicit time integration schemes, and thus allows longer time discretization steps to be used while preserving the unconditional stability of the computations, which can be exploited to considerably shorten simulation times. Moreover, the classic way of solving finite element method problems is based primarily on decomposing the domain into isolated computational sub-problems, which are then solved concurrently on dedicated computing clusters interfaced through the low-level Message Passing Interface (MPI) standard. Despite its still visible dominance in the scientific community, this standard is gradually losing popularity outside of it, given the existence of public computing clouds, which offer much greater flexibility in carrying out computations. This flexibility stems mainly from the ability to dynamically assemble suitably preconfigured computing clusters matched to the requirements of a given computational task, which are also characterized by predictable cost and reliability as well as native support for standard tools. Particularly attractive is also the prospect of using solutions from the rich and proven Hadoop ecosystem, which is ubiquitous in both public and private clouds, making it possible to run software anywhere, with the possibility of rapid integration with other solutions into advanced data processing pipelines. Also promising is the trend of opportunistically renting machines at prices many times lower, which benefits both the service provider, who wishes to limit the costs of maintaining surplus infrastructure, and the consumers of these services, who want to carry out computations as cheaply and quickly as possible.

This work constitutes the first attempt to use modern cloud computing architectures to solve a broad class of engineering problems with the isogeometric finite element method. As we attempt to demonstrate, these techniques can be successfully used to carry out isogeometric analysis on large computational meshes, based on the alternating-directions solvers with implicit time discretization developed here, as well as for rapid verification of the correctness of new numerical algorithms in this field. This is because the proposed decomposition guarantees a binary and balanced structure of the elimination tree, which allows a communication-efficient decomposition of the solver algorithm into computational tasks executed in a cloud environment. This solver is in practice a graph algorithm, while typical cloud clusters are optimized for the data-parallel (SIMD) model; therefore, achieving acceptable performance also requires an appropriate decentralization of the algorithm itself and its representation using graph formalisms matching this model, to which a substantial part of this work is devoted. In particular, a distributed alternating-directions solver was designed and implemented for the isogeometric analysis of non-stationary problems with explicit and implicit time discretization schemes, following the "Think Like a Vertex" (TLAV) graph-processing philosophy, which can be run on any Hadoop cluster through Yet Another Resource Negotiator (YARN). The work also describes numerical experiments carried out in one of the public computing clouds for three exemplary engineering problems: terrain projection, heat transport and flood simulation. A thorough analysis of the obtained results was performed, taking into account various scalability models and the details of individual simulations. These results indicate that further exploration of the application of public clouds and modern frameworks, both to isogeometric analysis and to other scientific endeavours, is warranted.


Abstract

Isogeometric analysis, which attempts to bridge the gap between Computer-Aided Design (CAD) and Computer-Aided Engineering (CAE), has quickly become the subject of research of many scientists, who have devised a number of fast direct solvers in recent years. Special attention should be paid to the alternating directions solver algorithms, which decompose the matrix generated in the process of the discretization of two- and three-dimensional computational problems into the Kronecker product of two or three multi-diagonal matrices. In the case of two-dimensional problems, the multi-diagonal structure yields linear computational complexity in the sequential version and logarithmic complexity in the parallel version, which encouraged scientists to explore the opportunities to leverage this property. Along these lines, solvers were built which exploit the banded structure of the one-dimensional matrices and decompose the problem into basic, easily parallelizable computational tasks based on graph grammars. This also created the market for new frameworks which, on top of the usual benefits, would allow other researchers, particularly those not directly involved in designing high-performance solvers, to utilize isogeometric analysis to conduct simulations in their own area of research. This seems reasonable as isogeometric analysis requires knowledge from two historically distinct fields, which, as the author of isogeometric analysis T. J. R. Hughes initially stated, might become the bottleneck in the wider adoption of the method. All of the existing solutions were designed to work only in the shared memory environment, which limited their utility in solving larger problems in a reasonable time frame, or ruled it out altogether for problems larger than the available memory of a single computational node the scientist could use. There are practical engineering problems which by their nature require simulations on large meshes that translate into large systems of linear equations, sometimes with billions of unknowns. These include, for example, simulations conducted on cartographic data, such as the currently popular air pollution propagation or weather forecasting simulations. Worse yet, existing alternating directions solver algorithms are based exclusively on explicit time discretization techniques which, while less computationally expensive than the implicit ones, mandate the use of denser time discretization as the computational mesh grows in order to retain stability. The existence of this relation in the explicit alternating-directions algorithms forces the computation of thousands of redundant time steps, which significantly increases the computation time for already computationally intensive problems.

In this dissertation, we propose algorithms for the decomposition of non-stationary engineering problems into sequences of systems of linear equations with Kronecker product structure, which enables the creation of alternating directions solvers with a linear computational cost. The devised method of decomposition of non-stationary engineering problems enables us to obtain implicit time integration schemes and, consequently, to use longer discretization time steps while retaining the unconditional stability of the method, which can be used to reduce computation times by a large margin. Moreover, the classic way of solving problems in the finite element method domain is predominantly based on decomposing the domain into isolated computational sub-problems, which are then solved concurrently on specialized computing clusters interfaced by the low-level Message Passing Interface (MPI) standard. Despite its continued dominance in the scientific community, this standard seems to have begun to lose momentum outside of it, especially in the era of public clouds, which offer much greater elasticity in running computations. This is the consequence of the new opportunities offered by dynamic provisioning of appropriately preconfigured computational clusters, which answer the requirements of a specific computational task and are characterized by predictable cost and reliability as well as native support for standard tools. Moreover, the prospect of utilizing solutions from the rich Hadoop ecosystem, which is omnipresent in both public and private computational clouds, is particularly attractive as it makes it possible to run software anywhere, with seamless integration with other solutions, creating advanced data processing pipelines. The opportunistic computation model, in which some machines can be temporarily used at a much lower price, is an interesting prospect which is beneficial to the cloud providers, who want to reduce the costs of maintaining redundant infrastructure, but also to their consumers, who want to conduct fast but also cost-effective computations.

This dissertation is the first attempt to utilize modern cloud computing architectures to solve a wide array of engineering problems with the isogeometric finite element method. As we attempt to prove in this work, these techniques can be used to conduct modern isogeometric analysis, as well as for rapid verification of the correctness of new numerical algorithms from this area. This is the case because the proposed way of decomposition guarantees a binary and balanced structure of the elimination tree, which allows a communication-efficient decomposition of the solver algorithm into computational tasks executed in a cloud environment. We argue that, as the analyzed solver is a graph algorithm and typical cloud clusters are optimized towards the data-parallel programming model Single Instruction, Multiple Data (SIMD), it is vital – for achieving acceptable performance – to decentralize the algorithm itself using graph formalisms which match this model, to which an essential part of this work is dedicated. In particular, we designed and implemented a distributed alternating directions solver for isogeometric analysis of non-stationary problems with explicit and implicit time integration schemes, compliant with the "Think Like a Vertex" (TLAV) graph-processing philosophy, which may be run on any Hadoop cluster through Yet Another Resource Negotiator (YARN). We run numerical experiments in one of the popular public clouds for three exemplary engineering problems: terrain projection, heat transfer and flood simulations. We evaluate the results in terms of various scalability models and the details of individual run configurations. These results seem to indicate the validity of further exploration of applying public clouds and their modern frameworks to isogeometric analysis and other scientific use cases.


Table of contents

1. Introduction
1.1. Motivation
1.2. The purpose of the dissertation
1.2.1. Main thesis
1.2.2. Structure of this study
1.3. State of the art
1.4. Summary of open problems
1.5. Major scientific findings
2. Isogeometric Alternating Directions Implicit Algorithm (IGA-ADI)
2.1. Graph-grammar multi-frontal solver
2.1.1. Initialization phase
2.1.2. Factorization phase
2.1.3. Backwards substitution phase
3. IGA-ADI as an object-oriented shared memory framework
3.1. Alternating Directions Implicit Solver
3.2. Alternating Directions Solver
3.3. Graph-grammar Multi-frontal Solver
4. IGA-ADI as a vertex-centric data-parallel graph algorithm
4.1. Orchestration – running centralized algorithm on distributed data
4.1.1. Imperative approach using distributed collections
4.1.2. Declarative view using functional workflow
4.2. Choreography – running distributed algorithm on local data
4.2.1. Partitioning
4.2.2. Coordination
4.2.3. Communication
4.2.4. Execution model
4.2.5. Other considerations and summary
4.3. Pregel graph processing framework
4.4. IGA-ADI as a Pregel graph algorithm
5. Implementing IGA-ADI on top of vertex-centric cloud-computing frameworks
5.1. The architecture of Giraph framework
5.3. Vertex-centric model of the IGA-ADI distributed solver
5.4. Running the solver
5.5. Necessary changes in Giraph
6. Numerical results
6.1. Simulated problems
6.1.1. Terrain projection
6.1.2. Heat transfer
6.1.3. Cahn-Hilliard equations
6.1.4. Fluid dynamics
6.2. Scalability of the distributed memory IGA-ADI solver
6.2.1. Serial scalability
6.2.2. Strong scalability
6.2.3. Weak scalability
6.2.4. Universal Scalability Law
6.2.5. Karp-Flatt metric
6.3. Scalability of the shared memory IGA-ADI solver
6.4. Imbalances in the workload
6.5. Other results and summary
7. Conclusions and future research
Appendix A. Unconditional stability of the implicit scheme
A.1. Direction splitting for Laplace operator
A.2. Stability analysis for the heat transfer problem
A.3. Explicit Euler scheme
A.4. Implicit scheme
A.5. Lemmas
Abbreviations
List of figures
List of tables

Chapter 1. Introduction

The Isogeometric Finite Element Method (IGA-FEM) [19] seems to have provided a long-awaited solution to the costly dichotomy between the Computer-Aided Engineering (CAE) and Computer-Aided Design (CAD) aspects of any modern manufacturing process [20]. Some of its features, particularly higher continuity [74], opened many new and exciting research opportunities for solving problems which involve higher-order differential operators. These include a wide range of important engineering problems, ranging from deformable shell theory [6] and phase field modeling [23] to phase separation simulations with Cahn-Hilliard [35] or Navier-Stokes-Korteweg higher-order models [37]. It is also used to solve nonlinear problems, such as wind turbine aerodynamics [55], incompressible hyper-elasticity [25], turbulent flow [12], drug transport in arteries [54], blood flow [10], propagation of elastic waves [75], nonlinear flow in heterogeneous media [128] or tumor growth modeling [74]. Many of these problems produce large systems of linear equations which contain integrals that typically require considerable computational cost to initialize, let alone solve. For this kind of large, non-stationary simulations, the Multi-Frontal Solver (MFS) [26] is usually considered too expensive to apply and iterative solvers are preferred. One such solver is the Generalized Minimal RESidual (GMRES) solver [109], which is interfaced by e.g. the Framework for high-performance Isogeometric Analysis (PetIGA) [20]. These solvers require several matrix-vector multiplications in every time step, and their efficiency depends on the spectral properties of the matrix. Recently, the Alternating Directions Solver (ADS) idea was rediscovered in the context of IGA-FEM [29, 30] and used for computing isogeometric L2 projections on regular patches of elements [29] or as a typical preconditioner for non-regular geometries [30]. This is because these methods are known to provide linear O(N) computational complexity for 2-dimensional problems in serial execution and logarithmic O(log(N)) complexity when executed on a perfectly parallel machine, as the factorization can be done only once, for all right-hand sides. The latter property has encouraged many researchers to devise various implementations of these solvers on shared memory multi-core platforms [76] or clusters powered by the Message Passing Interface (MPI) [128].

1.1. Motivation

Cloud computing has become the undisputed and widely adopted answer to many of the most important challenges of the modern software industry [33, 49, 126]. It has revolutionized the way software is designed and run, primarily by taking away the incidental complexities of managing the underlying infrastructure and providing elastic scalability where resources are always available and paid for in a pay-per-use model [85]. Simultaneously with this profound change, a number of essential cloud-computing techniques like MapReduce [22] and frameworks like Hadoop [116] have been proposed to leverage what the cloud already had to offer and extend its horizon even beyond that capacity. This synergy seems to be pushing the boundaries of what a cloud can be used for into the area which has been historically dominated by High-Performance Computing (HPC), albeit this is still a relatively unexplored area [49, 113], especially given that the idea was first proposed more than a decade ago, when the first public clouds were being born [122]. It has also brought exciting prospects to the way scientific simulations can be scaled out and, by doing so, challenged the omnipresent but consequently ossified MPI standard, which began to fall short of some of the more important capabilities the clouds had started to offer [91]. For this reason, many of the distributed-memory computations historically bound to MPI were moved to cloud-native data-parallel frameworks such as Hadoop and Spark [134] in the hope of improving the efficiency of working with the excessively large data sets present in fields such as genomics or astronomy [93, 126]. The widespread adoption of clouds revealed that simulation efficiency is not just a function of solver performance, but also of factors like simplicity of implementation, ease of integration with other software, reliability of execution, portability between clusters and, perhaps most importantly, flexibility of scheduling as an outcome of rapid provisioning of abundant resources [91, 106, 108]. All this leads to a lower turnaround time, even for many compute-intensive workloads for which the actual computation times are often significantly longer than in classic HPC clusters, which were designed to solve such problems [5, 81].

The primary motivation behind this study is to explore the possibility of running large-scale IGA-FEM simulations using logarithmic computational cost alternating direction solvers on top of modern data-parallel cloud computing frameworks hosted as managed services offered by leading public cloud providers. The motivation to create a distributed IGA-FEM solver stems from the fact that while there are a number of successful shared memory implementations of the Isogeometric Alternating Direction Solver (IGA-ADS), their applicability is limited to medium-sized problems due to the finite amount of memory and the number of cores available on a single shared memory computing node. A distributed solver would be able to solve problems of a size beyond shared memory solvers' capabilities. In addition to the horizontal scalability objective, required for solving a certain group of problems [14, 48, 106, 132], there are also practical reasons which apply to most simulations. A typical solver hardly ever works in isolation to provide tangible value but rather is part of a bigger simulation workload, which often includes various data processing steps both on the input side, where raw data is filtered to be compatible with the solver, and on the output side, where the results are evaluated and converted to a format recognizable by other software or the human eye. Many of these steps involve – at least to a certain degree – big-data analytics, for which a multitude of proven solutions exist in the cloud [47]. Furthermore, all leading cloud providers have started to offer an opportunistic computation model based on their unused commodity hardware, which could be the long-awaited solution to running large-scale computations at minimal cost [96]. With the ever-increasing appetite for elastic scalability, we can expect that the cloud providers will need to maintain more and more unused infrastructure, which they will be eager to share at a much lower price to cover some of the costs of its maintenance [102, 117]. While typical HPC clusters interfaced by MPI are – and will continue to be in the foreseeable future – significantly faster in executing compute-intensive scientific workloads [32, 113], the application of the proposed logarithmic computational cost Isogeometric Alternating Directions Implicit Algorithm (IGA-ADI) makes this performance gap narrow enough for IGA-FEM simulations to accept, or even prefer, the potentially lower turnaround time. All this, however, requires a radical shift in how the IGA-ADI algorithm is represented. The centralized perspective on the processing of distributed graph data is known to be difficult to scale and parallelize [14, 45, 79]. We can leverage the proposed IGA-ADI decomposition to construct a communication-efficient data-driven parallelization technique. The computational complexity of the alternating directions solver executed on an ideal parallel machine for non-stationary simulations is logarithmic in every time step, and the number of time steps does not need to grow with the size of the problem. This is the benefit of decomposing the original problem into a sequence of Kronecker product matrices for which the implicit time integration scheme can be obtained. This scheme is not constrained by the Courant–Friedrichs–Lewy (CFL) condition and is stable regardless of the time resolution. The cloud-native solver which adopts this method can be slower – by a platform-specific constant factor, connected to Floating Point Operations Per Second (FLOPS) – than its low-level and hardware-optimized HPC counterpart, but this gap in performance does not reduce the scalability of the solver. On the contrary, the cloud version of the solver allows an arbitrary mesh size as well as an arbitrary time step length (since the two are independent) to be picked, and linear computational cost computations to be run (assuming the number of computing threads is much lower than the number of elements in the computational mesh, which is a fair assumption in the light of resource utilization efficiency) on a cloud-managed data-processing service that has the potential to scale seamlessly in the pay-per-use model [47, 96, 106, 122].
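The stability contrast behind the CFL remark above can be illustrated on a scalar model problem (a standard textbook argument, included here only as an illustration and not taken from the dissertation): for $u' = -\lambda u$ with $\lambda > 0$, the explicit and implicit Euler schemes give

$$u^{n+1} = (1 - \lambda\,\Delta t)\,u^n \quad\text{(explicit)}, \qquad u^{n+1} = \frac{u^n}{1 + \lambda\,\Delta t} \quad\text{(implicit)},$$

so the explicit iteration remains bounded only when $|1 - \lambda\,\Delta t| \le 1$, i.e. $\Delta t \le 2/\lambda$, while the implicit one satisfies $|u^{n+1}| \le |u^n|$ for every $\Delta t > 0$. Since the largest eigenvalue of the discretized Laplacian grows like $h^{-2}$ as the mesh is refined, the explicit bound forces $\Delta t = O(h^2)$, which is precisely the coupling between mesh resolution and time step length that the implicit scheme removes.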


1.2. The purpose of the dissertation

1.2.1. Main thesis

The main scientific thesis of this dissertation is:

There is a way to implement an Alternating Directions Solver algorithm for isogeometric analysis simulations which leverages the tensor product structure of B-spline basis functions on regular patches of elements to construct unconditionally stable time integration schemes at linear computational cost. This enables the decomposition of the parallel solver algorithm into a set of basic indivisible tasks that are defined through graph-processing abstractions and can be efficiently scheduled and run in a data-parallel public cloud environment.

1.2.2. Structure of this study

The subsequent content of this study is divided into six chapters. Chapter 2 contains the derivation of the algorithms of direction splitting for IGA-FEM that leverage the properties of tensor product B-spline basis functions to obtain the Kronecker product structure of the mass and stiffness matrices. This results in finding an implicit time integration scheme which displays a linear O(N) computational cost for sequential execution. Chapter 2 also highlights how the most critical aspects of the numerical side of the solver, such as O(log(N)) complexity, can be obtained in parallel environments by leveraging the aforementioned Kronecker product structure of the matrices. Then, in Chapter 3, we decompose the solver algorithm into separate subsolvers which compartmentalize the complexity into manageable pieces. At the heart of this decomposition there is a set of indivisible tasks which perform atomic matrix operations – productions – and are together capable of modeling the underlying numerical processes. These tasks are used to construct the first implementation of the solver algorithm, which experimentally verifies the numerical properties of the novel implicit time integration scheme, while expressing this algorithm through a coherent object-oriented design. This implementation also constitutes the starting point for finding suitable abstractions for appropriate execution in the cloud environment. The resulting distributed memory version of the algorithm is described in Chapter 4. Chapter 5 presents the rationale behind selecting the particular cloud computing framework for the implementation of this task-based algorithm. Chapter 6 reports the results of the simulations performed with both solvers and their performance characteristics. Finally, the conclusions and possible directions for future research are presented in the last Chapter 7. This dissertation concludes with an Appendix, where we present the mathematical proof of the unconditional stability of the proposed implicit time integration scheme.

1.3. State of the art

IGA-FEM generates large systems of linear equations with sparse coefficient matrices for non-stationary problems [6, 10, 12, 23, 25, 35, 37, 54, 55, 74, 75, 128]. These non-stationary problems can be solved in either an explicit or an implicit way. In the explicit scheme, we factorize the system of equations with the mass matrix only once, at the beginning of the simulation, and then reuse that result for solving the new right-hand sides of each new time step through forward and backward substitution. Examples of such implementations are described in [76, 99]. In the implicit setup, we have to regenerate and factorize the system of linear equations in every time step. There are several ways of solving these systems of equations, once for the explicit setup and many times for the implicit setup (the control-flow difference between the two setups is sketched in the code example after the list below). This can be done:

– By employing an iterative solver for the entire system;
– By employing a multi-frontal solver for the entire system;


– By employing an alternating directions solver, which is the subject of this dissertation.
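The following minimal Java sketch contrasts the two setups described above. The helper names (factorize, substitute) and the diagonal stand-in for the mass matrix are illustrative assumptions, not the dissertation's API; the point is only where the factorization sits relative to the time loop:

```java
import java.util.Arrays;

/** Sketch: explicit setup factorizes once; implicit setup refactorizes every step. */
public class TimeSteppingSketch {
    // "Factorization": for a diagonal matrix this is just the reciprocals.
    static double[] factorize(double[] diag) {
        double[] inv = new double[diag.length];
        for (int i = 0; i < diag.length; i++) inv[i] = 1.0 / diag[i];
        return inv;
    }

    // "Forward/backward substitution": apply the stored factors to a right-hand side.
    static double[] substitute(double[] factors, double[] rhs) {
        double[] x = new double[rhs.length];
        for (int i = 0; i < rhs.length; i++) x[i] = factors[i] * rhs[i];
        return x;
    }

    public static void main(String[] args) {
        double[] mass = {2, 2, 2};
        double[] u = {1, 1, 1};

        // Explicit setup: factorize the (constant) mass matrix once,
        // then only generate a new right-hand side and substitute per step.
        double[] factors = factorize(mass);
        for (int step = 0; step < 3; step++) {
            double[] rhs = u.clone();      // new right-hand side each step
            u = substitute(factors, rhs);  // substitution only
        }

        // Implicit setup: the system matrix changes, so it must be
        // regenerated and refactorized in every time step.
        for (int step = 0; step < 3; step++) {
            double[] system = {2 + step, 2 + step, 2 + step}; // step-dependent matrix
            u = substitute(factorize(system), u);
        }
        System.out.println(Arrays.toString(u));
    }
}
```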

The iterative solvers are well described in [109] and can solve the IGA-FEM matrices at O(Nk) cost, where N is the number of unknowns and k is the number of iterations, which depends on the iterative solver algorithm and the conditioning of the matrix. A number of iterative solvers are available through the PetIGA framework [20].

The iterative solvers may require too many iterations due to poor conditioning of the system, in which case multi-frontal solvers are often used. A number of direct multi-frontal solvers that employ massive parallelization to provide good performance in solving these systems have been proposed over the years, such as the MUltifrontal Massively Parallel sparse direct Solver (MUMPS) [2], the Parallel Sparse matriX package (PaStiX) [51] and the supernodal sparse LU solver (SuperLU) [68], all of which have their own range of applicability and tradeoffs [1]. While most of these solutions are limited to shared memory environments, some, like MUMPS, can also be run in a distributed memory environment. In this case a tandem of MPI and Open Multi-Processing (OpenMP) [13] is used, in which OpenMP is utilized for efficient thread-based parallelization within each MPI process that runs on a different physical node, and MPI is used as the interface for message-passing-based communication between these processes. There is also an alternative approach to solving the systems generated by IGA-FEM, referred to as Graph-Grammar Multi-Frontal Solvers [99]. This type of solver was used to implement several Isogeometric Analysis (IGA) problems in one of the General Purpose GPU (GPGPU) environments, delivering unparalleled performance for relatively small meshes [127]. The multi-frontal solver, when applied to the entire system of linear equations in the first time step of the explicit method, or in the many time steps of the implicit method, is indeed expensive. Its computational complexity is $O(N^{1.5}p^3)$ for two-dimensional problems and $O(N^2p^3)$ for three-dimensional ones [18], where N stands for the number of unknowns and p stands for the B-spline order. The parallel implementations of these solvers cannot reduce the cost down to logarithmic computational complexity, but only to the linear complexity $O(Np^2)$ on an ideal parallel machine for two-dimensional problems [129]. In explicit simulations the left-hand side yields a mass matrix which, for the regular grids with B-spline basis functions that we use, has a Kronecker product structure and can be factorized sequentially at a linear O(N) computational cost, as shown in [29, 30]. This factorization still needs the multi-frontal solver for each of the two (in two dimensions) one-dimensional systems of equations with multi-diagonal matrices (or three systems in three dimensions), each system along a particular axis of the coordinate system. We can use any of these multi-frontal solvers for solving these 1-dimensional problems, but then our parallel implementation will also inherit the constraints of these solvers. Moreover, they would have to be called several times for each spatial direction, which makes such integration difficult and vulnerable to performance issues.
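To make the role of the Kronecker structure concrete, the following identity (standard linear algebra, not specific to the dissertation) shows how the 2D solve reduces to sweeps of 1D solves; here $M^x$ and $M^y$ denote the one-dimensional mass matrices, and $U$, $B$ are the unknowns and the right-hand side arranged as 2D arrays:

$$\left(M^x \otimes M^y\right)\mathrm{vec}(U) = \mathrm{vec}(B) \quad\Longleftrightarrow\quad M^y\, U\, (M^x)^T = B,$$

so one first solves $M^y V = B$ column by column, and then $M^x U^T = V^T$. Both stages involve only banded one-dimensional systems with many right-hand sides, which is the source of both the linear sequential cost and the massive parallelism over the right-hand sides.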

A number of explicit dynamics IGA-ADS implementations used these solvers in either sequential or parallel shared memory mode on Linux clusters, and in this case the actual factorization time usually took less than 1% of the total execution time [75, 76]. This is due to the fact that in explicit simulations we factorize the system only once and then generate new right-hand sides, for which we perform a single forward and backward substitution in each time step. There is also a parallel distributed memory implementation interfacing the Linear Algebra PACKage (LAPACK) dense solver over each sub-domain, which MPI gathers and scatters, targeting distributed memory Linux clusters [128]. It is known that the majority of the computational effort in the L2 projection problem – which is at the heart of this kind of IGA-FEM solvers – is spent on the initialization of the systems (generation of the matrices by numerical integration), which itself is a perfectly parallelizable process [76, 128]. The systems are initialized with the problem-dependent values that are derived from formulas which often involve numerical integration. Consequently, the more complex the problem, the higher the returns from parallelization [128].

Researchers have recognized the need to establish a general-purpose IGA framework that would allow various numerical simulations to be conducted with higher productivity by separating the researcher from the incidental complexities of problems that have already been solved, through the use of an appropriately defined Application Programming Interface (API). PetIGA [20] has been created along these lines. It is an extension of the Portable, Extensible Toolkit for Scientific Computation (PETSc) [52] – a suite of data structures and routines for the scalable solution of scientific applications modeled using Partial Differential Equations (PDE) – which provides ready-to-use interfaces for IGA. It can be used on both shared and distributed memory machines and can use both Central Processing Units (CPU) and GPGPUs, even simultaneously. It exposes an interface through which any external solver like MUMPS, PaStiX or SuperLU can be used for matrix factorizations. However, PetIGA does not support the alternating directions solver and uses either iterative or multi-frontal solvers, in both sequential and parallel setups.

The framework which leverages the alternating directions method is simply called IGA-ADS [76]. It allows explicit dynamics IGA-FEM simulations to be conducted for any non-stationary problem whose operator has a separable tensor-product structure. It uses a LAPACK solver with the Thomas algorithm for the sequential factorization of the one-dimensional systems with multi-diagonal matrices at the first time step, and the GALOIS framework [62] for the parallel generation of the new right-hand sides in the following time steps on a shared memory Linux cluster node. Both frameworks, PetIGA and IGA-ADS, have different areas of applicability as a consequence of significant differences in architecture and, in particular, the usage of different numerical methods.

To summarize, existing explicit dynamics IGA-FEM implementations are either consciously limited to shared memory environments because of the lack of intent to solve problems with larger meshes, or to distributed memory, in which case they use MPI for coordination between remote processes and a classic domain decomposition paradigm instead of data decomposition.

Some researchers have had some success in using cloud infrastructure for solving classic Finite Element Method (FEM) problems with a multi-frontal solver dedicated to the cloud environment [5]. For systems whose matrices do not display a Kronecker product structure, the alternating directions solver is not applicable and local mesh refinements are required. For this kind of problems, the solvers based on IGA-ADS cannot be applied. The main differences between that multi-frontal solver [5] and our alternating directions solver are that a) the former is dedicated to stationary problems and adaptive FEM computations, while ours is dedicated to time-dependent non-stationary IGA-FEM simulations, and that b) the former does not split directions and works on a 2-dimensional mesh whose tasks are light and related to the levels of adaptation of the mesh, while our solver splits directions, which results in multiple right-hand sides that can be processed with massive parallelism leveraging abundant cloud resources.

ADS was introduced to perform finite difference simulations for time-dependent problems [7, 24, 101, 124]. In the ADS method for finite difference simulations the PDE is discretized using a spatial stencil, e.g. the Laplace operator in two dimensions:

$$\frac{du}{dt} - \mathcal{L}u = f \qquad (1.1)$$

is discretized using central differences with respect to the directions $x$ and $y$:

$$\frac{du}{dt} - \mathcal{L}_x u - \mathcal{L}_y u = f \qquad (1.2)$$

$$\frac{du}{dt} - \frac{u_{i-1,j} - 2u_{i,j} + u_{i+1,j}}{h^2} - \frac{u_{i,j-1} - 2u_{i,j} + u_{i,j+1}}{h^2} = f \qquad (1.3)$$

The ADS method introduces time sub-steps, in which we approximate the time derivative by finite differences: in the odd sub-steps using $\frac{du}{dt} \approx \frac{u^{t+0.5} - u^{t}}{dt}$, and in the even sub-steps using $\frac{du}{dt} \approx \frac{u^{t+1} - u^{t+0.5}}{dt}$. The forcing on the right-hand side is always taken from the previous sub-step ($t$ in the first sub-step and $t+0.5$ in the second sub-step). In other words, the ADS method introduces the intermediate time steps

$$\frac{u^{t+0.5} - u^{t}}{dt} - \frac{u^{t+0.5}_{i-1,j} - 2u^{t+0.5}_{i,j} + u^{t+0.5}_{i+1,j}}{h^2} = \frac{u^{t}_{i,j-1} - 2u^{t}_{i,j} + u^{t}_{i,j+1}}{h^2} + f^{t} \qquad (1.4)$$

$$\frac{u^{t+1} - u^{t+0.5}}{dt} - \frac{u^{t+1}_{i-1,j} - 2u^{t+1}_{i,j} + u^{t+1}_{i+1,j}}{h^2} = \frac{u^{t+0.5}_{i,j-1} - 2u^{t+0.5}_{i,j} + u^{t+0.5}_{i,j+1}}{h^2} + f^{t+0.5} \qquad (1.5)$$

The resulting two systems of linear equations are tri-diagonal and can be solved at a linear computational cost. In [42] (received 13 February 2018; accepted 28 September 2018) we showed for the first time that the alternating direction splitting can be generalized to weak formulations of isogeometric analysis. This was the starting point for subsequent papers concerning the development of unconditionally stable alternating direction solvers for isogeometric analysis [41, 71, 104].
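As an illustration of why such systems cost O(n) to solve, here is the Thomas algorithm, the standard elimination for tridiagonal matrices, written as a generic textbook routine in Java (not the dissertation's implementation): one forward-elimination sweep followed by one backward-substitution sweep, each touching every unknown exactly once.

```java
/** Solves a tridiagonal system a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i] in O(n). */
public final class TridiagonalSolver {
    private TridiagonalSolver() {}

    public static double[] solve(double[] a, double[] b, double[] c, double[] d) {
        int n = d.length;
        double[] cp = new double[n]; // modified upper diagonal
        double[] dp = new double[n]; // modified right-hand side
        cp[0] = c[0] / b[0];
        dp[0] = d[0] / b[0];
        for (int i = 1; i < n; i++) {            // forward elimination
            double m = b[i] - a[i] * cp[i - 1];
            cp[i] = (i < n - 1) ? c[i] / m : 0.0;
            dp[i] = (d[i] - a[i] * dp[i - 1]) / m;
        }
        double[] x = new double[n];
        x[n - 1] = dp[n - 1];
        for (int i = n - 2; i >= 0; i--) {       // backward substitution
            x[i] = dp[i] - cp[i] * x[i + 1];
        }
        return x;
    }

    public static void main(String[] args) {
        // 1D Laplace-like stencil (-1, 2, -1); the exact solution is all ones.
        double[] a = {0, -1, -1, -1};
        double[] b = {2, 2, 2, 2};
        double[] c = {-1, -1, -1, 0};
        double[] d = {1, 0, 0, 1};
        for (double v : solve(a, b, c, d)) System.out.println(v);
    }
}
```

Note that the forward sweep (the factorization) depends only on the matrix, so for the explicit scheme it can be computed once and reused for every new right-hand side.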

MPI is a well-established and highly adopted multi-language message-passing standard which defines a set of low-level communication subroutines that allow program data to be exchanged transparently between both local and remote processes through send and receive instructions [33, 40]. It allows programmers to create distributed software which is portable between most installations of HPC and Beowulf clusters without having to address the differences in their architectures and with no significant losses in performance [52]. These differences are encapsulated in specific implementations of MPI such as OpenMPI, MPICH, HP-MPI or Intel MPI [40]. The Single Program, Multiple Data (SPMD) parallelization technique embodied by MPI programs is often used in combination with SIMD dominated by OpenMP [13]. In this setup, MPI is used to scale out by employing inter-process communication, while OpenMP is used to scale up through in-process multi-threading [106]. This tandem of technologies is known to display unparalleled performance and flexibility, for which reason it is the de facto standard in high-performance scientific programming in distributed memory environments [32, 106]. This performance, however, often comes at the price of increased incidental complexity [83], which is not directly related to the numerical method itself but to the means of implementing it in a particular environment – a concern which both standards either do not attempt to address or have failed to resolve [32, 52, 64, 106]. Many of these concerns are repetitive, which is the reason why several linear algebra libraries were written on top of MPI in the hope of increasing the productivity of working with distributed scientific software. Examples include the Scalable Linear Algebra PACKage (ScaLAPACK) [16] and, perhaps most notably, PETSc [4], which was also used in PetIGA [20] to a large extent. There were also attempts to streamline the efficiency further through the use of a modularized framework that would provide a set of segregated interfaces that adapt a variety of libraries to provide a consistent API to the end user [52]. Some of these attempts might in fact be considered the precursors – at least in certain aspects – of modern cloud computing frameworks. In general, while higher-level abstractions built on top of MPI exist, no widely-adopted general-purpose parallel programming library written for MPI exists to date, although the need for it was indicated many years before the idea of cloud computing came into existence [31].

Over the last decade there has been a radical shift in the design of computer clusters, many of which have been completely rebuilt to serve a new purpose. At the time MPI was created, computing power was a scarce resource, which is why the quality of the clusters was often measured in terms of the number of FLOPS they could support. The advent of big data has changed this objective, so that it has become more important to quickly execute simple operations over large volumes of data [126]. Compute-centric architectures consist of processing nodes connected by a high-speed interconnect with a parallel file system such as Lustre. Data-centric architectures, in contrast, combine these two roles, so that the processing nodes manage their own storage. This distinction has a profound impact on performance. Compute-centric architectures suffer from the overhead of distributed locking, which originates from the assumption that all data can be accessed by any node concurrently with the same latency and throughput. On the other hand, data-centric architectures greatly limit repetitive data movements within the cluster by partitioning the data between the nodes, so that each node is responsible for executing the operations on its own partitions. In such a setup there is no hard limit to throughput, as it scales out with the number of nodes. This also makes the partition of data the unit of parallelism, which allows one to think about data processing in a unified manner that is simple to understand, but also brings new challenges. The second difference is that while compute-centric architectures need to utilize specialized hardware to maintain optimal performance, data-centric architectures are assembled from commodity hardware which has become widely available with the rise of public clouds. This revolution in hardware architecture has also brought about new challenges for software which MPI was not designed to solve [32]. Perhaps the most serious one among them is fault-tolerance, for which the MPI community was unable to find a clear solution even with the introduction of the User-Level Failure Mitigation (ULFM) supplementary standard [64, 89]. Therefore, while it was even the subject of research for FEM-related applications [89], many solvers abandoned this feature, which might have been reasonable on highly-reliable hardware connected by a high-speed interconnect (but only under certain assumptions [89]), but could no longer be reasonable on heavily virtualized multi-tenant clusters connected with Software Defined Networks (SDN). MPI does not provide an integrated file system which could be used to store and retrieve distributed data efficiently. MPI does not feature dynamic resource management mechanisms, which makes it harder to leverage the elasticity of the public clouds to the advantage of reducing the simulation turnaround time [5]. These deficiencies also contribute to the problem of achieving fault-tolerance in MPI software [89]. Many applications historically bound to MPI were moved to novel cloud computing frameworks, most notably Hadoop [116] with its Yet Another Resource Negotiator (YARN) [121], as well as a great variety of compatible software, such as Apache Spark [134]. All of this software was built with the cloud in mind and with the objective of greatly reducing the complexity of running distributed software [116, 134]. While the MPI and OpenMP tandem has a natural inclination towards an imperative style of programming with low-level abstractions, cloud computing frameworks have defined high-level distributed data abstractions, such as the Resilient Distributed Dataset (RDD) [134], together with a complete operator language to manipulate it, which scientists can use directly to formulate functional workflows that solve the problem at hand with much less boilerplate code. These data structures contain partitioning information which is intrinsically tied to parallelism, substantially reducing the effort of managing the two independently.
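The programming-model difference can be seen in a minimal sketch using Spark's Java API (the application name, the collection contents and the norm computation are made-up illustrations, not taken from the solver): the distributed collection and its partitioning are first-class objects, and the whole computation is a functional pipeline rather than explicit sends and receives.

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RddSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("rdd-sketch").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // A distributed, partitioned collection; the partition is the unit of parallelism.
        JavaRDD<Double> coefficients = sc.parallelize(Arrays.asList(1.0, 2.0, 3.0, 4.0), 2);

        // A declarative pipeline: square each entry locally, then reduce across partitions.
        double sumOfSquares = coefficients.map(x -> x * x).reduce(Double::sum);
        System.out.println("L2 norm = " + Math.sqrt(sumOfSquares));

        sc.stop();
    }
}
```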

Researchers promptly investigated the possibility of using the Infrastructure-as-a-Service (IaaS) or Platform-as-a-Service (PaaS) [107] models provided by the clouds to remedy some of the typical obstacles the scientific communities were facing, which were predominantly connected with the tradeoffs of capacity planning [122, 135]. They pointed out that clouds allow the turnaround time to be reduced to near real time, even at peak usage times, by rapidly provisioning additional infrastructure, while at the same time eliminating the costs when there is no use for it. The availability of the infrastructure allows the researchers to strike the right balance between computational time, accuracy and cost, which can be computed in advance and reimbursed from specific grants [135]. The hardware as well as the software can be hand-picked by the researchers to match the needs of their simulations without affecting concurrent simulations [135]. Researchers have also pointed out that cloud computing ensures appropriate Quality of Service (QoS) [122] and improves the overall user experience [135]. The general consensus seems to be that loosely-coupled computational tasks and workflows can be run in cloud-provisioned MPI clusters with efficiency comparable to that of traditional HPC clusters, but this is not necessarily the case for highly-coupled parallel applications. This is not a consequence of lesser computing power or virtualization, but rather of a much slower interconnection network, particularly for fine-grained communication, which also displays high variability [135]. That being said, some scientific workflows have been reported to successfully utilize cloud infrastructure [5, 135]. Last but not least, moving towards an opportunistic computation model has the potential to bring a substantial reduction in the costs of simulations. The possibility of using spot instances, however, is predicated upon the solver's ability to recover from faults, which proves to be notoriously difficult to implement without built-in framework support, and even with it requires conscious programming, as increased resiliency can easily harm performance [89, 102, 117]. While cloud computing frameworks were built with fault-tolerance in mind [134], MPI almost completely disregards this feature [64, 89], which is one of the prerequisites for cloud-native software [97]. Researchers have also pointed out that for similar reasons MPI-powered long-running scientific computations with variable load characteristics cannot leverage elastic scalability [49]. To be fair, cost-efficiency is also often neglected in cloud computing frameworks, which have their origins in, and are utilized predominantly in, private clouds that belong to large companies trying to solve their business domain problems [49].

Recognizing the benefits of a data-centric style of programming and observing the gradual transition away from classic HPC clusters, part of the HPC community has attempted to evaluate the performance of selected cloud computing frameworks on compute-centric clusters in the hope of utilizing their existing architectures to serve the new use case. This research has shown that although this is technically possible, it is also not efficient in the majority of use cases [126]. On the other hand, the newly formed cloud community has been trying to expand the horizons of what cloud computing can be used for, moving towards more scientific use cases. The best example of this phenomenon is machine learning, represented by the Spark Machine Learning Library [87]. These use cases usually involve large systems of linear equations which require substantial compute power to solve. The researchers compared the performance of different implementations of MPI and Spark in the context of low-rank matrix factorizations, which are one of the more important tools in linear algebra and numerical analysis and for this reason are used in many branches of science and engineering. Using clusters of up to 1600 nodes, they arrived at the conclusion that the performance gap at the time the experiments were conducted ranged from 2-25 times with I/O included and 10-40 times without I/O included, in favour of MPI. They found the primary reason for this in inefficient scheduling, stragglers, result serialization and task deserialization, which accumulate over the iterations to dominate the runtime by an order of magnitude. They did, however, point out that Spark displays good horizontal scalability and that with appropriate optimizations it should be able to close this efficiency gap [32, 33]. Other researchers arrived at similar conclusions independently, comparing a tandem of OpenMP and MPI with Spark running on commodity hardware in the Google Cloud Platform (GCP), solving the classic Singular Value Decomposition (SVD) and K-Nearest Neighbors (KNN) problems [106]. They also pointed out that while there are many quality packages for solving such problems in HPC, there are still only a few in the case of cloud computing platforms. Certain practical simulations have also been performed on top of Spark, such as one in the field of High-Energy Physics [113]. They reported that it was roughly twice as slow as the corresponding MPI software but maintained good horizontal scalability without much tuning, as opposed to MPI, for which maintaining good scalability is much more complex on its own [113]. They also attributed this deficiency primarily to the lack of high-performance linear algebra libraries which employ vectorization for Spark. While certain options have been made available since then, they have still been unable to match what MPI has to offer [33]. They also stressed the fact that while Spark isolates the user from the complexities of running distributed software on excessively large data sets, it also attempts to optimize the workloads in a way that is not always ideal. Moreover, abstractions, as convenient as they are, also bring certain overheads by their mere existence. On the other hand, the authors confirm that MPI fails to address the needs of data-intensive workloads and that there is potential for improving Spark's performance, so its use for numerical analysis is not definitively ruled out.

Some researchers have also been looking for ways to increase the performance of cloud computing frameworks in compute-bound use cases, while retaining their highly appreciated benefits that increase productivity. The majority of effort in this area has been dedicated to Spark, which, by addressing the needs of iterative computations, has become more comparable to HPC [39]. These attempts fall into two categories. Libraries such as Smart [125], Spark-MPI [80] and MPIgnite [90] attempt to re-implement Spark on top of low-level MPI communication routines. They often diverge from the original programming model of Spark and lose compatibility with standard Hadoop installations [33]. Spark+MPI [3] and Alchemist [33] employ a different strategy, in which they transfer the data from Spark workers to the MPI processes and back once the compute-intensive task is complete. By doing so they can retain the original programming model and partially leverage built-in fault-tolerance by isolating the failure to the closest computation stage. Spark+MPI utilizes the Hadoop Distributed File System (HDFS) [116] to transfer the data back and forth between MPI and Spark, which is simple and fault-tolerant but displays significant I/O overheads [3] and is probably unfit for processing the dense matrices of linear algebra [33]. To remedy this deficiency, socket-based communication was proposed in Alchemist, which greatly improves the throughput but does not eliminate this potential bottleneck completely either. Moreover, it requires two copies of the data to be stored, which greatly decreases memory-efficiency [33]. Regardless of this, this recent project addresses the specific problem of inefficient linear algebra transformations within existing Spark workflows and for this reason is worth following. In general, while modern cloud computing frameworks like Spark [134] definitely did not go unnoticed in the broader academic community and have become critical to certain branches of science itself, such as data analytics and artificial intelligence studies, only a few papers discuss the possibility of applying them as the backbone of a wider range of numerical solvers [32, 33, 106, 113]. What seems to be a common consensus is that the possibility of interfacing and integrating MPI with Spark is worth investigating, as on the one hand it might bring performance improvements, and on the other better support data-intensive workloads.

Some researchers, instead of comparing the performance of the traditional compute-centric clusters with data-centric clusters on the computational tasks that both can do, have focused on solving the problems of the iterative processing of large graphs, which are notoriously difficult to scale and parallelize [79] and were known long before the cloud computing frameworks came into existence. Researchers had been aware that the software and hardware that were powering mainstream scientific simulations were not necessarily effective for solving graph problems [78]. The big data movement shifted the focus from compute-driven to data-driven applications, providing new types of clusters and software that could better exploit data locality. Researchers from Google argued that the issue with processing large graphs is not necessarily the structure of the data itself, but a false perspective on the process that the majority of graph algorithms share [79]. Most of them employed a shared memory view of the graph, based on the assumption that accessing any element of the graph yields the same cost, which was not true for HPC clusters [78] and even less true in the case of the cloud [49]. Pregel [79], discussed in depth later in this study, was created as a distributed graph-processing framework capable of running PageRank at the Internet scale by extending the old Bulk Synchronous Parallel (BSP) model [120]. This idea was later adopted in a number of open-source frameworks [46], which are also compared and contrasted in this study while selecting the one most suitable for implementing the IGA-ADI distributed solver. As the novel "Think Like a Vertex" (TLAV) philosophy is inherently data-centric, there are compelling reasons why the traditional SPMD parallelization technique embodied in MPI might be less attractive for this particular use case than the unified SIMD model employed by the cloud computing frameworks [49, 106, 113].
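To ground the TLAV idea, the following is a minimal vertex-centric program written against the Giraph API (it is the classic maximum-value propagation example from the Pregel paper, not the dissertation's solver; the class name is illustrative): each vertex keeps the largest value seen so far, forwards it to its neighbours only when it changes, and the computation halts once no vertex produces a new message.

```java
import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;

/** Propagates the maximum vertex value through the graph, one BSP superstep at a time. */
public class MaxValueComputation
        extends BasicComputation<LongWritable, DoubleWritable, NullWritable, DoubleWritable> {

    @Override
    public void compute(Vertex<LongWritable, DoubleWritable, NullWritable> vertex,
                        Iterable<DoubleWritable> messages) {
        boolean changed = false;
        // Fold incoming messages into the local vertex state.
        for (DoubleWritable message : messages) {
            if (vertex.getValue().get() < message.get()) {
                vertex.setValue(new DoubleWritable(message.get()));
                changed = true;
            }
        }
        // Forward the value only on the first superstep or after an update;
        // otherwise the vertex stays dormant until a new message arrives.
        if (getSuperstep() == 0 || changed) {
            sendMessageToAllEdges(vertex, vertex.getValue());
        }
        vertex.voteToHalt();
    }
}
```

The design point is that all state and all communication are local to a vertex, which is what makes this model partitionable across a data-centric cluster without the shared memory assumption criticized above.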

1.4. Summary of open problems

This study focuses predominantly on improving the efficiency of IGA-ADI simulations for large-scale problems. The issues that have to be addressed to attain this goal originate from the deficiencies of the computational method as well as from the lack of appropriate cloud-native representations, and can be summarized as follows:

– The alternating directions solver was not used for time-dependent implicit simulations with a finite element method using B-spline basis functions;

– There was no mathematical proof of the unconditional stability of the implicit simulations with the alternating directions solver and finite element method discretisations with B-spline basis functions;

– IGA-FEM simulations that use the explicit time-stepping discretization method require a time resolution that increases with the number of finite elements to maintain stability. This makes processing large problems impractical due to drastically increasing computation times: individual time steps take longer to process because they feature more degrees of freedom, and it is also necessary to process a large number of otherwise useless intermediary steps;


– No distributed memory IGA-ADI frameworks exist to date. The only distributed memory framework for IGA is dedicated to parallel L2 projections and explicit dynamics simulations; it does not support implicit dynamics simulations and is reported to require substantial effort to implement IGA-ADS on top of it [128]. It is based on MPI and works on highly interconnected data, which limits the possibilities of leveraging public clouds;

– IGA simulations, like many other scientific simulations, suffer from an increased turnaround time caused by limited hardware availability due to inaccuracies in capacity planning. In practice, this most often rules out iterative experimentation which might bring new perspectives to research, and limits their original scope;

– Most implementations of IGA-FEM solvers are custom-built. This limits the possibility for reuse and decreases the efficiency of conducting research. Certain types of solvers might be too complex to implement without properly compartmentalizing the complexity;

– Cloud-computing frameworks were never evaluated or proposed in the context of implementing solvers for IGA. This is also the case for most scientific solvers, which are still limited to shared memory environments or MPI-based HPC or Beowulf clusters. Cloud-computing frameworks are well-adopted in the industry but mostly unexplored in the scientific community, which is also the reason they might be missing optimizations necessary for strictly numerical use;

– Cloud-computing frameworks leverage data-parallelism, which provides linear scalability for flat data structures. That being said, it is notoriously difficult to use this model on data with high interdependencies, as maintaining good parallelization increases the communication overhead while isolating the data decreases the parallelism.

1.5. Major scientific findings

In this dissertation we not only propose solutions to some of the problems mentioned above, but also explore new possibilities of running academic software which might substantially improve the productivity of research. The major findings may be summarized as follows:

– The alternating directions solver applied for time-dependent simulations using finite element method discretizations with B-spline basis functions results in an unconditionally stable implicit time integration scheme. This solver leverages the tensor product structure of B-spline basis functions over a regular patch of elements to separate the 1-dimensional components of the B-spline basis functions and derive an unconditionally stable time integration scheme. Consequently, it can be used to solve problems with any practical mesh size and any time step size, where the accuracy and solution time are a function of these two factors.

– Partitioning of the elimination tree may be a better alternative to the traditional domain decomposition approach. This makes IGA-ADI effectively a large graph problem which can be reasoned about in the corresponding terms and solved using a variety of modern cloud-computing frameworks. It is optimized for better throughput. This parallelization strategy can also be combined with the domain decomposition approach to achieve better performance for a subset of problems.

– IGA-ADI can be represented as a vertex-centric iterative graph algorithm and leverage modern cloud computing frameworks designed for this purpose. The algorithm can be distributed alongside the data to exploit the data locality feature of the graphs, which greatly reduces the communication overhead and benefits horizontal scalability.


– The cost of performing large-scale IGA-ADI simulations can be significantly reduced when run on a cluster of preemptible nodes. Cloud computing frameworks are resilient by design, which enables the use of preemptible (spot) instances that are on average four times cheaper than regular instances with exactly the same hardware. Taking the elasticity of the cloud into account, the computations can be performed faster at the same cost or cheaper at the same speed.


Chapter 2

Isogeometric Alternating Directions Implicit Algorithm (IGA-ADI)

The IGA-ADI can be used for performing fast simulations of time-dependent problems of the form given by equation (2.1) with some predefined initial configuration and boundary conditions

\[ \frac{du}{dt} - Lu = f \tag{2.1} \]

where $L = L_x + L_y$ is a separable differential operator, e.g. the Laplacian $Lu = \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2}$. Following the ideas for the explicit dynamics solver [76] we can derive the implicit dynamics solver and represent it in terms of basic indivisible tasks which form a Directed Acyclic Graph (DAG).

First, we apply the alternating direction method with respect to time. We introduce the intermediate time steps in order to separate the differential operator and to process the x derivatives in the first sub-step given by equation (2.2), and the y derivatives in the second sub-step given by equation (2.3), which is the key operation behind this method.

\[ \frac{u_{t+0.5} - u_t}{dt} - L_x u_{t+0.5} - L_y u_t = f_t \tag{2.2} \]

\[ \frac{u_{t+1} - u_{t+0.5}}{dt} - L_x u_{t+0.5} - L_y u_{t+1} = f_{t+0.5} \tag{2.3} \]

We can multiply each equation by $dt$ and move the terms

\[ u_{t+0.5} - dt \, L_x u_{t+0.5} = u_t + dt \, L_y u_t + dt \, f_t \tag{2.4} \]

\[ u_{t+1} - dt \, L_y u_{t+1} = u_{t+0.5} + dt \, L_x u_{t+0.5} + dt \, f_{t+0.5} \tag{2.5} \]
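Before moving on, a small sketch may help fix the structure of equations (2.4)-(2.5) in finite-dimensional terms. The code below is an illustration only; the discrete operators `Lx`, `Ly` and the uniform rectangular grid are assumptions made here, not the dissertation's formulation. Each half-step is implicit in one direction, so it amounts to independent one-dimensional solves.

```python
import numpy as np

def adi_step(u, Lx, Ly, f_t, f_half, dt):
    """One implicit ADI time step for du/dt - (Lx + Ly) u = f, cf. (2.4)-(2.5).

    u, f_t, f_half: 2D arrays of shape (ny, nx); Lx acts along the x index
    (rows of length nx), Ly along the y index. Each sub-step couples unknowns
    along one direction only, hence the dense solves below can be replaced by
    banded solves in the B-spline setting.
    """
    ny, nx = u.shape
    Ix, Iy = np.eye(nx), np.eye(ny)
    # Sub-step (2.4): implicit in x, explicit in y
    rhs = u + dt * (Ly @ u) + dt * f_t
    u_half = np.linalg.solve(Ix - dt * Lx, rhs.T).T   # one 1D solve per grid row
    # Sub-step (2.5): implicit in y, explicit in x
    rhs = u_half + dt * (u_half @ Lx.T) + dt * f_half
    return np.linalg.solve(Iy - dt * Ly, rhs)         # one 1D solve per grid column
```

The same splitting carries over to the weak formulation derived next; only the meaning of the one-dimensional systems changes.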

If we compare the implicit dynamics solver to the explicit dynamics solver [76], we can see that now on the left-hand side we also keep the derivatives in their respective directions. In other words, we include the differential operator in the $L^2$ projection process. Now, we transform the problem into the weak form by taking the $L^2$ scalar products with test functions $v$

\[ (u_{t+0.5} - dt \, L_x u_{t+0.5}, v)_{L^2} = (u_t + dt \, L_y u_t + dt \, f_t, v)_{L^2} \tag{2.6} \]

\[ (u_{t+1} - dt \, L_y u_{t+1}, v)_{L^2} = (u_{t+0.5} + dt \, L_x u_{t+0.5} + dt \, f_{t+0.5}, v)_{L^2} \tag{2.7} \]

Let us focus on the first one moving forward. Written explicitly, with the test function $v$ under the integral, it reads

\[ \int \left( u_{t+0.5} - dt \, \frac{\partial^2 u_{t+0.5}}{\partial x^2} \right) v \, dx \, dy = \int (u_t + dt \, L_y u_t + dt \, f_t) \, v \, dx \, dy \tag{2.8} \]


and then integrate by parts the second of its terms

\[ \int u_{t+0.5} \, v \, dx \, dy + dt \int \frac{\partial u_{t+0.5}}{\partial x} \frac{\partial v}{\partial x} \, dx \, dy = \int (u_t + dt \, L_y u_t + dt \, f_t) \, v \, dx \, dy \tag{2.9} \]

It can be shown that the higher-dimensional B-spline basis can be constructed as a tensor product of as many 1-dimensional B-spline bases as there are dimensions

\[ B^n_{X_1, \dots, X_n} = B^1_{X_1} \otimes B^1_{X_2} \otimes \cdots \otimes B^1_{X_n} \tag{2.10} \]
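As an illustration of property (2.10) in two dimensions, the snippet below evaluates a bivariate basis function as a product of two univariate B-splines. This is a sketch only: the use of scipy, the uniform knot line, and the indexing convention are choices made here for illustration.

```python
import numpy as np
from scipy.interpolate import BSpline

p = 2
knots = np.arange(-p, 8, dtype=float)    # uniform knot line, assumed for illustration

def basis_1d(i):
    # univariate B-spline of degree p supported on knots[i] .. knots[i + p + 1]
    return BSpline.basis_element(knots[i:i + p + 2], extrapolate=False)

def basis_2d(i, j, x, y):
    # N_{i,j;p}(x, y) = B^x_{i;p}(x) * B^y_{j;p}(y), cf. (2.10)
    bx = np.nan_to_num(basis_1d(i)(x))   # extrapolate=False yields NaN outside
    by = np.nan_to_num(basis_1d(j)(y))   # the support, mapped here to zero
    return bx * by

print(basis_2d(2, 3, 1.5, 2.5))
```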

where $X_k$ denotes the $k$-th dimension, $k = 1, \dots, n$. Using this property we discretize our problem using a two-dimensional B-spline basis whose elements are given as $N_{i,j;p}(x, y) = B^x_{i;p}(x) B^y_{j;p}(y)$. We approximate $u_{t+0.5} = \sum_{i,j} u^{t+0.5}_{i,j} B^x_{i;p}(x) B^y_{j;p}(y)$ and use the test functions $v = B^x_{k;p}(x) B^y_{l;p}(y)$. The equation (2.9) becomes

\[ \sum_{i,j} u^{t+0.5}_{i,j} \left[ \int B^x_{i;p}(x) B^y_{j;p}(y) B^x_{k;p}(x) B^y_{l;p}(y) \, dx \, dy + dt \int \frac{\partial \big( B^x_{i;p}(x) B^y_{j;p}(y) \big)}{\partial x} \frac{\partial \big( B^x_{k;p}(x) B^y_{l;p}(y) \big)}{\partial x} \, dx \, dy \right] = \int (u_t + dt \, L_y u_t + dt \, f_t) \, B^x_{k;p}(x) B^y_{l;p}(y) \, dx \, dy \]

Now, since $\frac{\partial}{\partial x} B^y_{j;p}(y) = 0$ and $\frac{\partial}{\partial x} B^y_{l;p}(y) = 0$, we can skip the terms which prevented the matrix from being a tensor product. Indeed,

\[ \frac{\partial}{\partial x} \big( B^x_{i;p}(x) B^y_{j;p}(y) \big) = \frac{\partial B^x_{i;p}(x)}{\partial x} B^y_{j;p}(y) + B^x_{i;p}(x) \frac{\partial B^y_{j;p}(y)}{\partial x} = \frac{\partial B^x_{i;p}(x)}{\partial x} B^y_{j;p}(y) \]

which implies

\[ \sum_{i,j} u^{t+0.5}_{i,j} \left[ \int B^x_{i;p}(x) B^y_{j;p}(y) B^x_{k;p}(x) B^y_{l;p}(y) \, dx \, dy + dt \int \frac{\partial B^x_{i;p}(x)}{\partial x} B^y_{j;p}(y) \frac{\partial B^x_{k;p}(x)}{\partial x} B^y_{l;p}(y) \, dx \, dy \right] = \sum_{i,j} u^{t+0.5}_{i,j} \Psi^x = \int (u_t + dt \, L_y u_t + dt \, f_t) \, B^x_{k;p}(x) B^y_{l;p}(y) \, dx \, dy \]

Now, on the left-hand side, we can group the B-splines of $\Psi^x$ with respect to $y$ and take out the common part, as $\int B^x_{i;p}(x) B^y_{j;p}(y) B^x_{k;p}(x) B^y_{l;p}(y) \, dx \, dy = \int B^x_{i;p}(x) B^x_{k;p}(x) \, dx \int B^y_{j;p}(y) B^y_{l;p}(y) \, dy$:

\[ \Psi^x = \int B^x_{i;p}(x) B^x_{k;p}(x) B^y_{j;p}(y) B^y_{l;p}(y) \, dx \, dy + dt \int \frac{\partial B^x_{i;p}(x)}{\partial x} \frac{\partial B^x_{k;p}(x)}{\partial x} B^y_{j;p}(y) B^y_{l;p}(y) \, dx \, dy \]

so our left-hand-side sum of integrals $\Psi$ becomes the Kronecker product of two integrals

\[ \Psi^x = \left[ \int \left( B^x_{i;p}(x) B^x_{k;p}(x) + dt \, \frac{\partial B^x_{i;p}(x)}{\partial x} \frac{\partial B^x_{k;p}(x)}{\partial x} \right) dx \right] \otimes \left[ \int B^y_{j;p}(y) B^y_{l;p}(y) \, dy \right] \tag{2.11} \]

We can apply the same process to the second equation (2.7), where we just replace the derivatives with respect to the $x$ direction with the derivatives with respect to the $y$ direction

\[ \Psi^y = \left[ \int \left( B^y_{i;p}(y) B^y_{k;p}(y) + dt \, \frac{\partial B^y_{i;p}(y)}{\partial y} \frac{\partial B^y_{k;p}(y)}{\partial y} \right) dy \right] \otimes \left[ \int B^x_{j;p}(x) B^x_{l;p}(x) \, dx \right] \tag{2.12} \]

This way both sides of the Kronecker product contain the products of one-dimensional B-splines along the same direction: $x$ and $y$ for equation (2.11), and $y$ and $x$ for equation (2.12), respectively. Each B-spline from a 1-dimensional B-spline basis of order $p$ spans exactly $p + 1$ knot intervals of its domain [18, 19]. This means that the integral of the product of any two B-splines from the same 1-dimensional B-spline basis whose indices differ by more than $p$ is zero, so the resulting matrices are banded with at most $2p + 1$ non-zero diagonals, i.e.

\[ \int_0^1 B^x_{i;p}(x) B^x_{j;p}(x) \, dx = \begin{cases} 0 & \text{if } |i - j| > p \\ \text{const} \neq 0 & \text{otherwise} \end{cases} \]


Additionally, if we scale each element the B-spline pair spans over to be of unit size $[0, 1]$, we can observe that the product of the B-splines depends not on their specific positions $i$ and $j$ in the domain but on the difference of their positions. We can therefore conclude that each element is described by the same $(p + 1) \times (p + 1)$ square coefficient matrix regardless of the specific problem solved. For $p = 2$ we have to compute a total of 9 integrals in a $3 \times 3$ matrix with only 4 unique values due to the symmetry of the matrix, which can be computed analytically and represented as the following fractions

\[ \begin{bmatrix} \int_0^1 B^x_{1;2} B^x_{1;2} \, dx & \int_0^1 B^x_{1;2} B^x_{2;2} \, dx & \int_0^1 B^x_{1;2} B^x_{3;2} \, dx \\ \int_0^1 B^x_{2;2} B^x_{1;2} \, dx & \int_0^1 B^x_{2;2} B^x_{2;2} \, dx & \int_0^1 B^x_{2;2} B^x_{3;2} \, dx \\ \int_0^1 B^x_{3;2} B^x_{1;2} \, dx & \int_0^1 B^x_{3;2} B^x_{2;2} \, dx & \int_0^1 B^x_{3;2} B^x_{3;2} \, dx \end{bmatrix} = \begin{bmatrix} \frac{1}{20} & \frac{13}{120} & \frac{1}{120} \\ \frac{13}{120} & \frac{9}{20} & \frac{13}{120} \\ \frac{1}{120} & \frac{13}{120} & \frac{1}{20} \end{bmatrix} \tag{2.13} \]
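These fractions can be checked symbolically. The sketch below is a verification aid added here, not part of the original derivation; the three polynomials are the standard restrictions of the overlapping uniform quadratic B-splines to the reference element $[0, 1]$, and the script reproduces the matrix in (2.13) exactly.

```python
import sympy as sp

x = sp.symbols('x')

# Restrictions to [0, 1] of the three uniform quadratic B-splines (p = 2)
# that are non-zero over the reference element:
B = [
    (1 - x)**2 / 2,            # spline whose support ends at this element
    (-2*x**2 + 2*x + 1) / 2,   # spline centered over this element
    x**2 / 2,                  # spline whose support starts here
]

M = sp.Matrix(3, 3, lambda i, j: sp.integrate(B[i] * B[j], (x, 0, 1)))
print(M)  # [1/20, 13/120, 1/120; 13/120, 9/20, 13/120; 1/120, 13/120, 1/20]
```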

The second component of the left Kronecker product is also a constant, dependent only on the order $p$ of the B-splines used. Moreover, it can be shown that the coefficient matrices are equal for all directions. Let us go back to equation (2.11), which can be presented in a generic matrix form

\[ M x = b \tag{2.14} \]

where $M = A \otimes B$, $A$ is an $n \times n$ square matrix and $B$ is an $m \times m$ square matrix, so

\[ M = \begin{bmatrix} \int B^x_{1,p} B^x_{1,p} dx \int B^y_{1,p} B^y_{1,p} dy & \cdots & \int B^x_{1,p} B^x_{N_x,p} dx \int B^y_{1,p} B^y_{N_y,p} dy \\ \vdots & \ddots & \vdots \\ \int B^x_{N_x,p} B^x_{1,p} dx \int B^y_{N_y,p} B^y_{1,p} dy & \cdots & \int B^x_{N_x,p} B^x_{N_x,p} dx \int B^y_{N_y,p} B^y_{N_y,p} dy \end{bmatrix} \tag{2.15} \]

is algebraically equivalent to

\[ \begin{bmatrix} \int B^x_{1,p} B^x_{1,p} dx & \cdots & \int B^x_{1,p} B^x_{N_x,p} dx \\ \vdots & \ddots & \vdots \\ \int B^x_{N_x,p} B^x_{1,p} dx & \cdots & \int B^x_{N_x,p} B^x_{N_x,p} dx \end{bmatrix} \otimes \begin{bmatrix} \int B^y_{1,p} B^y_{1,p} dy & \cdots & \int B^y_{1,p} B^y_{N_y,p} dy \\ \vdots & \ddots & \vdots \\ \int B^y_{N_y,p} B^y_{1,p} dy & \cdots & \int B^y_{N_y,p} B^y_{N_y,p} dy \end{bmatrix} \]

From the properties of the Kronecker product we have $M^{-1} = (A \otimes B)^{-1} = A^{-1} \otimes B^{-1}$. This implies the following derivation of the linear $O(N)$ computational cost solver algorithm. From the definition of the Kronecker product we know that

\[ M = A \otimes B = \begin{bmatrix} A B_{11} & A B_{12} & \cdots & A B_{1m} \\ A B_{21} & A B_{22} & \cdots & A B_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ A B_{m1} & A B_{m2} & \cdots & A B_{mm} \end{bmatrix} \tag{2.16} \]

The right-hand-side and the solution vectors are partitioned into $m$ blocks of size $n$ each

\[ x_i = (x_{i1}, \dots, x_{in})^T \qquad b_i = (b_{i1}, \dots, b_{in})^T \tag{2.17} \]

We can rewrite our system as a block matrix equation

\[ \begin{cases} A B_{11} x_1 + A B_{12} x_2 + \cdots + A B_{1m} x_m = b_1 \\ A B_{21} x_1 + A B_{22} x_2 + \cdots + A B_{2m} x_m = b_2 \\ \quad \vdots \\ A B_{m1} x_1 + A B_{m2} x_2 + \cdots + A B_{mm} x_m = b_m \end{cases} \]
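The block structure above is what makes the linear-cost solve possible: the full matrix $M$ never has to be formed explicitly. The sketch below is an illustration added here, not the dissertation's solver; it uses numpy's row-major vectorization convention, under which $(A \otimes B)\,\mathrm{vec}(X) = \mathrm{vec}(A X B^T)$, and the function name and dense test matrices are assumptions.

```python
import numpy as np

def kron_solve(A, B, b):
    """Solve (A kron B) x = b without forming the Kronecker product.

    With row-major vectorization, (A kron B) vec(X) = vec(A X B^T),
    so solving the full system reduces to X = A^{-1} C B^{-T} with
    C = mat(b). For the banded 1D B-spline matrices of IGA-ADI, the
    two dense solves below become banded solves of linear total cost.
    """
    n, m = A.shape[0], B.shape[0]
    C = b.reshape(n, m)                  # mat(b), row-major unvectorization
    Y = np.linalg.solve(A, C)            # apply A^{-1} from the left
    X = np.linalg.solve(B, Y.T).T        # apply B^{-T} from the right
    return X.reshape(-1)                 # re-vectorize the solution

# quick self-check against the naive O((nm)^3) dense solve
rng = np.random.default_rng(0)
n, m = 5, 4
A = rng.random((n, n)) + n * np.eye(n)   # diagonally shifted to stay invertible
B = rng.random((m, m)) + m * np.eye(m)
b = rng.random(n * m)
assert np.allclose(kron_solve(A, B, b), np.linalg.solve(np.kron(A, B), b))
```

Replacing the dense np.linalg.solve calls with a banded routine such as scipy.linalg.solve_banded is what would turn this sketch into the $O(N)$ direct solver discussed above.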
