AUTOMATIC PARALLELIZATION OF FINITE ELEMENT CFD CODE USING HPC MIDDLEWARE
Satoshi ITO* and Hiroshi OKUDA†
* Collaborative Research Center for Computational Science and Technology, Institute of Industrial Science, The University of Tokyo
Komaba 4-6-1, Meguro-ku, Tokyo 153-8505, Japan
e-mail: ito@fsis.iis.u-tokyo.ac.jp
† Research into Artifacts, Center for Engineering (RACE), The University of Tokyo, Kashiwanoha 5-1-5, Kashiwa 277-8568, Japan
e-mail: okuda@race.u-tokyo.ac.jp
Web page: http://nihonbashi.race.u-tokyo.ac.jp/
Key words: Finite element method, Parallel computing, PSE, HPC-MW
Abstract. Computational fluid analysis requires huge amounts of computational resources, yet for most developers parallelization is difficult because it demands specialized programming skills. In this study, the “HPC Middleware” is employed, which supports parallelization and optimization so that parallel code can be developed easily. The parallel efficiency of the developed code is also discussed.
1 INTRODUCTION
Parallel computing involves unique techniques such as domain decomposition, message passing, and vectorization. The extra work required to implement these techniques is extremely burdensome for application developers, often resulting in time-consuming development and buggy code. In addition, since available architectures range from PC clusters to SMP clusters, the optimization strategies needed to exploit the hardware efficiently vary from case to case. For the development of a parallel finite element fluid analysis code, the present study employs the “HPC Middleware” [1] (http://hpcmw.tokyo.rist.or.jp/index_en.html), which has been developed under the “Frontier Simulation Software for Industrial Science” project (http://www.fsis.iis.u-tokyo.ac.jp/) at the Institute of Industrial Science (IIS), The University of Tokyo, as an IT-program research project under Research Revolution 2002, organized by the Ministry of Education, Culture, Sports, Science and Technology, Japan.
2 HPC MIDDLEWARE
2.1 Concept of HPC middleware
Analysis code is developed on top of the platform, and parallelization is achieved by plugging in the HPC-MW (Fig. 1).
Figure 1: Utilization of ‘HPC-MW’
To achieve this, we abstract FEM procedures and extract common patterns. By providing these patterns as functions of the platform, a developer is able to generate analysis code without hand-coding common FEM procedures. In addition, the developed code is automatically parallelized, because the parallelization techniques are associated with these FEM procedures.
2.2 The system
FEM codes consist of four main processes: I/O, construction of the stiffness matrix, calculation of the right-hand side vector, and the solver. All four are provided as HPC-MW subroutines. Table 1 shows examples of HPC-MW subroutines.
In I/O, the data structure is extremely important because it directly affects the performance of the developed code. Here, by data structure we mean the memory layout of matrices and vectors. HPC-MW employs the compressed row storage (CRS) format, which suits the sparse matrices that arise in FEM. Because it also saves memory, it is indispensable for large-scale analysis. HPC-MW additionally provides automatically parallelized subroutines for mesh data input and result data output.
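To make the CRS layout concrete, the following sketch builds the three CRS arrays for a small sparse matrix. This is purely illustrative of the format; the actual internal layout of HPC-MW may differ.

```python
# Building the compressed row storage (CRS) arrays for a small sparse
# matrix. Illustrative sketch of the format, not HPC-MW's internal layout.
dense = [
    [4.0, 0.0, 1.0, 0.0],
    [0.0, 3.0, 0.0, 2.0],
    [1.0, 0.0, 5.0, 0.0],
    [0.0, 2.0, 0.0, 6.0],
]

values  = []   # nonzero entries, stored row by row
col_idx = []   # column index of each nonzero
row_ptr = [0]  # nonzeros of row i live in values[row_ptr[i]:row_ptr[i+1]]
for row in dense:
    for j, v in enumerate(row):
        if v != 0.0:
            values.append(v)
            col_idx.append(j)
    row_ptr.append(len(values))

print(values)   # [4.0, 1.0, 3.0, 2.0, 1.0, 5.0, 2.0, 6.0]
print(col_idx)  # [0, 2, 1, 3, 0, 2, 1, 3]
print(row_ptr)  # [0, 2, 4, 6, 8]
```

Only the 8 nonzeros of the 16 entries are stored here; for FEM matrices, where each row has only a handful of nonzeros out of millions of columns, the memory saving is what makes large-scale analysis feasible.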
Although the CRS format is well suited to large-scale FEM analysis, it is rather complicated to work with, and it must be used when constructing the global stiffness matrix. HPC-MW therefore provides a subroutine for this construction. The subroutine requires only the element stiffness matrix, which it assembles into the global matrix while also checking the CRS-format data.
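The assembly step can be sketched as follows: given a global CRS structure whose sparsity pattern already covers the mesh connectivity (as a routine like hpcmw_mat_con would prepare), each element matrix entry is added into the matching global slot. The function name and CRS handling here are illustrative assumptions, not the HPC-MW API.

```python
# Sketch of assembling a 2x2 element stiffness matrix into a global CRS
# matrix, in the spirit of hpcmw_matrix_assemble (illustrative, not the
# actual HPC-MW interface).
def assemble(values, col_idx, row_ptr, elem_nodes, ke):
    """Add element matrix ke (n x n) into the global CRS arrays."""
    for a, i in enumerate(elem_nodes):        # local row a -> global row i
        for b, j in enumerate(elem_nodes):    # local col b -> global col j
            # locate global entry (i, j) among row i's nonzeros
            for k in range(row_ptr[i], row_ptr[i + 1]):
                if col_idx[k] == j:
                    values[k] += ke[a][b]
                    break
            else:
                raise KeyError(f"entry ({i},{j}) missing from CRS table")

# Global 3x3 matrix whose sparsity already covers an element on nodes 0,1:
row_ptr = [0, 2, 4, 5]
col_idx = [0, 1, 0, 1, 2]
values  = [0.0] * 5

ke = [[ 2.0, -1.0],
      [-1.0,  2.0]]
assemble(values, col_idx, row_ptr, [0, 1], ke)
print(values)  # [2.0, -1.0, -1.0, 2.0, 0.0]
```

The `else` branch of the inner loop plays the role of the format checking mentioned above: an element entry that falls outside the precomputed sparsity pattern is reported rather than silently dropped.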
Calculating the right-hand side vector requires operations such as matrix-vector multiplication and the scalar product, which also operate on the CRS format. HPC-MW therefore provides subroutines for both. A block-matrix scheme is employed, so the matrix-vector multiplication subroutine comes in several variants corresponding to the typical block sizes.
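The core matrix-vector product over CRS data looks like the following sketch (scalar, unblocked; the block variants mentioned above unroll over the fixed block size for performance):

```python
# CRS matrix-vector product y = A x, the operation a subroutine like
# hpcmw_matrix_vector performs (illustrative sketch, unblocked).
def crs_matvec(values, col_idx, row_ptr, x):
    n = len(row_ptr) - 1
    y = [0.0] * n
    for i in range(n):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_idx[k]]
    return y

# 2x2 example: A = [[2, 1], [1, 3]]
row_ptr = [0, 2, 4]
col_idx = [0, 1, 0, 1]
values  = [2.0, 1.0, 1.0, 3.0]
print(crs_matvec(values, col_idx, row_ptr, [1.0, 1.0]))  # [3.0, 4.0]
```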
Table 1: Examples of HPC-MW subroutines

I/O
  hpcmw_get_mesh             Input mesh data
  hpcmw_write_result         Output result data
Construction of the global matrix
  hpcmw_mat_con              CRS table creation
  hpcmw_matrix_allocate      Allocation of matrix memory
  hpcmw_Jacob                Shape function calculation
  hpcmw_matrix_assemble      Assembly of an element matrix into the global matrix
Calculation of the right-hand side vector
  hpcmw_matrix_vector        Matrix-vector multiplication
  hpcmw_vector_innerProduct  Inner product
Solver
  hpcmw_solver_11            Linear solvers (CG, GMRES, BiCGSTAB, etc.)
3 SAMPLE APPLICATION
3.1 Governing equations and algorithms
The governing equations are the Navier-Stokes equations and the continuity equation. P1/P1 finite elements are employed for spatial discretization, and the predictor-multicorrector method is applied for temporal discretization. For stabilization, the SUPG/PSPG method [2] is also employed.
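For reference, the incompressible Navier-Stokes and continuity equations can be written as follows (generic notation assumed here; the paper's exact symbols are not given):

```latex
\frac{\partial \mathbf{u}}{\partial t}
  + (\mathbf{u} \cdot \nabla)\mathbf{u}
  = -\frac{1}{\rho}\nabla p + \nu \nabla^{2}\mathbf{u} + \mathbf{f},
\qquad
\nabla \cdot \mathbf{u} = 0
```

where \(\mathbf{u}\) is the velocity, \(p\) the pressure, \(\rho\) the density, \(\nu\) the kinematic viscosity, and \(\mathbf{f}\) a body force.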
3.2 Utilization of HPC-MW
When using HPC-MW, a developer can parallelize his code without explicitly handling domain decomposition or data structures. As HPC-MW provides numerous subroutines, code can be parallelized simply by adding and combining them. Fig. 2 shows a flowchart of the code developed in the present study.
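The overall flow can be sketched as a call sequence over the Table 1 subroutines. The Python stubs below only record the order of calls; the real HPC-MW routines are library subroutines with their own arguments, so both the loop structure and the signatures here are illustrative assumptions, not the actual API.

```python
# Sketch of the analysis flow: setup, time-stepping with assembly,
# RHS computation and solve, then output. Stubs only log call order.
log = []
def stub(name):
    return lambda *args: log.append(name)

hpcmw_get_mesh        = stub("get_mesh")         # read distributed mesh
hpcmw_mat_con         = stub("mat_con")          # build CRS table
hpcmw_matrix_allocate = stub("matrix_allocate")  # allocate matrix memory
hpcmw_matrix_assemble = stub("matrix_assemble")  # element -> global matrix
hpcmw_matrix_vector   = stub("matrix_vector")    # RHS matrix-vector product
hpcmw_solver_11       = stub("solver")           # parallel linear solve
hpcmw_write_result    = stub("write_result")     # output results

hpcmw_get_mesh()
hpcmw_mat_con()
hpcmw_matrix_allocate()
for step in range(3):            # time-stepping loop (3 steps for demo)
    hpcmw_matrix_assemble()      # (re)assemble stiffness matrix
    hpcmw_matrix_vector()        # form right-hand side
    hpcmw_solver_11()            # solve linear system
hpcmw_write_result()

print(log[:3])  # ['get_mesh', 'mat_con', 'matrix_allocate']
```

Because each of these calls is internally parallelized by the middleware, the developer's driver code contains no message passing at all.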
4 NUMERICAL RESULTS
4.1 PARALLEL EFFICIENCY
The parallel efficiency of the developed code was measured. The simulated model is a lid-driven cavity flow; the number of nodes was 1,000,000 and the number of degrees of freedom was 4,000,000. The specifications of the PC-cluster are shown in Table 2. Fig. 3 shows the speed-up of the solver provided by HPC-MW: about 94% parallel efficiency is achieved with 24 processors for this middle-sized model.
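Parallel efficiency relates the measured speed-up to the ideal linear speed-up: efficiency = (T1/Tp)/p. The timings below are made-up illustrative numbers, not the measurements behind Fig. 3; they merely show how a 94% figure arises.

```python
# Parallel efficiency: the fraction of ideal linear speed-up achieved.
def parallel_efficiency(t_serial, t_parallel, n_procs):
    speedup = t_serial / t_parallel
    return speedup / n_procs

# e.g. a run that is 22.6x faster on 24 PEs has ~94% efficiency
# (hypothetical timings):
eff = parallel_efficiency(t_serial=2260.0, t_parallel=100.0, n_procs=24)
print(round(eff, 3))  # 0.942
```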
Table 2: Specifications of the PC-cluster
  CPU      Xeon 2.8 GHz
  Memory   2 GB
  Network  Myrinet

Figure 3: Speed-up of the solver vs. number of PEs (measured Solver curve against the Ideal linear speed-up, 0-24 PEs)
4.2 DEVELOPMENT EFFICIENCY
Table 3 shows the number of steps (lines of code) in the developed code. The total is 4,250 steps, of which the hand-written main part accounts for 1,800. Thus, about 57% of the steps are covered by HPC-MW.
Table 3: Number of steps in the developed code
             Number of steps   Ratio (%)
  Main part  1,800             42.4
  HPC-MW     2,450             57.6
  Total      4,250             100
5 CONCLUSIONS
- A hardware-independent platform for FEM was designed and developed. Various procedures common to FEM are provided by the platform, and parallelization is achieved automatically.
- An incompressible fluid analysis code was developed using HPC-MW; its subroutines cover about 57% of the steps of the developed code.
- About 94% parallel efficiency was obtained for a middle-sized model run on a PC-cluster.
ACKNOWLEDGEMENTS
This research was performed as part of the "Revolutionary Simulation Software for the 21st Century (RSS21)" project, supported by the next-generation IT program of the Ministry of Education, Culture, Sports, Science and Technology (MEXT).
REFERENCES
[1] http://www.fsis.iis.u-tokyo.ac.jp/en/theme/hpc/
[2] T. E. Tezduyar, “Incompressible flow computations with stabilized bilinear and linear equal-order-interpolation velocity-pressure elements”, Computer Methods in Applied Mechanics and Engineering.