Approach to efficient parallel computing organization
Summary

This article proposes a three-level model for software development. Each level of the model supports the analysis and execution of specific tasks. The third level (the Tasks Performance Layer) is examined in detail: a mathematical approach to efficient parallel computing organization is proposed, and the applicability of existing parallel technologies is analyzed.

Keywords: Modified Model-Driven Software Development, Tasks Performance Layer, parallel computing, mathematical approach

1. Introduction

Modern distributed systems, which operate with different types of terminal equipment in the Internet environment, demand a special methodology for their development and deployment. The Model-Driven Software Development (MDSD) methodology is one of the most widely used today, but it does not provide a formalized approach to correcting errors at the early stages of the software development life cycle, i.e., at the stage of application business modeling.

Recently, to raise the abstraction level of software development environments, a direction of Software Engineering called Model-Driven Engineering (MDE) has been in progress. The Modified Model-Driven Software Development (MMDSD) methodology was previously proposed [4]. MMDSD is a complex approach that complements MDE with steps of early prototyping and testing, and also allows optimizing an abstract-level model using the finite state machine method.

This article describes the further progress of the MMDSD methodology. It is suggested to raise the efficiency of distributed systems using parallelization methods and technologies. Optimal system performance is obtained by using mathematical models to describe the system's parallel threads.

2. MMDSD basic concepts

The approach aims at modeling the software environment for business-process realization. MMDSD consists of three levels of software environment modeling:

1. The main tasks of the Abstract Business-Process (BP) Layer are the analysis of enterprise processes from the application point of view, their abstract description, the development of an abstract business-process model, model analysis using the mathematical basis of finite state machine theory, and assessment of the reasonability of business-process modification.

2. A move from enterprise business-processes to computational business-processes takes place while going from the Abstract BP Layer to the Computational BP Layer. IT engineers take part at this level; they can formalize which abstract business elements could be automated and which information technologies could be used for the realization and management of computational business-processes. A network analysis mechanism is suggested for model optimization at the Computational BP Layer.

3. The Tasks Performance Layer refines the computational business-process model by splitting the tasks into different nodes, each of which is considered a final and relatively independent computational task.

The Abstract BP Layer and the Computational BP Layer were described in detail in [4], so this article focuses on the Tasks Performance Layer.

3. Execution Model of the Tasks Performance Layer

It is necessary to develop a task performance model for efficient functioning of the third level. Each task is analyzed for parallelization. The main analysis criteria are the possibility and reasonability of parallelization with respect to the infrastructure.

It is difficult to develop a new efficient parallel program, or to adapt an existing sequential program for parallel computing, because different approaches to computation organization exist when multiprocessor systems are used. In this case it is important to carry out a structured analysis of the algorithm and detect its inner parallelism. This analysis demands a deep understanding of the task. Sometimes a programmer does not fully understand the depth of the algorithm, while a mathematician cannot correct the program. This challenge can be solved by using an algorithm model that is understandable to both the programmer and the mathematician.

Thus, to raise the efficiency of parallel computing organization, the process of parallel program development is divided into four stages [3]:

1. Mathematical task definition;

2. Structured analysis of the algorithm and organization of the computation scheme;

3. Program coding;

4. Program debugging.

The first and the second stages are considered in more detail below.

Organization of the parallel execution of a task starts with the creation of an algorithmic graph of the computing process, that is, the graph that represents the information dependences of the task execution algorithm:

G = (V, E),

where V = {1, ..., |V|} is the set of graph vertices representing algorithm operations, and E is the set of graph edges specifying the order of the operations.

The graph of the algorithm is acyclic and oriented. In general it is a multigraph, meaning that two vertices can be connected by several edges.

The next step is the conversion of the graph to an exact parallel form. This parallel graph is analyzed using existing algorithms of graph theory: computation of shortest paths in the graph, computation of critical paths, etc.
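As an illustration of these graph-theoretic steps, the following sketch (hypothetical helper names, assuming a unit cost per operation) represents the information-dependence graph as an edge list and computes the length of its critical path, which bounds the parallel execution time from below:

```python
def critical_path_length(n_vertices, edges, op_time=1):
    """Longest path (in operation time) through an acyclic algorithm graph.

    edges is a list of (u, v) pairs meaning operation v depends on u.
    """
    succ = {v: [] for v in range(n_vertices)}
    indeg = {v: 0 for v in range(n_vertices)}
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    # Process vertices in topological order (Kahn's algorithm).
    ready = [v for v in range(n_vertices) if indeg[v] == 0]
    dist = {v: op_time for v in range(n_vertices)}  # every vertex costs op_time
    while ready:
        u = ready.pop()
        for v in succ[u]:
            dist[v] = max(dist[v], dist[u] + op_time)
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    return max(dist.values())

# Diamond-shaped graph: 0 -> {1, 2} -> 3; the critical path has 3 operations.
print(critical_path_length(4, [(0, 1), (0, 2), (1, 3), (2, 3)]))  # 3
```

Even though operations 1 and 2 can run in parallel, the chain 0 → 1 → 3 forces at least three sequential steps; that is exactly what the critical path measures.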

Each task demands an individual analysis.

If the computational system architecture for a specified computational task is given, it is necessary to make a schedule for the parallel algorithm realization. For this, the following set is defined:


Hn = {(i, Pi, ti): i ∈ V},

in which for each operation i ∈ V the processor Pi used for its execution and the beginning time ti of the operation are determined. For the schedule to be realizable, the following requirements should be met:

1. The same processor cannot be assigned to different operations at the same time: ∀ i, j ∈ V: ti = tj ⇒ Pi ≠ Pj.

2. By the assigned beginning time of an operation, all of the necessary data have to be already determined: ∀ (i, j) ∈ E: tj ≥ ti + τ, where τ is the execution time of one operation.

The graph of the algorithm's computational scheme together with the schedule is considered a model of the parallel algorithm realized on n processors:

An(G, Hn).
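A minimal sketch of this schedule model (an illustrative data layout, not the authors' implementation): each operation i is mapped to a pair (Pi, ti), the two feasibility requirements above are checked directly, and the schedule's peak completion time is computed.

```python
def check_schedule(edges, schedule, tau=1):
    """schedule maps vertex -> (processor, start_time); tau is the op time."""
    # Requirement 1: one processor never runs two operations at the same time.
    for i, (p_i, t_i) in schedule.items():
        for j, (p_j, t_j) in schedule.items():
            if i != j and t_i == t_j and p_i == p_j:
                return False
    # Requirement 2: an operation starts only after its inputs are ready.
    for i, j in edges:
        if schedule[j][1] < schedule[i][1] + tau:
            return False
    return True

def makespan(schedule, tau=1):
    """Peak completion time max_i (t_i + tau) over the schedule."""
    return max(t + tau for _, t in schedule.values())

# Diamond graph 0 -> {1, 2} -> 3 on two processors (processor, start time).
sched = {0: (0, 0), 1: (0, 1), 2: (1, 1), 3: (0, 2)}
print(check_schedule([(0, 1), (0, 2), (1, 3), (2, 3)], sched))  # True
print(makespan(sched))  # 3
```

Operations 1 and 2 start simultaneously but on different processors, so requirement 1 holds; every dependent operation starts at least one time unit after its predecessor, so requirement 2 holds as well.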

The next important step is estimation of the computational system's efficiency. But first it is necessary to define the execution time of the parallel algorithm. The execution time of the parallel algorithm is determined as the peak value of time used in the schedule:

Tn(G, Hn) = max_{i ∈ V} (ti + τ).

For a fixed computational graph it is reasonable to find the schedule that provides the minimal execution time of the algorithm:

Tn(G) = min_{Hn} Tn(G, Hn).

For the efficiency evaluation of the parallel realization of the task, the execution time T1 of the sequential realization on one processor should be defined, taking into account all sequential variants of the algorithm:

T1 = min_G T1(G),

where the minimal value is taken over all possible sequential algorithms of the task execution.

The efficiency of the parallel execution of the task is estimated using a special model that depends on the chosen system architecture [6]:

1. Model of a homogeneous computational system without structure. In this case the computational system is considered a set of computational nodes that can realize some general tasks. The structure of the nodes' links is not considered. Often this model is used for homogeneous parallel systems where the nodes are usually simple processors.

The acceleration (speedup) coefficient of the parallel algorithm depends on the execution time of the algorithm on one processor, T1, and the execution time of the parallel algorithm on n processors, Tn:

S = T1 / Tn.

The efficiency coefficient of the homogeneous computational system is defined as

E = S / n.
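These two coefficients can be computed directly; a small sketch with hypothetical timings:

```python
def speedup(t1, tn):
    """Acceleration coefficient S = T1 / Tn of the parallel algorithm."""
    return t1 / tn

def efficiency(t1, tn, n):
    """Efficiency coefficient E = S / n of a homogeneous system."""
    return speedup(t1, tn) / n

# Hypothetical measurements: 12 s sequentially, 4 s on 4 processors.
print(speedup(12.0, 4.0))        # 3.0
print(efficiency(12.0, 4.0, 4))  # 0.75
```

An efficiency of 0.75 means each of the four processors does useful work 75% of the time; the gap to 1.0 reflects dependences and scheduling overhead.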


2. Model of computation nodes. The model of computation nodes allows describing the structure of the nodes' links. A group of computation nodes is a set of software and hardware machines merged into the computational system. The model of computation nodes includes a structure description of the computational system H and a work algorithm of the computation nodes A:

S = <H, A>.

The structure of the computation nodes H is described as

H = <C, G>,

where C = {ci} is the set of computational nodes ci, i = 1…N, and G is the description of the connection structure between the computational nodes.

The structure of the computational nodes is described using a graph:

G = <C, E>,

where the graph vertices represent the computational nodes C and the graph edges E represent the links between the computational nodes.

The work algorithm of the computation nodes A provides consistent work of the computation nodes and the links between them when a general task is realized. The algorithm can be represented as

A = A(P(D)),

where D is the input data for the parallel program P.

The considered models are used for systems with a homogeneous structure of elements. For heterogeneous computational systems, the model of computation nodes is generalized into the model of functional units and the model of distributed systems with a schedule.

3. Model of functional units. In this model the limitations of heterogeneity are overcome by introducing individual characteristics of the computational nodes.

Let the execution time of task A be T and let the task consist of N basic operations; then the real efficiency of the system is defined as

r = N / T.

There is a peak efficiency parameter for this model. Let the peak efficiency of computational node i be πi; then the peak system efficiency is

π = Σ_{i=1}^{n} πi.

On the other hand, if task A is executed k times, then the peak efficiency of the system can be defined as the maximum of the system's real efficiencies:

π = max_k {rk}.

To define the system acceleration coefficient it is necessary to introduce a new parameter, called the system occupancy, which is defined as

P = Σ_{i=1}^{n} αi pi,  αi = πi / π,

where pi is the occupancy of computational node i during a period of time [t1, t2], which depends on the volume of computations V(t) performed by the node:

pi(t1, t2) = (V(t2) − V(t1)) / (πi · (t2 − t1)).

So the acceleration coefficient of a heterogeneous computational system is determined as

S = (Σ_{i=1}^{n} πi pi) / max_i πi.
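A numeric sketch of this heterogeneous acceleration coefficient, S = (Σ πi pi) / max πi, with hypothetical node efficiencies and occupancies:

```python
def hetero_speedup(pi, p):
    """S = (sum_i pi_i * p_i) / max_i pi_i for a heterogeneous system.

    pi: peak efficiencies of the nodes; p: their occupancies on the interval.
    """
    return sum(e * occ for e, occ in zip(pi, p)) / max(pi)

pi = [4.0, 2.0, 2.0]   # hypothetical peak efficiencies of three nodes
p = [1.0, 0.5, 0.5]    # hypothetical occupancies over the measured interval
print(hetero_speedup(pi, p))  # 1.5
```

The fastest node contributes 4.0 units of effective work and the two half-busy slower nodes add 1.0 each; normalizing by the fastest node's efficiency gives an acceleration of 1.5 over running on that node alone.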

4. Model of a system with a schedule. In this version the model of functional units is generalized to the case when functional units are available for the task execution only during some period of time, according to a schedule. The model of a system with a schedule allows describing distributed computational systems that use a batch processing system to distribute access to the computational resources.

In this model a schedule is defined for each computational node: the time function hi(t), where hi(t) = 1 if computational node i is used for computing at the point of time t, and hi(t) = 0 otherwise. The availability of a computational node is the fraction of the time slot [0, T] during which the node is used for the task execution:

ρi(T) = (1/T) ∫_0^T hi(τ) dτ.
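The availability integral can be approximated numerically for any schedule function; a sketch using the midpoint rule (the schedule below is hypothetical):

```python
def availability(h, T, steps=1000):
    """Approximate rho_i(T) = (1/T) * integral_0^T h(t) dt (midpoint rule)."""
    dt = T / steps
    return sum(h((k + 0.5) * dt) for k in range(steps)) * dt / T

# Hypothetical schedule: the node is available during [2, 6] out of [0, 10].
h = lambda t: 1 if 2 <= t <= 6 else 0
print(round(availability(h, 10.0), 3))  # 0.4
```

The node is switched on for 4 of the 10 time units, so its availability is 0.4; for batch-system schedules given as on/off intervals, the integral reduces to summing the interval lengths.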

A new parameter is introduced, called the etalon execution time T̄i > 0 of task A on computational node i, which is obtained by using an etalon algorithm together with the etalon efficiency πi of computational node i:

πi = L / T̄i,

where L is the task laboriousness. The task laboriousness represents our a priori knowledge of the complexity of the task solution (the computational resources needed for it).

The computational system with the schedule is defined as the following set:

R = <π, h(t)>,  π = (π1, ..., πn),  h(t) = (h1(t), ..., hn(t)).

The etalon efficiency of the system with the schedule is determined as the sum of the etalon efficiencies of the computational nodes used for the task execution at the point of time t:

π(t) = Σ_{i=1}^{n} πi hi(t).

The etalon execution time of the task, T̄, is the execution time obtained when all computational nodes run with the etalon efficiency:

T̄ = arg min_t { ∫_0^t π(τ) dτ = L }.

So, the efficiency of system R is defined as the ratio of the etalon execution time to the real task execution time:

E = T̄ / T,

where T is the real execution time of task A.

The acceleration of a parallel system is defined as the ratio of the task execution time on one computational node to the task execution time on the whole system. In a system with a schedule the computational nodes can differ, which is why a relative acceleration parameter is introduced. The acceleration S of system R1 relative to system R2 is defined as the ratio of the task execution times of these systems:

S(R1, R2) = T2 / T1.

The acceleration Si is defined as the ratio of the etalon execution time of the task on computational node i to the task execution time on the whole system:

Si = T̄i / T.

The relative acceleration of the system with the schedule is determined as a vector:

S = (S1, S2, ..., Sn).

For the distributed system R with the schedule the following relation is established:

E = (Σ_{i=1}^{n} ρi / Si)^{-1},  where ρi = ρi(T).
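A numeric sketch of the efficiency relation for the system with the schedule, taking E = (Σ ρi / Si)^{-1} with hypothetical availabilities and accelerations:

```python
def system_efficiency(rho, s):
    """E = 1 / sum_i (rho_i / S_i) for node availabilities rho and
    relative accelerations S of the system over each node."""
    return 1.0 / sum(r / si for r, si in zip(rho, s))

rho = [1.0, 1.0]   # both nodes available throughout the run
s = [2.0, 2.0]     # the system is twice as fast as either node alone
print(system_efficiency(rho, s))  # 1.0
```

With two identical, fully available nodes and a twofold acceleration over each, the relation yields E = 1: the system extracts all of the etalon performance. Lower availabilities or smaller accelerations pull E below 1.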

The optimized computation graph of the parallel task execution is obtained at the end of the mathematical stage. Then the programmer codes a parallel program for the task execution using the defined model of the parallel algorithm.

It is supposed that this approach to separating the mathematical and programming parts allows reducing the execution time of complex tasks considerably. It frees the applied mathematician from the necessity of writing the task execution code. The mathematician can make an algorithm structure analysis, detect the algorithm's parallelism and present it in the form of a model that will be understandable to the programmer. On the other hand, the programmer can write efficient code using just a model of the parallel computing without going into the heart of the task.

4. Analysis of existing parallel technologies and engines

Known parallel algorithms that are adapted for execution in parallel and distributed computational environments can be used at the Tasks Performance Layer. This can be realized by attaching highly optimized multithreaded mathematical operation libraries such as the Math Kernel Library.

The following parallel programming technologies can be used to develop a parallel program for the task execution:

• MPI (Message Passing Interface) – message passing model;

• HPF (High Performance Fortran) – data parallelism model;

• OpenMP – control parallelism model;

• DVM – data and control parallelism model, and others.

The analysis of OpenMP usage for parallelizing a sequential algorithm was made in [1]. Parallel algorithm development was represented and analyzed on the example of parallelizing the solution of a linear equation system by Cramer's method. The process of parallel algorithm development started with analysis of the sequential algorithm.

The solution of a linear equation system by Cramer's method is based on operations with matrices. Repeating the same computational operations for different matrix elements is characteristic of this method. This feature demonstrates the data parallelism in the matrix computing: parallelizing the matrix operations reduces to splitting the processing of matrix elements between threads. Vertical strip division of the data is used for the matrix operations in Cramer's method: a defined subset of columns is assigned to each computing thread. The advantage of this matrix division is that each parallel program thread uses only its own data and does not need data that another thread is processing at the moment.
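The vertical strip division described above can be sketched as follows. This is a pure-Python stand-in for the OpenMP loop (the scaling operation and helper names are illustrative): each worker is assigned a contiguous subset of columns and touches only its own data.

```python
from concurrent.futures import ThreadPoolExecutor

def column_strips(n_cols, n_workers):
    """Split column indices 0..n_cols-1 into n_workers contiguous strips."""
    base, extra = divmod(n_cols, n_workers)
    strips, start = [], 0
    for w in range(n_workers):
        width = base + (1 if w < extra else 0)
        strips.append(range(start, start + width))
        start += width
    return strips

def scale_strip(matrix, cols, factor):
    # Each thread writes only to its own columns -- no shared writes.
    for row in matrix:
        for c in cols:
            row[c] *= factor

matrix = [[1, 2, 3, 4], [5, 6, 7, 8]]
with ThreadPoolExecutor(max_workers=2) as pool:
    for strip in column_strips(4, 2):
        pool.submit(scale_strip, matrix, strip, 10)
print(matrix)  # [[10, 20, 30, 40], [50, 60, 70, 80]]
```

Because the strips are disjoint, no synchronization is needed between workers; this is the property that makes the division efficient in the OpenMP experiment described below. (In CPython the threads do not run numeric code truly in parallel; the sketch only illustrates the data partitioning.)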

This analysis showed that parallelizing a computational algorithm using the OpenMP model, which is optimized for multicore systems, is quite efficient. The experiment illustrated that this allows reducing the average time of the computational process by a factor of 1.8.

Sometimes we deal with large data analysis scenarios in which it is not practical to use a single processor or task to scan a massive dataset or data stream to look for a pattern – the overhead and delay are too great. In these cases, one can partition the data across a large number of processors, each of which analyzes a subset of the data. The results of each "sub-scan" are then combined and returned to the user.

The "map-reduce" pattern is a frequently used technique. It originated at Google, which maintains an internal implementation [2]. MapReduce is a parallel programming model, providing a programming model and runtime system for processing large datasets, and it is based on a simple model with just two key functions, "map" and "reduce", borrowed from functional languages.

The map function applies a specific operation to each item of a set, producing a new set of items; the reduce function performs aggregation on a set of items. The MapReduce runtime system automatically partitions input data and schedules the execution of programs in a large cluster of commodity machines. The system is made fault-tolerant by checking worker nodes periodically and reassigning failed jobs to other worker nodes.
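The map/reduce pattern just described can be shown in miniature. This is an in-process sketch with no cluster, partitioning, or fault tolerance (the word-count example is illustrative, not from the paper): "map" emits key-value pairs and "reduce" aggregates the values per key.

```python
from collections import defaultdict

def map_fn(line):
    """Map phase: emit (word, 1) for every word in the input record."""
    return [(word, 1) for word in line.split()]

def reduce_fn(key, values):
    """Reduce phase: aggregate all counts emitted for one key."""
    return key, sum(values)

def mapreduce(records, map_fn, reduce_fn):
    groups = defaultdict(list)
    for record in records:                # map phase
        for key, value in map_fn(record):
            groups[key].append(value)     # shuffle: group values by key
    return dict(reduce_fn(k, vs) for k, vs in groups.items())  # reduce phase

lines = ["to be or not", "to be"]
print(mapreduce(lines, map_fn, reduce_fn))
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

In a real MapReduce runtime the records, the grouped intermediate pairs, and the reduce calls are distributed across machines; the user still writes only the two functions above.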


Microsoft presents the Dryad [5] engine for large-scale data-parallel computing. It is a mechanism to support the execution of distributed collections of tasks that can be configured into an arbitrary directed acyclic graph (DAG).

Figure 2. Dryad system architecture [5]

NS is the name server, which maintains the cluster membership. The job manager is responsible for spawning vertices (V) on available computers with the help of a remote-execution and monitoring daemon (PD). Vertices exchange data through files, TCP pipes, or shared-memory channels. The grey shape in Figure 2 indicates the vertices in the job that are currently running and their correspondence with the job execution graph.

The major contributions of the MapReduce model are a simple and powerful interface that enables automatic parallelization and distribution of large-scale computations, combined with an implementation of this interface that achieves high performance on large clusters of commodity PCs. A key difference between existing parallel technologies such as MPI and MapReduce is that MapReduce exploits a restricted programming model to parallelize the user program automatically and to provide transparent fault tolerance. So, in spite of implementing a very restricted execution engine and programming model, MapReduce has proved influential due to its simplicity and scalability at low cost.

Dryad is an execution engine that allows more general execution plans than MapReduce, and it is therefore more efficient for some applications. The fundamental difference between the two systems is that a Dryad application may specify an arbitrary communication DAG rather than requiring a sequence of map/distribute/sort/reduce operations. In particular, graph vertices may consume multiple inputs and generate multiple outputs of different types. For many applications this simplifies the mapping from algorithm to implementation, allows building on a greater library of basic subroutines and, together with the ability to exploit TCP pipes and shared memory for data edges, can bring substantial performance gains. At the same time, the Dryad implementation is general enough to support all the features provided by the MapReduce engine.


5. Summary

The suggested three-level model for software development allows detailing the process and making a complete analysis at each level. This makes it possible to raise information processing quality and performance. Using parallel computing at the third level increases the computing rate of the separate tasks and the total functional efficiency of the whole information system.

The suggested approach of separating the mathematical analysis, which consists of the parallel algorithm development, from the programming part allows reducing the execution time of complex tasks and consequently raising the performance of the whole system.

Sometimes it is efficient to use existing parallel programming technologies, to attach highly optimized multithreaded mathematical libraries, or to use engines for large dataset analysis. The choice depends on the scale of the data to be processed, the given computational resources, time requirements, and other specified demands.

Bibliography

[1] Alieksieiev M., Iermakova K., Kushnyr V., Analysis of Mathematical Algorithms Paralleling Efficiency Using Parallel Programming Technology, CADSM 2009: International Conference “The experience of designing and application of CAD systems in microelectronics”, Lviv, Ukraine, 2009: p. 180–181.

[2] Dean J., Ghemawat S., MapReduce: Simplified Data Processing on Large Clusters, OSDI’04: Sixth Symposium on Operating Systems Design and Implementation, San Francisco, CA, 2004: p. 137–150.

[3] Gergel V., Fursov V., Parallel computing, Samara: pub. house SNAU, 2009: p. 164.

[4] Globa L., Modified model driven software development, Polish J. of Environ. Stud, Vol. 18, No. 4, 2009: p. 39–43.

[5] Isard M., Budiu M., Yu Y., Birrell A., Fetterly D., Dryad: Distributed data-parallel programs from sequential building blocks, EuroSys’07: European Conference on Computer Systems, 2007: p. 59–72.

[6] Khritankov A., Models and algorithms of load distribution, Russian J. “Information technologies and computational systems”, No. 2, 2009: p. 65–80.


APPROACH TO EFFICIENT PARALLEL COMPUTING ORGANIZATION

Summary

The article presents a three-level model of software development. Each level of the model provides for special analysis of tasks and of their performance. The third level (the tasks performance layer) is considered in more depth. A mathematical approach to the efficient organization of parallel computing is proposed. The use of existing parallel technologies is also analyzed.

Keywords: Modified Model-Driven Software Development, tasks performance layer, parallel computing, mathematical approach

Larysa Globa Kateryna Iermakova

Institute of Telecommunication Systems

National Technical University of Ukraine “Kyiv Polytechnic Institute”
Ukraine, 03056, Kyiv, Industrialnyy side street, 2, off. 316
