
V-cal: A Calculus for the Compilation of Data Parallel Languages

P.F.G. Dechering, J.A. Trescher, J.P.M. de Vreught, H.J. Sips

Delft University of Technology, Faculty of Applied Physics, The Netherlands

Email: BoosterTeam@cp.tn.tudelft.nl

Abstract. V-cal is a calculus designed to support the compilation of data parallel languages that allows program transformations and optimizations to be described as semantics-preserving rewrite rules. In V-cal the program transformation and optimization phase of a compiler is organized in three independent passes: in the first pass a set of rewrite rules is applied that attempts to identify the potential parallelism of an algorithm. In the second pass program parts amenable to parallelization or algorithm substitution are replaced by their semantically equivalent parallel counterparts. Finally, a set of rules is applied that maps the parallelized program to the target architecture in a way that makes efficient use of the given resources.

Data parallel languages provide a programming model that abstracts from parallelism, communication, and synchronization. To be able to express optimizing transformations in V-cal, parallelism, communication, and synchronization are made explicit by means of dedicated operators that represent these concepts in a machine independent way. In this paper we describe the operators and transformation rules that implement the first two passes of a V-cal engine. We show that our approach leads to a flexible compiler design that allows experimentation with different optimization strategies and heuristics.

1 Introduction

Traditional compilers for data parallel languages advocate the `one tool does all' approach: parsing, optimizing, and code generation are strongly interwoven and often hard-coded in the compiler. In the compiler that we have constructed, we have taken a different approach in which all these phases are handled by separate tools.

There are several advantages of this approach over the `one tool does all' approach:

- The different tools are easier to develop, to maintain, and to extend because each does just a single job. There is no interweaving with other tools.


- It is easier to replace the individual tools. A change in the front end means that a different language can be translated. Choosing a different target machine means that the back end must be changed.

The disadvantage of our approach is a slight drop in compiler speed due to the overhead of running several tools: each tool has to parse its input and generate its output. In our opinion the flexibility of the new approach outweighs the minor decrease in speed.

Fig. 1. Overview of the Booster compiler: the input program is parsed by the front end, which produces V-nus; the engine transforms the V-nus program; the back end generates target code.

Our experimental compiler consists of three tools (see Figure 1) [Trescher 94]: a front end that parses the input program, an optimizer that operates on an intermediate language, and a back end that generates target code. Our input language is an experimental data parallel language called Booster [Breebaart 95, Paalvast 89].

The front end only parses the input program and checks the static semantics of the program. The front end does not try to do any optimizations. The output of the front end is a program in the intermediate language V-nus.

Since V-nus is the interface between all tools, we have several requirements for such an intermediate language:

- V-nus is self-contained. A V-nus program is the only information that is exchanged between the different tools.

- V-nus must be suitable for the definition of a calculus.

- V-nus must be expressive enough to describe all constructs of the input language and of the target language. Furthermore, V-nus must be able to describe the topology at a high level.

- Since all tools must parse or generate V-nus code, the language must be extremely simple, so that only little time is lost while parsing and generating V-nus code.


Fig. 2. The V-cal engine: (a) the engine reads a V-nus program together with the V-nus grammar and the V-cal rules, and produces a modified V-nus program; (b) the engine realized as a pipeline of tools (Tool 1, Tool 2, ..., Tool n).

We have defined a calculus V-cal on V-nus programs that tries to improve the performance of V-nus programs. The calculus consists of transformation rules that are applied by an engine to the nodes of the parse tree of a V-nus program (see the left part of Figure 2). The conditions under which the application of a transformation rule is allowed can depend on tree patterns, data flow analysis, and heuristics. Both the input and the resulting output are valid V-nus programs.

An advantage of this approach is that the optimizing engine can be implemented by a set of dedicated tools. As illustrated by the blow-up in the right part of Figure 2, each of these tools reads a V-nus program, performs a particular type of optimization, and then generates a modified V-nus program. The final V-nus program resulting from this process forms the input of the back end. Its job is to make a direct translation from V-nus to the target code. Just like the front end, it does not do any optimizations.

Before we get into the details of V-nus and V-cal, we first discuss related work such as Fidil, F-Code, and the Psi-calculus. The emphasis of the paper is on the transformation rules found in V-cal. There are three kinds of transformation rules in V-cal: a set of parallelism identifying rules, a set of communication handling rules, and a set of topology dependent rules. In the final section we discuss our approach.

2 Related Work

Research on compilers that generate distributed code from sequential or data parallel algorithms concentrates on problems in the area of scientific computing, where shared arrays have to be distributed among the private memories of different machines. In particular, methods that extend Fortran [ISO/IEC 91] with array constructs and means to specify a decomposition of an array have been studied extensively [Callahan 88] [Andre 90] [Hiranandani 92] [Zima 90].


While these approaches promise immediate advances of the state of practice, it is also acknowledged that many algorithms described in a few pages of mathematics often result in large Fortran programs, which makes it difficult to experiment with simple variations of the algorithm or to port such a program to a variety of architectures [Semenzato 90]. Several experimental approaches have been proposed that address these problems by more fundamental algorithmic, semantical, or linguistic considerations [Jesshope 93] [Hilfinger 93] [Mullin 93].

One characteristic of these approaches is to extend the usual notion of array types to provide arrays with more general index sets, e.g. maps in Fidil, selective assignments in F-Code, Psi reductions in the Psi-calculus, and the view concept of Booster [Paalvast 89]. The second characteristic is a formal approach to the definition of the language semantics and compilation process.

F-Code [Jesshope 93] is a formally defined language that was developed to represent the semantics of data parallel processing, as well as the data management and control primitives of imperative languages. F-Code was designed as an intermediate language for compilers, acting as a portable software platform with the purpose of fostering an architecture independent programming model. In contrast to our work on Booster and V-cal, which tries to facilitate research on novel programming paradigms and associated compiler technology, F-Code supports the rapid implementation of data parallel languages.

Like Booster, Fidil [Hilfinger 93] extends the semantic domain of Fortran-like algebraic languages with facilities for construction, composition, and refinement of index sets (called domains) and for performing computations on functions defined over these domains (called maps). Additionally, it provides capabilities for defining operators and for performing higher order operations on functions.

Fidil is an attempt to automate much of the routine book-keeping that forms a large part of many programs involving, for instance, PDEs, and to bring the semantic level of these programs closer to that at which the algorithms are originally conceived. However, the versatile nature of maps, like that of views in Booster, is a potential source of inefficiency. A set of techniques to reduce or eliminate it has been described in [Semenzato 90], and it would be interesting to obtain performance figures on realistic programs that evaluate these techniques.

In [Mullin 93] a calculus on arrays, called the Psi-calculus, is given that is capable of transforming a high level single massively parallel operation on arrays into a low level version optimized for a given parallel architecture. The Psi-calculus is, just like V-cal, based on operations acting on the indexing scheme of arrays. Given a data structure, a communication pattern, and a topology of the parallel machine, the Psi-calculus is able to compute an efficient memory mapping with respect to some cost function.

In V-cal this information is in general not known at compile time. In some cases we will need to resort to default parameters to compute a mapping. The Psi-calculus could be embedded as an auxiliary calculus at the lowest level of V-cal. In such a framework V-cal would be used for the higher level constructs (like control flow) and for establishing the parameters needed for the Psi-calculus,


which in turn could compute a good mapping scheme.

3 The calculus V-cal

The view calculus V-cal came into existence with the creation of the language Booster. V-cal is intended to be a calculus for supporting the compilation of a high level data parallel language like Booster. We have chosen to use an intermediate language, called V-nus, to express the program patterns on which the calculus works. The calculus consists of a set of rewrite rules for V-nus. The left-hand side specifies a program pattern while the right-hand side defines its replacement. We say that a rule matches a program construct if the rule is defined for this program construct. If a rule matches a program construct it replaces this construct by that of the right-hand side. Of course the program, after rewriting, is semantically equivalent to the original program.

V-cal can be divided into three classes of rewrite rules. The first class is the set of rules that is initially used when V-cal is applied to a V-nus program. This set of parallelism identifying rules is used to rewrite program constructs such that parallelism can be improved. Construct substitutions and loop distributions are examples of this class of rules. The second class of rules in V-cal handles the use of communication statements. These communication handling rules introduce, move, or remove statements needed for the transport of data. At this level data distribution information is not used, and therefore the communication statements only specify at what point in the program data is needed, synchronization is required, or a redistribution has to be performed. Finally, in the topology dependent rules information can be used about the ownership of data, the data distribution, and the topology. In this paper we focus on the first and second class of V-cal rules. Current research is focused on investigating V-cal at the third level.

A calculus specification consists of a language definition (here, V-nus) and the definition and use of rewrite rules (here, the V-cal rules), which we introduce in the following sections. We aim at an engine that reads a program and a specification of a calculus, and produces a modified program. Note that the definition of the language is an integral part of the specified calculus. This is depicted in Figure 2 of the previous section.

3.1 Definition of V-nus

V-nus is a framework for the denotational semantics of programs written in high level programming languages, such as Booster or Fortran. Each V-nus expression can be represented by a function on states in the denotational semantics [Dechering 95]. Once a program is converted into V-nus we can apply the rules of the calculus V-cal in order to obtain a more efficient V-nus program. We aim at compilation for an SPMD machine and therefore the rules of V-cal focus on the effective exploration of potential parallelism. It is not necessary to have a notion of parallelism in the program that was the source of the V-nus program.


We will now demonstrate the syntax and semantics of the intermediate language V-nus with an example. From the set of high level programming languages we are at the moment only able to compile Booster programs to V-nus. Therefore we present the semantics of V-nus by showing how to translate a Booster program to V-nus. The syntax of V-nus corresponds to data structures of the functional language Miranda [Turner 85]. Suppose we have the following assignment in Booster:

A [i:0..2] := 7+13;

We assume A to be a one-dimensional array. This statement assigns the value 7 + 13 to the elements A [0], A [1] and A [2]. Translating this to V-nus we get:

(s, iteration [(i,3)] [(s', assignment (A, [i]) (7,+,13))])

A statement in the V-nus language is represented by a tuple consisting of a statement handle (s) and a description of the action (iteration ...). The V-nus representation of the Booster statement is denoted as an iteration of an assignment. The assignment uses a constructor with two items. The first item represents the structure ((A, [i])) for which the assignment must be performed. The second one is the expression ((7,+,13)) that is assigned.
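To make the tuple encoding concrete, the statement above can be written as nested Python tuples and executed naively. This is our own illustrative encoding and toy interpreter, not part of the Booster toolchain; it handles only the single-dimension iteration and constant addition needed for this example:

```python
# The Booster line  A[i:0..2] := 7+13;  as a (handle, action) tuple:
stmt = ("s", ("iteration", [("i", 3)],
              [("s'", ("assignment", ("A", ["i"]), (7, "+", 13)))]))

def interpret(stmt, env, store):
    """Naively execute the tuple form against a dict-based store."""
    handle, action = stmt
    if action[0] == "iteration":
        (var, bound), = action[1]          # single-entry cardinality list
        for v in range(bound):
            for s in action[2]:
                interpret(s, dict(env, **{var: v}), store)
    elif action[0] == "assignment":
        (name, idx), (lhs, op, rhs) = action[1], action[2]
        value = {"+": lhs + rhs}[op]       # constant expressions only here
        store[(name, tuple(env[i] for i in idx))] = value

store = {}
interpret(stmt, {}, store)
print(store)   # {('A', (0,)): 20, ('A', (1,)): 20, ('A', (2,)): 20}
```

Running it assigns 20 to A[0], A[1], and A[2], matching the Booster semantics above.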

Several statements that occur in a sequence in a Booster program will result in a statement list in V-nus. Consider the following Booster program:

V{i:n} <- A[i,i];
V ||= 1;
B[i:1..m-1] := B[i-1] + B[i+1];
ITER i OVER 3 DO
  P(A,i);
END;

The symbol `<-' denotes that the corresponding statement is a view statement. This means that the elements V[0],...,V[n-1] are references to A[0,0],...,A[n-1,n-1] respectively. The symbol `||=' denotes a parallel assignment. The assignment is performed in such a way that no element is used as a target before it is used as a source. The symbol `:=' denotes a sequential assignment. Such an assignment is performed in a predefined order of the normalized index space. In this case a lexicographical order is used. The compiler will try to convert this such that as much parallelism as possible is incorporated. Furthermore, we made the following assumptions: n and m are defined integers, A is declared as an array of dimension n by n, B is declared as an array of dimension m, and P is some procedure having two formal arguments.

The next statement list is an example of the V-nus representation of the above program fragment.

[(s1, view [(i,n)] V (A, [i,i])),
 (s2, forall [(i,n)]
   [(s21, assignment (V, [i]) 1)]),
 (s3, iteration [(i,(m,-,2))]
   [(s31, assignment (B, [(i,+,1)]) ((B, [i]),+,(B, [(i,+,2)])))]),
 (s4, iteration [(i,3)]
   [(s41, procedurecall P ([(k,n)],(A, [k])) i)])]

Note that the cardinality list (for instance, the list [(i,(m,-,2))]) is normalized such that the index space it denotes starts at zero.

3.2 Effects of using V-cal

Based on the language V-nus, a set of transformation rules can be used to replace certain program constructs by semantically equivalent program constructs. The calculus V-cal consists of a set of transformation rules and a strategy that prescribes the use of the rules. We will illustrate the use of some V-cal rules of the first class, the parallelism identifying rules, by a small demonstration, based on an example from [Zima 90], where sequential code is transformed to code suited for parallelism. Below, the intermediate results are presented after each application of a transformation rule. The Booster program we start with is:

ITER i OVER 100 DO
  x := 5+i;
  A[x] := B[x+1] + C[x];
  E[i] := F[i+1] * A[x];
END;

Translating this to V-nus we obtain the following statement list:

[(s1, iteration [(i,100)]
  [(s11, assignment (x, []) (5,+,i)),
   (s12, assignment (A, [x]) ((B, [(x,+,1)]),+,(C, [x]))),
   (s13, assignment (E, [i]) ((F, [(i,+,1)]),*,(A, [x])))])]

Applying scalar forward substitution to this V-nus program we get:

[(s1, iteration [(i,100)]
  [(s11, assignment (x, []) (5,+,i)),
   (s12, assignment (A, [(5,+,i)]) ((B, [((5,+,i),+,1)]),+,(C, [(5,+,i)]))),
   (s13, assignment (E, [i]) ((F, [(i,+,1)]),*,(A, [(5,+,i)])))])]

If we consider this as the whole program we may apply useless code elimination. The result is:

[(s1, iteration [(i,100)]
  [(s12, assignment (A, [(5,+,i)]) ((B, [((5,+,i),+,1)]),+,(C, [(5,+,i)]))),
   (s13, assignment (E, [i]) ((F, [(i,+,1)]),*,(A, [(5,+,i)])))])]

The expression evaluator can reduce some expressions at compile time such that the program may be replaced by:


[(s1, iteration [(i,100)]
  [(s12, assignment (A, [(5,+,i)]) ((B, [(6,+,i)]),+,(C, [(5,+,i)]))),
   (s13, assignment (E, [i]) ((F, [(i,+,1)]),*,(A, [(5,+,i)])))])]

By using data dependence information we can perform a loop distribution with the following result:

[(s1, iteration [(i,100)]
  [(s12, assignment (A, [(5,+,i)]) ((B, [(6,+,i)]),+,(C, [(5,+,i)])))]),
 (s2, iteration [(i,100)]
  [(s13, assignment (E, [i]) ((F, [(i,+,1)]),*,(A, [(5,+,i)])))])]

Again using data dependence information, we can replace both loops by parallel loops, so that we end up with:

[(s1, forall [(i,100)]
  [(s12, assignment (A, [(5,+,i)]) ((B, [(6,+,i)]),+,(C, [(5,+,i)])))]),
 (s2, forall [(i,100)]
  [(s13, assignment (E, [i]) ((F, [(i,+,1)]),*,(A, [(5,+,i)])))])]
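Two of the steps above, scalar forward substitution and the expression evaluator, can be sketched on tuple-shaped expression trees. This is a toy sketch under our own encoding, not the authors' implementation; the re-association case that turns ((5,+,i),+,1) into (6,+,i) is included explicitly:

```python
def substitute(expr, name, replacement):
    """Scalar forward substitution: replace a scalar name in a tuple tree."""
    if expr == name:
        return replacement
    if isinstance(expr, tuple) and len(expr) == 3:
        return (substitute(expr[0], name, replacement), expr[1],
                substitute(expr[2], name, replacement))
    return expr

def fold(expr):
    """Expression evaluator: reduce constant (sub)expressions at compile
    time, re-associating (c1 + e) + c2 into (c1+c2) + e where possible."""
    if not (isinstance(expr, tuple) and len(expr) == 3):
        return expr
    l, op, r = fold(expr[0]), expr[1], fold(expr[2])
    if isinstance(l, int) and isinstance(r, int):
        return {"+": l + r, "*": l * r}[op]
    if (op == "+" and isinstance(r, int) and isinstance(l, tuple)
            and l[1] == "+" and isinstance(l[0], int)):
        return (l[0] + r, "+", l[2])
    return (l, op, r)

# B[x+1] with x := 5+i becomes B[(5+i)+1], which the evaluator folds:
print(fold(substitute(("x", "+", 1), "x", (5, "+", "i"))))   # (6, '+', 'i')
```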

3.3 The specification of transformation rules

In this section we describe some of the V-cal rules. In order to apply rules to a V-nus program we need some mechanism to express how these rules are applied. Therefore we introduce the notion of a `strategy'. The description of a strategy prescribes which transformation rules will be used and how they are applied (order, kind of tree-walk, etc.). Most of the work on this strategy is still ongoing, and we hope to draw inspiration from the vast body of work that has been done on attribute grammars, pattern matched program transformations, and rewrite systems.

In our opinion a strategy for V-cal encompasses the following elements: tree traversals, rule selections, and matching criteria. The engine walks several times through the tree as described by the tree traversals (like the traversals used in attribute grammars [Deransart 88]): e.g. in depth first order, in breadth first order, in a (static) sweep, or in a (dynamic) visit. Each walk/pass through the tree can use a different traversal. Depending on the pass, only a certain selection of the rules will be candidates for application to the tree nodes. Our engine tries to compute a closure of those rules when applied to the nodes of the tree. The actual application of a rule can be restricted by matching criteria. Besides requiring that the node matches the pattern of the rule, we can also limit the number of times a rule is applied to a node and the number of nodes to which a rule may be applied.
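The closure computation can be sketched as a small rewriting engine. This is a Python sketch under our own assumptions (the real engine is written in TXL, see Section 4); rules are modelled as functions that return their argument unchanged when they do not match:

```python
def rewrite(node, rules):
    """Depth-first (bottom-up) pass that applies each rule at every node
    until no rule changes the node any more, i.e. a closure is reached."""
    if isinstance(node, tuple):
        node = tuple(rewrite(child, rules) for child in node)
    elif isinstance(node, list):
        node = [rewrite(child, rules) for child in node]
    changed = True
    while changed:
        changed = False
        for rule in rules:
            new = rule(node)
            if new != node:
                node, changed = new, True
    return node

# Example rule: constant-fold additions written as (l, '+', r).
fold_add = lambda n: (n[0] + n[2]
                      if isinstance(n, tuple) and len(n) == 3 and n[1] == "+"
                      and isinstance(n[0], int) and isinstance(n[2], int)
                      else n)
print(rewrite((1, "+", (2, "+", 3)), [fold_add]))   # 6
```

Matching criteria such as per-node application limits would be extra bookkeeping around the inner loop; they are omitted here for brevity.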

In this paper we concentrate on the way transformation rules can be defined in V-cal. Furthermore, the implementation of V-cal needs data dependence information. Techniques to compute this kind of information are described in [Li 94, Zima 90] and are outside the scope of this paper.

The transformation rules can be divided into three classes, as explained in the introduction of Section 3. Examples of the first class have been presented in Section 3.2. We start by showing how we can incorporate transformation rules like `loop distribution' and `construct substitution' in V-cal. Consider the loop distribution function LD : Statements → Statements defined as:

LD((s, iteration cardinalities statements)) ⇒
    (s', iteration cardinalities block1), (s'', iteration cardinalities block2)
    if Distributive(cardinalities, block1, block2)(ddi)    (1)

The italic items like cardinalities, statements, etc. (except for the function name) are variables representing all possible instantiations at that place. These are defined by the grammar describing V-nus [Dechering 95]. Here statements is assumed to be the concatenation of block1 followed by block2. Note that splitting statements into two consecutive blocks such that an optimal loop distribution can be performed may depend heavily on the order of the statements represented by statements. This desirable order can be achieved by applying V-cal rules that reorder statement lists. The function Distributive determines whether the given loop distribution is semantically valid or not. For this computation the data dependence information is needed, which is represented by the variable ddi (see also [Zima 90]).
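Rule (1) can be mimicked on the tuple encoding used earlier. In this Python sketch the split point and the Distributive check (which in reality consults ddi) are supplied by the caller, and the primed handles s' and s'' are generated by suffixing:

```python
def loop_distribution(stmt, split, distributive):
    """Rule (1): split the loop body into block1/block2 and emit two
    loops, provided the Distributive predicate holds; otherwise leave
    the statement unchanged."""
    handle, (kind, cardinalities, statements) = stmt
    block1, block2 = statements[:split], statements[split:]
    if kind != "iteration" or not distributive(cardinalities, block1, block2):
        return [stmt]                       # rule does not apply
    return [(handle + "'",  ("iteration", cardinalities, block1)),
            (handle + "''", ("iteration", cardinalities, block2))]

loop = ("s1", ("iteration", [("i", 100)], ["s12-body", "s13-body"]))
always = lambda cards, b1, b2: True         # stand-in for the real ddi check
print(loop_distribution(loop, 1, always))
```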

In the same way we can define a transformation rule that replaces an iteration loop with a forall loop. The construct substitution function CS : Statement → Statement performs such a replacement in the following way:

CS((s, iteration cardinalities statements)) ⇒ (s, forall cardinalities statements)
    if not (DD(cardinalities, statements, statements)(ddi) and
            DU(cardinalities, statements, statements)(ddi))    (2)

Here the functions DD and DU determine for the given two statement lists whether a define-define dependence or a define-use dependence exists, respectively. As with the loop distribution rule, these dependencies can easily be computed using the scalar analysis information. For ease of presentation we overload the DD and DU functions such that cardinalities is not always needed and statements may also be statement handles.
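Rule (2) has the same tuple-level shape; in this sketch the DD and DU dependence tests are caller-supplied stand-ins, and the guard mirrors the rule text verbatim:

```python
def construct_substitution(stmt, dd, du):
    """Rule (2): turn an `iteration` into a `forall` when the guard
    not (DD(...) and DU(...)) holds."""
    handle, (kind, cardinalities, statements) = stmt
    if kind == "iteration" and not (dd(cardinalities, statements, statements)
                                    and du(cardinalities, statements, statements)):
        return (handle, ("forall", cardinalities, statements))
    return stmt

loop = ("s1", ("iteration", [("i", 100)], ["body"]))
no_dep = lambda cards, a, b: False          # stand-in dependence tests
print(construct_substitution(loop, no_dep, no_dep))
# ('s1', ('forall', [('i', 100)], ['body']))
```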

As said before, the second class of transformation rules defines where some kind of communication is needed in order to gain parallelism. This kind of parallelism takes care of a correct use of the data needed for a certain computation. Using this class of rules we abstract from the information about the distribution of data. Also the rule that determines the tasks for the individual processors, called a computes-rule, is not available to these communication handling rules. In the third class of V-cal rules the data distribution is known and information can be obtained about the owner of certain data. Based on the ownership of data the computes-rule can be specified. For instance, we may specify the `owner computes-rule'. This means that if the data on the left-hand side of an assignment is owned by process x, then the computation of the right-hand side will be performed by process x. So we come up with the following communication primitives for the communication handling rules:

- want sh ds. This statement denotes that the data structure ds is needed for the execution of the statement indicated by sh. In the third level of V-cal this statement is translated to `send' and `receive' primitives depending on the ownership of the wanted data.

- synchronize sh ds. This statement is needed to indicate that the data structure ds is changed by statement sh. A synchronize is likewise translated to `send' and `receive' primitives in the third level of V-cal once the computes-rule is defined.

- redistribute dh. When redistribution of data is needed, this statement can be used to specify which data structure needs to be redistributed.

Here `sh', `ds', and `dh' stand for `statement handle', `data structure', and `data structure handle' respectively. The communication primitives are introduced when the basic statements of V-nus are processed, i.e. the assignment and the view. The communication insertion function CI : Statements → Statements defines the V-cal rule that inserts the mentioned communication primitives as follows:

CI((s, view cardinalities dh rhs)) ⇒
    (s, view cardinalities dh rhs), (s1, redistribute dh)
CI((s, assignment lhs rhs)) ⇒
    (s1, want s ds1), ..., (sn, want s dsn),
    (s, assignment lhs rhs), (sm, synchronize s lhs)
    where the data structures ds1, ..., dsn are used in rhs.    (3)
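The assignment case of rule (3) can be sketched as follows. The traversal of the right-hand side and the fresh-handle generator are our own assumptions; the sketch emits one want per structure read, the assignment itself, and a synchronize on the target:

```python
from itertools import count

def used_structures(expr):
    """Collect the (name, index-list) structures read by an expression."""
    if isinstance(expr, tuple) and len(expr) == 3 and expr[1] in ("+", "-", "*", "/"):
        return used_structures(expr[0]) + used_structures(expr[2])
    if isinstance(expr, tuple) and len(expr) == 2:
        return [expr]
    return []

def communication_insertion(stmt, fresh):
    """Rule (3), assignment case: wants for every ds read by rhs, then
    the assignment, then a synchronize on lhs."""
    handle, action = stmt
    if action[0] != "assignment":
        return [stmt]
    lhs, rhs = action[1], action[2]
    wants = [(fresh(), ("want", handle, ds)) for ds in used_structures(rhs)]
    return wants + [stmt, (fresh(), ("synchronize", handle, lhs))]

tick = count(1)
fresh = lambda: "t%d" % next(tick)
s12 = ("s12", ("assignment", ("A", ["x"]),
               (("B", ["x+1"]), "+", ("C", ["x"]))))
for s in communication_insertion(s12, fresh):
    print(s)
```

For s12 this prints two want statements (for B and C), the assignment, and a synchronize for A, in that order.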

The next set of rules we need consists of rules that move the communication primitives. An improvement in parallel efficiency can be achieved by reducing the number of communications without reducing the grain of parallelism. For instance, we can lift the want statements out of a forall loop and execute them before entering the loop. The synchronize statements can be executed when the loop has finished. The V-cal rule for a communication lift CL : Statements → Statements does the job.

CL((s1, forall cardinalities statements)) ⇒
    (s0, want s d'), (s1, forall cardinalities statements')
    if Occurs((s0, want s d), statements)
    where statements' = statements \ (s0, want s d) and d' = cardinalities d.    (4)

The variable statements' represents the statement list statements without the statement want s d. The data structure d' represents all elements of d that are referenced by cardinalities. The function Occurs checks whether the first argument appears in the second. The synchronize statements can be moved backwards in the same way, such that CL is also defined as


CL((s1, forall cardinalities statements)) ⇒
    (s1, forall cardinalities statements'), (s2, synchronize s d')
    if Occurs((s2, synchronize s d), statements)
    where statements' = statements \ (s2, synchronize s d) and d' = cardinalities d.    (5)

When a synchronize statement or a want statement is found in a forall loop it can unconditionally be moved outside the loop. The semantics of an iteration loop requires an extra check on dependencies between the statements in the loop before a communication statement can be put before or after the loop. Suppose a communication statement c within an iteration loop is necessary for the transport of some data structure d. Say d is changed by a statement s. Then c may be placed outside the loop if d will not be used by another statement (instance) in the loop. The rule relocating such communication statements must therefore check for a define-use dependence as follows:

CL((s1, iteration cardinalities statements)) ⇒
    (s0, want s d'), (s1, iteration cardinalities statements')
    if Occurs((s0, want s d), statements)
    and there does not exist an s' ∈ statements such that DU(cardinalities, s, s')(ddi)
    where s as well as s' are statement handles,
    statements' = statements \ (s0, want s d), and d' = cardinalities d.    (6)

In the same way we can define the transformation for placing a synchronize after the iteration loop has finished.

CL((s1, iteration cardinalities statements)) ⇒
    (s1, iteration cardinalities statements'), (s2, synchronize s d')
    if Occurs((s2, synchronize s d), statements)
    and there does not exist an s' ∈ statements such that DU(cardinalities, s, s')(ddi)
    where s as well as s' are statement handles,
    statements' = statements \ (s2, synchronize s d), and d' = cardinalities d.    (7)
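Rules (4) to (7) can be combined into one sketch: wants are hoisted before the loop and synchronizes pushed after it, unconditionally for a forall and guarded by a define-use test for an iteration. The `du` predicate stands in for DU(...)(ddi), and pairing the cardinality list with the data structure plays the role of d'; both are our own simplifications:

```python
def communication_lift(stmt, du):
    """Rules (4)-(7): move want statements before a loop and synchronize
    statements after it, where the dependence check permits."""
    handle, (kind, cardinalities, body) = stmt
    before, after, kept = [], [], []
    for h, action in body:
        tag, target = action[0], action[1]
        movable = kind == "forall" or not du(cardinalities, target)
        if tag == "want" and movable:
            before.append((h, ("want", target, (cardinalities, action[2]))))
        elif tag == "synchronize" and movable:
            after.append((h, ("synchronize", target, (cardinalities, action[2]))))
        else:
            kept.append((h, action))
    return before + [(handle, (kind, cardinalities, kept))] + after

loop = ("s1", ("iteration", [("i", 100)],
               [("t1", ("want", "s11", ("B", ["i"]))),
                ("s11", ("assignment", ("A", ["i"]), ("B", ["i"]))),
                ("t2", ("synchronize", "s11", ("A", ["i"])))]))
no_du = lambda cards, sh: False             # s11 has no define-use dependence
for s in communication_lift(loop, no_du):
    print(s)
```

The printed result has the shape of the final listing in Section 3.4: the want before the loop, the bare assignment inside, and the synchronize after the loop, each now ranging over the whole cardinality list.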

Note that the define-use dependence check may be too restrictive for some loops. In case the `define' and the `use' are carried out by the same process, it is possible to place the communication statements outside the loop. However, the function CL then has to check for the owner of the specific data. Such program transformations will not be performed at this level of the calculus.

We are aiming at large program pieces that can execute in parallel without any interaction. In order to get this we want to use rules that separate a want-synchronize pair as far as possible. So, next we present how want and synchronize statements may skip over statements. For now, we only focus on an upward move of the want statements in the V-nus program, and a downward move of the synchronize statements. We define the communication move rule CM : Statements → Statements as follows:

CM((s1, statement), (s2, want s d)) ⇒ (s2, want s d), (s1, statement)
    if not DU(s1, s2)(ddi)    (8)

The opposite move for a synchronize is then defined as:

CM((s1, synchronize s d), (s2, statement)) ⇒ (s2, statement), (s1, synchronize s d)
    if not DU(s1, s2)(ddi)    (9)

Since a synchronize statement may cause a value of some data structure to be stored in a certain memory location, we regard the data structure used by a synchronize as a `define' of that data structure. It is therefore necessary to check for a define-use dependence in the above V-cal rules.
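Rules (8) and (9) are a guarded swap of two adjacent statements, which a short sketch makes explicit (again with `du` standing in for DU(...)(ddi)):

```python
def communication_move(first, second, du):
    """Rules (8)-(9): swap adjacent statements so a `want` moves upward
    or a `synchronize` moves downward, absent a define-use dependence."""
    (h1, a1), (h2, a2) = first, second
    want_up = a2[0] == "want"
    sync_down = a1[0] == "synchronize"
    if (want_up or sync_down) and not du(h1, h2):
        return [second, first]
    return [first, second]

s1 = ("s1", ("assignment", ("A", ["i"]), ("B", ["i"])))
t4 = ("t4", ("want", "s13", ("F", ["i+1"])))
no_du = lambda h1, h2: False
print(communication_move(s1, t4, no_du))    # t4 skips upward over s1
```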

3.4 The application of transformation rules

Now that we have defined a strategy, we can apply the rules to a V-nus program. The tree-walker will try to match a given rule one or several times to parts of the program. Based on the type of a part of the V-nus program and the signature of the V-cal rules it can be decided whether applying a rule makes sense or not. The example of using V-cal given in Section 3.2 resulted in:

(s1, forall [(i,100)]
  [(s12, assignment (A, [(5,+,i)]) ((B, [(6,+,i)]),+,(C, [(5,+,i)])))]),
(s2, forall [(i,100)]
  [(s13, assignment (E, [i]) ((F, [(i,+,1)]),*,(A, [(5,+,i)])))])

This example can now be extended using a strategy that prescribes a `communication insertion' (3) for each statement in the program, then lifting the want statements out of the loops with V-cal rule (4), and lifting the synchronize statements with V-cal rule (5). An intermediate result is:

(t1, want s12 ([(i,100)],(B, [(6,+,i)]))),
(t2, want s12 ([(i,100)],(C, [(5,+,i)]))),
(s1, forall [(i,100)]
  [(s12, assignment (A, [(5,+,i)]) ((B, [(6,+,i)]),+,(C, [(5,+,i)])))]),
(t3, synchronize s12 ([(i,100)],(A, [(5,+,i)]))),
(t4, want s13 ([(i,100)],(F, [(i,+,1)]))),
(t5, want s13 ([(i,100)],(A, [(5,+,i)]))),
(s2, forall [(i,100)]
  [(s13, assignment (E, [i]) ((F, [(i,+,1)]),*,(A, [(5,+,i)])))]),
(t6, synchronize s13 ([(i,100)],(E, [i])))


To increase the grain of parallelism, the want statement t4 may be skipped over the statements t3 and s1 as defined by the `communication move' rule (8). The above program is then transformed into the final form:

(t1, want s12 ([(i,100)],(B, [(6,+,i)]))),
(t2, want s12 ([(i,100)],(C, [(5,+,i)]))),
(t4, want s13 ([(i,100)],(F, [(i,+,1)]))),
(s1, forall [(i,100)]
  [(s12, assignment (A, [(5,+,i)]) ((B, [(6,+,i)]),+,(C, [(5,+,i)])))]),
(t3, synchronize s12 ([(i,100)],(A, [(5,+,i)]))),
(t5, want s13 ([(i,100)],(A, [(5,+,i)]))),
(s2, forall [(i,100)]
  [(s13, assignment (E, [i]) ((F, [(i,+,1)]),*,(A, [(5,+,i)])))]),
(t6, synchronize s13 ([(i,100)],(E, [i])))

In this example only forall loops were involved. A more complex demonstration of the V-cal rules handles an iteration loop. In the next example we show the movement of some communication statements out of an iteration loop. Consider the following V-nus program:

(s1, iteration [(i,100)]

[(s11, assignment (A, [i]) (B, [i])),

(s12, assignment (C, [(i,+,1)]) ((C, [i]), +, (C, [(i,+,2)])))])

Inserting the communication statements into this program by using the V-cal rule (3) will result in:

(s1, iteration [(i,100)]
  [(t1, want s11 (B, [i])),
   (s11, assignment (A, [i]) (B, [i])),
   (t2, synchronize s11 (A, [i])),
   (t3, want s12 (C, [i])),
   (s12, assignment (C, [(i,+,1)]) ((C, [i]), +, (C, [(i,+,2)]))),
   (t4, synchronize s12 (C, [i]))])

One can easily verify that no define-use dependence exists between s11 and any other statement or statement instance. There is, however, a define-use dependence between the statement instances of s12. So, applying both rule (6) and rule (7) once to this program, we see that the communication statements for s11 can be placed outside the loop. Those for s12 cannot, due to the dependencies. In the loop body we can use rule (9) such that the assignments can be executed sequentially without interference of a communication primitive. Using the V-cal rules will therefore lead to:

(t1, want s11 ([(i,100)], (B, [i]))),
(s1, iteration [(i,100)]
  [(t3, want s12 (C, [i])),
   (s11, assignment (A, [i]) (B, [i])),
   (s12, assignment (C, [(i,+,1)]) ((C, [i]), +, (C, [(i,+,2)]))),
   (t4, synchronize s12 (C, [i]))]),
(t2, synchronize s11 ([(i,100)], (A, [i])))
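The decision behind rules (4) and (5) can be approximated as follows. The dependence test below works on array names only, which is a deliberate coarsening invented for this sketch; the actual V-cal rules reason about statement instances and index expressions.

```python
# Hypothetical sketch of the hoisting decision of rules (4) and (5):
# a want/synchronize pair may leave the iteration loop only if its
# assignment has no define-use dependence with any other statement
# and no loop-carried dependence with itself. Statements use the
# same made-up tuple encoding as before.

def defines(stmt):
    return {stmt[1][1][0]}              # array name on the left-hand side

def uses(stmt):
    return {v[0] for v in stmt[1][2]}   # array names on the right-hand side

def can_hoist(stmt, body):
    """True if stmt's communication may be placed outside the loop."""
    for other in body:
        if other is not stmt and (defines(stmt) & uses(other)
                                  or defines(other) & uses(stmt)):
            return False
    # loop-carried self-dependence: defined name also used by stmt
    return not (defines(stmt) & uses(stmt))

body = [
    ('s11', ('assignment', ('A', '[i]'), [('B', '[i]')])),
    ('s12', ('assignment', ('C', '[(i,+,1)]'),
             [('C', '[i]'), ('C', '[(i,+,2)]')])),
]
print(can_hoist(body[0], body))  # s11 is independent of the rest
print(can_hoist(body[1], body))  # s12 defines and uses C across iterations
```

The test accepts s11 and rejects s12, which is exactly why t1 and t2 end up outside the loop while t3 and t4 remain inside.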

Note that the want statement in the loop body only retrieves one element at a time; the want statement that is placed outside the loop body now retrieves all the elements (B, [i]) that will be used in the subsequent iterations. In a similar way the synchronize statement in the loop differs from the one outside the loop.

4 Discussion

In this paper we have described the compiler based on V-nus and V-cal. The compiler is structured into three independent tools. The flexibility of this compiler design has several advantages.

Since Booster uses abstractions known from many high level data parallel languages, it should be straightforward to use our compiler as a back end for a language such as HPF [HPFF 93]. The V-cal engine is written in TXL (see [Cordy 93]), which allows rapid prototyping. This means that we can `plug & play' with transformation rules and with the strategy that determines in which order these rules are applied. Retargeting our compiler to a different machine involves rewriting the back end and most likely extending the topology dependent rules. Since V-cal rules preserve semantics, existing rules (if properly defined) need not be replaced.

In the near future we want to extend the parallelism identifying and the communication handling rules, and we would also like to make a start with the topology dependent rules. We will perform several theoretical and practical tests on the set of rules that we have developed. The general parallelism identifying rules can be tested against results found in the literature. Besides this theoretical comparison, the entire set of rules needs to be tested on a large number of programs relevant to the field of computational science.

We hope that our compiler will be used by the research community as a tool to experiment with di erent transformation rules and strategies.

References

[Andre 90] F. Andre, J. Pazat, H. Thomas, Pandore: A System to Manage Data Distribution, in Proc. 1990 Intl. Conf. on Supercomputing, The Netherlands, 1990.

[Breebaart 95] L.C. Breebaart, P.F.G. Dechering, A.B. Poelman, J.A. Trescher, J.P.M. de Vreught, and H.J. Sips, The Booster Language, A Working Paper 1.0, Computational Physics report series CP-95-02, Delft University of Technology, 1995.

[Callahan 88] D. Callahan and K. Kennedy, Compiling Programs for Distributed-Memory Multiprocessors, The Journal of Supercomputing, 2:151-169, 1988.

[Cordy 93] J. Cordy, I. Carmichael, The TXL Programming Language, Syntax and Informal Semantics, Version 7, Department of Computing and Information Science, Queen's University at Kingston, txl@qucis.queensu.ca, 1993.

[Dechering 95] P.F.G. Dechering, The Denotational Semantics of Booster, Computational Physics report series CP-95-05, Delft University of Technology, 1995.

[Deransart 88] P. Deransart, M. Jourdan, B. Lorho, Attribute Grammars, Definitions, Systems and Bibliography, vol. 323 of Lecture Notes in Computer Science, Springer-Verlag, 1988.

[Hiranandani 92] S. Hiranandani et al., Compiling Fortran-D for MIMD Distributed-Memory Machines, Communications of the ACM, 35(8):66-80, August 1992.

[Hilfinger 93] P. Hilfinger, P. Colella, FIDIL Reference Manual, Report No. UCB/CSG 93-759, 1993.

[HPFF 93] High Performance Fortran Forum, High Performance Fortran, Language Specification, Version 1.0, Rice University, Houston, Texas, 1993.

[ISO/IEC 91] ISO/IEC, Information Technology - Programming Languages - Fortran, ISO/IEC standard 1539, 1991.

[Jesshope 93] C. Jesshope et al., F-code and its Implementation: a Portable Software Platform for Data Parallelism, in Proc. 4th Intl. Workshop on Compilers for Parallel Computers, Delft, The Netherlands, 1993.

[Li 94] J. Li and M. Wolfe, Defining, Analyzing and Transforming Program Constructs, IEEE Parallel and Distributed Technology, pages 32-39, 1994.

[Mullin 93] L.M.R. Mullin, D.R. Dooling, E.A. Sandberg, and S.A. Thibault, Formal Method in Scheduling, Routing, and Communication Protocol, Fourth International Workshop on Compilers for Parallel Computers, Delft University of Technology, 1993.

[Paalvast 89] E. Paalvast, H. Sips, A High-level Language for the Description of Parallel Algorithms, in Proc. of Parallel Computing '89, North-Holland, 1989.

[Semenzato 90] L. Semenzato and P. Hilfinger, Arrays in FIDIL, in: L.M.R. Mullin, M. Jenkins, G. Hains, R. Bernecky, G. Gao (eds.), Arrays, Functional Languages, and Parallel Systems.

[Trescher 94] J.A. Trescher, P.F.G. Dechering, A.B. Poelman, J.P.M. de Vreught, and H.J. Sips, A Formal Approach to the Compilation of Data Parallel Languages, in K. Pingali, U. Banerjee, D. Gelernter, A. Nicolau, and D. Padua, editors, Languages and Compilers for Parallel Computing, pages 155-169, Springer-Verlag, 1994.

[Turner 85] D. Turner, Miranda: a Non-strict Functional Language with Polymorphic Types, in J.P. Jouannaud, editor, Functional Programming Languages and Computer Architecture, volume 201 of Lecture Notes in Computer Science, Springer-Verlag, 1985.

[Wolfe 89] M. Wolfe, Optimizing Supercompilers for Supercomputers, MIT Press, Cambridge, Massachusetts, 1989.

[Zima 90] H. Zima and B. Chapman, Supercompilers for Parallel and Vector Computers, ACM Press, 1990.
