
Analysis of local enumeration and storage schemes in HPF

Henk J. Sips, Kees van Reeuwijk, Will Denissen

Delft University of Technology

Lorentzweg 1, 2628 CJ Delft, the Netherlands

TNO-TPD

PO Box 155, 2600 Delft, the Netherlands

Abstract

In this paper, we analyze the efficiency of three local enumeration and three storage compression schemes for cyclic(m) data distributions in High Performance Fortran (HPF). We show that for linear array access sequences, efficient enumeration and storage compression schemes can be derived. Furthermore, local enumeration and storage techniques are shown to be orthogonal, if the local storage compression scheme is collapsible. Performance figures of the methods are given for a number of different processors.

1 Introduction

In data parallel languages like High Performance Fortran (HPF) [1], the optimization of SPMD message passing programs is an important research topic. Especially, the determination of local iteration and communication sets is crucial for obtaining performance of parallelized programs. With cyclic(m) data distributions and non-trivial data alignment functions, storage efficiency might also be a problem.

Several solutions for the local set problem and the storage problem have been proposed. We will review the main methods (Section 3 and Section 4) and show that for linear access sequences, local set enumeration can be done very efficiently and can be freely combined with a number of local storage strategies (Section 5). Finally, in Section 6 the performance of local enumeration schemes on a number of processors is given.

2 The problem

Consider the following HPF program, where we have a data object A, which is aligned to a template T. The template T is distributed cyclic(8) over a two-processor arrangement P.

      real A(43)
!HPF$ template T(10000)
!HPF$ processors P(0:1)
!HPF$ align A(i) with T(3*i)
!HPF$ distribute cyclic(8) onto P :: T

forall (i=0:21) A(2*i) = ...

The data layout of this program is visualized in Fig. 1. The small dots are template elements, the black dots are array elements of A, and the boxed dots denote array elements which take part in the computation.

(This research is sponsored by SPIN and Esprit.)

To describe local enumeration and local storage schemes, we will use several functions. First of all, we have an alignment function f_al(i), which describes the ultimate alignment of an array object to another array or template object that is distributed. Furthermore, the array access function is denoted as f_ix(i). The composition of alignment and access function is given by f_co = f_al ∘ f_ix. In the example program, f_al(i) = 3·i, f_ix(i) = 2·i, and f_co(i) = 6·i.

We will also frequently use linear variants of the functions f_ix, f_al, and f_co and denote them as f_ix(i) = a_ix·i + b_ix, f_al(i) = a_al·i + b_al, and f_co(i) = a_co·i + b_co, respectively. Furthermore, two gcd functions are used: g_al and g_co, which are defined as g_al = gcd(a_al, m·n_p) and g_co = gcd(a_co, m·n_p), respectively. For the remainder of the paper, we summarize some definitions in Table 1.

The relation between processor number p and global iteration index i is given by the position equation [12]:

    f_co(i) = m·n_p·r + m·p + c    (1)

with

    r = ⌊f_co(i)/(m·n_p)⌋    (2)
    c = f_co(i) mod m    (3)
    p = ⌊f_co(i)/m⌋ mod n_p    (4)

where c is the column number, r is the row number, m is the block size, n_p is the total number of processors, and p is the processor number. The function in (4) is also called the owner function.
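To make (1)-(4) concrete, here is a minimal Python sketch (an illustrative rendering added for this text, not from the original paper) that instantiates the position equation for the example program, with m = 8, n_p = 2, and f_co(i) = 6·i:

m, np_, f_co = 8, 2, lambda i: 6 * i           # block size, #processors, f_co = f_al o f_ix

def row(i):    return f_co(i) // (m * np_)     # r, eq. (2)
def column(i): return f_co(i) % m              # c, eq. (3)
def owner(i):  return (f_co(i) // m) % np_     # p, eq. (4): the owner function

for i in range(22):                            # forall (i=0:21)
    # the position equation (1) must hold for every global iteration index i
    assert f_co(i) == m * np_ * row(i) + m * owner(i) + column(i)

print([i for i in range(22) if owner(i) == 0])
# -> [0, 1, 3, 6, 8, 9, 11, 14, 16, 17, 19]: the local execution set of P0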

2.1 Local set enumeration

Local execution set enumeration is finding the sequence of local loop iterations for executing an assignment statement involving distributed data structures and using the owner computes rule. Given this definition, we can now transform the example global program to a parameterized local program as follows:


Figure 1: Distribution of example program.

do i=0:21
  if owner(f_co(i)) = me then
    A(2*i) = ...
  end if
end do

where me is a variable holding the number of the processor the program is running on and the function owner(f_co(i)) is given by (4). This method is called guarded execution. Determining the local execution set in this way can be very inefficient, since many tests may return 'false'. The local execution set can be generated more effectively if the predicate owner(f_co(i)) = me is removed. For regular index expressions, this can be achieved by splitting the original regular index set into one or more regular index subsets which cover the local execution set and for which the predicate owner(f_co(i)) = me always yields 'true'. In practice, it turns out that this can be achieved by finding a new function i = gen(j0, j1) and iterating over the ranges of j1 and j0 as follows:

do j1 = j1_min(me):j1_max(me)
  do j0 = j0_min(me):j0_max(me)
    A(2*gen(j0,j1)) = ...
  end do
end do

The local enumeration problem is to find an expression for the generating function gen(j0, j1) and to determine the appropriate ranges of j1 and j0.
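For comparison, a direct Python transcription of the guarded loop above (illustrative only; owner is eq. (4), and the assignment is replaced by an increment) shows the wasted ownership tests:

m, np_ = 8, 2
f_co = lambda i: 6 * i                     # f_co = f_al o f_ix = 3*(2*i)
owner = lambda i: (f_co(i) // m) % np_     # eq. (4)

def guarded_execution(me, A):
    for i in range(22):                    # scans the full range: forall (i=0:21)
        if owner(i) == me:                 # many tests fail: wasted work
            A[2 * i] += 1                  # stand-in for the assignment
    return A

guarded_execution(0, [0] * 43)             # P0 touches A(0), A(2), A(6), A(12), ...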

c     The column number.
r     The row number.
g     The greatest common divisor.
m     Block size in cyclic(m).
n     Required local storage for a distributed array.
n_i   Upper bound of the global iteration index i.
n_r   The number of rows.
n_p   Number of processors.
p     Processor number.

Table 1: Definitions.

2.2 Local storage

If only the local part of a distributed array is to be stored on a processor, a local storage scheme is needed. As a result, a global array access must be transformed to a local array access. This is done by introducing a global-to-local function g2l(i), transforming the assignment statement to:

A(g2l(2*gen(j0,j1))) = ...

For the function g2l there are many choices. First of all, it is important that the function can be evaluated efficiently, since it is invoked for every local array access. Secondly, it must yield a reasonably efficient local storage compression scheme. By that we mean that the number of allocated, but unused, storage elements should be limited.

3 Local set enumeration methods

In this section, three different local set enumeration methods are reviewed: rowwise enumeration, columnwise enumeration, and pattern cyclic enumeration. Basic formulas for rowwise and columnwise enumeration have been derived in various forms in previous papers [3, 7, 5, 8, 12, 13, 14]. Here we use the framework of [12]. Pattern cyclic enumeration is due to Chatterjee et al. [2].

3.1 Rowwise enumeration

Rowwise enumeration is achieved by decomposing the position equation (1) in the loop order (p, r, c), i.e., the column number is generated in the innermost loop. Rowwise enumeration is described by the following formulas:

    ⌊b_co/(m·n_p)⌋ ≤ j1 < n_r    (5)
    0 ≤ −m·n_p·j1 − m·p + f_co(j0) < m    (6)
    gen_row(j0, j1) = j0    (7)

with n_r = ⌈f_co(i_max)/(m·n_p)⌉ and i_min = 0 (normalized). Loops can be straightforwardly generated by solving j1 and j0 out of the inequalities (5) and (6). In the example, rowwise generation generates for processor P0 the global iteration sequences (0,1); (3); (6); (8,9); etc. The loop variable bounds for j0 and j1 become 0 ≤ j1 < 8 and 0 ≤ −16·j1 + 6·j0 < 8, respectively, for processor P0. Note that for rowwise generation no gcd calculations are necessary.
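A Python sketch of rowwise enumeration (our reading of (5)-(7), not the paper's code; the bounds of j0 are solved from (6) by ceiling division):

import math

m, np_, ni = 8, 2, 22
a_co, b_co = 6, 0                         # f_co(i) = a_co*i + b_co

def rowwise(p):
    n_r = (a_co * (ni - 1) + b_co) // (m * np_) + 1   # rows 0 .. n_r-1
    for j1 in range(n_r):
        lo = m * np_ * j1 + m * p                     # eq. (6): lo <= f_co(j0) < lo+m
        first = math.ceil((lo - b_co) / a_co)
        last = math.ceil((lo + m - b_co) / a_co)      # exclusive upper bound
        for j0 in range(max(first, 0), min(last, ni)):
            yield j0                                  # gen_row(j0, j1) = j0, eq. (7)

print(list(rowwise(0)))   # [0, 1, 3, 6, 8, 9, 11, 14, 16, 17, 19]: (0,1);(3);(6);(8,9);...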

3.2 Columnwise enumeration

Columnwise enumeration is achieved by decomposing the position equation (1) in the loop order (p, c, r), i.e., the row number is generated in the innermost loop. Columnwise generation is described by the following formulas:

    0 ≤ b_co − m·p + g_co·j1 < m    (8)
    0 ≤ i0·j1 + ∆i_co·j0 < n_i    (9)
    gen_col(j0, j1) = i0·j1 + ∆i_co·j0    (10)

where i0 is a particular solution of the Diophantine equation a_co·i − m·n_p·j0 = g_co and ∆i_co = m·n_p/g_co, with g_co = gcd(a_co, m·n_p).

Again loops are generated by solving j1 and j0 out of the inequalities (8) and (9). In the example, columnwise generation yields the global index sequences (0, 8, 16); (3, 11, 19); (6, 14); (1, 9, 17); etc. for processor P0. The generating equations are given by 0 ≤ j1 < 4, 0 ≤ 3·j1 + 8·j0 < 22, and i = gen_col(j0, j1) = 3·j1 + 8·j0 (the particular solution is i0 = 3).
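The inner workings can be sketched in Python (illustrative, not the paper's code); the particular solution i0 comes from the extended Euclidean algorithm:

import math

m, np_, ni = 8, 2, 22
a_co, b_co = 6, 0

def ext_gcd(a, b):                        # returns (g, x, y) with a*x + b*y = g
    if b == 0:
        return a, 1, 0
    g, x, y = ext_gcd(b, a % b)
    return g, y, x - (a // b) * y

def columnwise(p):
    g_co, x, _ = ext_gcd(a_co, m * np_)   # g_co = gcd(6, 16) = 2
    i0, d_ico = x, m * np_ // g_co        # particular solution i0 = 3, ∆i_co = 8
    # columns from (8): 0 <= b_co - m*p + g_co*j1 < m
    for j1 in range(math.ceil((m * p - b_co) / g_co),
                    math.ceil((m * p + m - b_co) / g_co)):
        # rows from (9): 0 <= i0*j1 + ∆i_co*j0 < n_i (note j0 may be negative)
        for j0 in range(math.ceil(-i0 * j1 / d_ico),
                        math.ceil((ni - i0 * j1) / d_ico)):
            yield i0 * j1 + d_ico * j0    # gen_col(j0, j1), eq. (10)

print(list(columnwise(0)))  # [0, 8, 16, 3, 11, 19, 6, 14, 1, 9, 17]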

3.3 Pattern cyclic enumeration

The pattern cyclic method is due to Chatterjee et al. [2]. The method is based on using a repeating pattern in the access of local array elements. Such a pattern can be observed in Fig. 1. In processor P0, the array elements {0, 2, 6, 12} form a cycle, which is repeated thereafter. It can also be observed that a pattern cycle consists of a sequence of consecutive array elements having a different column number.

If we denote the number of elements per cycle as EC_co, we can observe that in the example EC_co = 4 for processor P0. The number of elements in a pattern having a different column number is equal to the number of columns in the columnwise enumeration scheme described in the previous section. The number of columns is given by solving j1 out of (8), yielding ⌈(m·p − b_co)/g_co⌉ ≤ j1 < ⌈(m·p + m − b_co)/g_co⌉. Then EC_co is found by subtracting both bounds, giving EC_co = ⌈(m·p + m − b_co)/g_co⌉ − ⌈(m·p − b_co)/g_co⌉. Pattern cyclic enumeration leads to the following equations:

    0 ≤ j1 < N_c    (11)
    0 ≤ j0 < EC_co(j1)    (12)
    gen_pc(j0, j1) = j1·∆i_co + GTB_co(j0) + i_s    (13)

where N_c is the number of pattern cycles covering the local subset of the index space, i_s is the first global index value in the local range, and GTB_co is a table mapping j0 to a relative global index value i within a cycle. The number of elements per cycle EC_co in (12) is parameterized by j1. This is because the last cycle may not be a complete cycle.

The values of the table GTB_co can be found by taking the first EC_co global indices in the rowwise enumeration scheme from Section 3.1, followed by a subtraction of the value of the first element i_s, to make all values in the table relative to the starting point of the cycle.

For the example: for processor P0 the first EC_co global indices are 0, 1, 3, 6, and i_s = 0. Hence, the table GTB_co is given by GTB_co : [0, 1, 2, 3] → [0, 1, 3, 6]. The number of cycles N_c = 3 and the last cycle contains 3 elements. The global index generating equation is equal to gen_pc(j0, j1) = 8·j1 + GTB_co(j0).

Chatterjee et al. use an incremental version of the table GTB_co (the ∆M table in [2]), which can be directly constructed from GTB_co. Basically there is no difference, apart from the fact that the GTB form allows out of order execution of loop iterations.

There are various ways to implement the outer loop defined in (11). If we can calculate N_c and the number of elements in the last cycle in advance, the implementation follows directly from (11) and (12). In Section 4.3 it is shown how these numbers can be derived.

Another way is to test the global index generated by (13) against the upper bound n_i in each iteration. Then the outer loop variable j1 can be taken free running and the inner loop variable j0 can be taken with a fixed range EC_co. Which method is to be preferred depends on the architectural properties of the underlying processor.

4 Global-to-local functions

The global-to-local function g2l can take many forms. We will evaluate three forms: rowwise storage compression, columnwise storage compression, and pattern cyclic storage compression. Formulas for rowwise and columnwise compression can be derived from the position equation and are given without proof. A complete proof can be found in [12]. The pattern cyclic compression scheme will be related to columnwise compression.

4.1 Rowwise compression

In rowwise compression, we compress the rows of the two-dimensional index space resulting from a cyclic(m) distribution on an array dimension. Row compression leads to the following general global-to-local function:

    g2l_row(i) = ⌊f_al(i)/(m·n_p)⌋ · ⌈m/a_al⌉ + ⌊(f_al(i) mod m)/a_al⌋    (14)

For the example program, rowwise compression is shown in Fig. 2a. The rowwise compression function is given by g2l_row(i) = ⌊3·i/16⌋ · 3 + ⌊(3·i mod 8)/3⌋. The figure also shows that in this example, some unused local storage space (i.e., holes) is allocated.

4.2 Columnwise compression

For columnwise compression, we have the following compression scheme:

    g2l_col(i) = ⌊f_al(i)/(m·n_p·∆r_al)⌋ · ⌈m/g_al⌉ + ⌊(f_al(i) mod m)/g_al⌋    (15)

where g_al = gcd(a_al, m·n_p) and ∆r_al = a_al/g_al. The term ∆r_al can be interpreted as the distance in row numbers between the column elements in the template space. For the example, we have g_al = gcd(3, 8·2) = 1 and ∆r_al = 3/1 = 3.

An alternative formulation for columnwise compression is given by the following formula:

    g2l_colalt(i) = ⌊i/∆i_al⌋ · ⌈m/g_al⌉ + ⌊(f_al(i) mod m)/g_al⌋    (16)

where ∆i_al = m·n_p/g_al. The term ∆i_al can be interpreted as the distance in terms of the global array index between successive column elements.

In the example, both (15) and (16) lead to the same compression scheme, as depicted in Fig. 2b. The global-to-local function becomes in both cases g2l_col(i) = ⌊i/16⌋·8 + (3·i) mod 8.
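The corresponding Python rendering of (15) (again an illustrative sketch, not the paper's code):

import math

m, np_, a_al = 8, 2, 3
g_al = math.gcd(a_al, m * np_)            # g_al = 1
d_ral = a_al // g_al                      # ∆r_al = 3
f_al = lambda i: a_al * i

def g2l_col(i):
    c2 = -(-m // g_al)                    # ceil(m/g_al) = 8
    return f_al(i) // (m * np_ * d_ral) * c2 + (f_al(i) % m) // g_al

print({k: g2l_col(k) for k in (0, 1, 2, 6, 7, 11, 12, 13)})
# {0: 0, 1: 3, 2: 6, 6: 2, 7: 5, 11: 1, 12: 4, 13: 7} -- dense, but out of order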

4.3 Pattern cyclic compression

Local array elements also follow a cyclic pattern. In fact, this is equivalent to taking an enumeration scheme with f_ix(i) = i and ranging over the whole dimension. In Fig. 1 it is shown that for processor P0, the array elements {0, 1, 2, 6, 7, 11, 12, 13} form a cycle, which is repeated thereafter. In principle, we could construct a table GTB_al, which relates local array indexes to global array indexes. Note that by doing so, we are able to store array elements in consecutive order without any holes. For the example this yields GTB_al : [0, 1, 2, 3, 4, 5, 6, 7] → [0, 1, 2, 6, 7, 11, 12, 13]. The inverse table of GTB_al gives the required mapping from global index to local array element. However, using this table is not very efficient. Table entries of the inverted table can be very widely spaced apart.

Instead, the term ⌊(f_al(i) mod m)/g_al⌋ in (15) defines the relative column number in the columnwise compression scheme. By relating this number to the actual in-order position of the associated local array element by means of a permutation table PTB_al, we obtain the following mapping scheme:

    g2l_pc(i) = ⌊f_al(i)/β⌋ · EC_al + PTB_al(⌊(f_al(i) mod m)/g_al⌋)    (17)

where β = m·n_p·∆r_al, PTB_al denotes the permutation table, and EC_al denotes the number of elements in a cycle. For processor P0, the compression function becomes g2l_pc(i) = PTB_al((3·i) mod 8) + 8·⌊i/16⌋.

The pattern cyclic storage scheme is illustrated in Fig. 2c. From the figure it can be seen that the permutation table is in this case equal to PTB_al : [0, 1, 2, 3, 4, 5, 6, 7] → [0, 5, 3, 1, 6, 4, 2, 7]. Note that the term EC_al gives a slightly more accurate bound than the constant ⌈m/g_al⌉ in the column compression scheme. Hence it can also be used in the columnwise compression scheme.

Table construction. To construct the permutation table PTB_al in (17), we take an enumeration scheme with the identity index access function. From this a table GTB_al is constructed according to the scheme outlined in Section 3.3. Then for each value i = GTB_al(j0) in the basic cycle, the term t = ⌊(f_al(i) mod m)/g_al⌋ is calculated, giving the relative column number. Now each table entry is found by setting PTB_al(t) = j0.

The algorithm is linear in the number of elements in a cycle. It uses a linear rowwise enumeration, followed by a simple function mapping for each element in the cycle.

In the example, rowwise enumeration gives the mapping GTB_al : [0, 1, 2, 3, 4, 5, 6, 7] → [0, 1, 2, 6, 7, 11, 12, 13]. Substitution of the output values in the term ⌊(f_al(i) mod m)/g_al⌋ gives the mapping [0, 1, 2, 6, 7, 11, 12, 13] → [0, 3, 6, 2, 5, 1, 4, 7]. Relating this output to the original input of GTB_al yields PTB_al : [0, 3, 6, 2, 5, 1, 4, 7] → [0, 1, 2, 3, 4, 5, 6, 7]. The table PTB_al is efficient, because it allows direct mapping and contains no unused entries.

Chatterjee et al. [2] use a formula that is very similar to (17). Instead of the PTB_al table, they use the NUMARCS table. The difference is that the NUMARCS table always has size m and hence may have unused entries.

Figure 2: Various storage compression schemes: (a) rowwise, (b) columnwise, and (c) pattern cyclic.
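The table construction algorithm of Section 4.3 can be sketched in Python as follows (illustrative; the first cycle is again obtained by a one-off guarded scan):

import math

m, np_, n_arr, a_al = 8, 2, 43, 3
g_al = math.gcd(a_al, m * np_)                 # 1

def build_ptb(p):
    ec = math.ceil((m * p + m) / g_al) - math.ceil(m * p / g_al)  # EC_al = 8
    # GTB_al: the first EC_al local array elements under identity access
    gtb = [k for k in range(n_arr) if (a_al * k // m) % np_ == p][:ec]
    ptb = [0] * ec
    for j0, k in enumerate(gtb):
        t = (a_al * k % m) // g_al             # relative column number
        ptb[t] = j0                            # PTB_al(t) = j0
    return ptb

print(build_ptb(0))   # [0, 5, 3, 1, 6, 4, 2, 7], as in the example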

Set sizes. To allocate buffer space or to calculate the number of cycles N_c in an enumeration set, the size of the local set must be calculated.

The size of the local array can be found by taking the identity function as index access function, i.e., f_ix(i) = i, and ranging over the whole global array, i.e., n_i = 43.

The global-to-local function g2l_pc(i) in (17) has the property that all elements in a cycle are stored consecutively in local memory. Hence, a cycle occupies EC_al memory elements. The location of the last local element is found by substituting i = gen_row(j0,max, j1,max) from the rowwise enumeration scheme into g2l_pc(i). However, for some distribution schemes the last row may be empty. Then the row variable j1 must be decremented until a row containing elements has been found. Since the first local element is stored on address 0 by default, the local storage size is given by

    n = g2l_pc(gen_row(j0,max, j1,max)) + 1    (18)

In the example, j1,max = 7 and j0,max = 39. Therefore, g2l_pc(39) = 20. Hence n = 21.

The number of cycles is then equal to ⌈n/EC_al⌉ and the number of elements in the last cycle to n − (N_c − 1)·EC_al.

To determine the size of the local enumeration set, we can simply set the alignment function to f_co. In this way, we emulate that only the enumerated elements are stored. For this method a new table PTB_co must be constructed. Then the same scheme as outlined above can be used to calculate the size of the local enumeration set.
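A numeric check of (18) for the example (illustrative; the P0 table values are taken from Section 4.3):

m, np_, a_al, n_arr = 8, 2, 3, 43
ptb, ec_al, beta = [0, 5, 3, 1, 6, 4, 2, 7], 8, 48   # P0: PTB_al, EC_al, β = m·np·∆r_al

def g2l_pc(i):                                 # eq. (17), with g_al = 1
    f = a_al * i
    return f // beta * ec_al + ptb[f % m]

last = max(k for k in range(n_arr) if (a_al * k // m) % np_ == 0)  # = 39 on P0
n = g2l_pc(last) + 1                           # eq. (18): n = 21
nc = -(-n // ec_al)                            # ceil(n/EC_al) = 3 cycles
print(n, nc, n - (nc - 1) * ec_al)             # 21 3 5: last storage cycle holds 5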

5 Combination of enumeration and storage schemes

Thus far, local enumeration and local storage compression have been treated independently. This means that any local enumeration scheme can be freely combined with any of the local storage compression schemes. The global-to-local functions from Section 4 are relatively computation intensive, since in the general case they contain mod and div operations. However, for single local array accesses this is necessary.


Figure 3: Example of a collapsible (a) and non-collapsible (b) storage compression.

Fortunately, for linear iteration sequences, the global-to-local calculations can be reduced to linear expressions in the inner loop. We will show that this property holds for any combination of local enumeration and local storage scheme, provided the local storage compression scheme is collapsible.

5.1 Collapsible storage compression schemes

For collapsible storage compression schemes, the following definition is given:

Definition 1 A storage scheme is said to be collapsible if it is bijective and after compression the storage distance remains constant between (a) any two successive array elements having the same row number and (b) any two successive array elements having the same column number, where row and column numbers are defined according to the position equation (1).

From this definition, we will show that the storage compression schemes (14), (15), and (17) are collapsible, but the compression scheme (16) is not. That the scheme (16) is non-collapsible is shown by providing a counterexample, as depicted in Fig. 3. This figure shows the layout of local array elements for processor P0 with the alignment function f_al(i) = i + 1. Fig. 3a shows the compression resulting from (15) (in effect, in this example all elements remain in the same position after compression) and Fig. 3b shows the compression resulting from (16). Clearly, the distance between element 15 and 16 has changed from 1 to 9, while the other successive elements in that row still have a distance of 1.

To prove that the three other storage compression schemes are collapsible and to derive the associated linear expressions, we write the g2l functions (14), (15), and (17) as

    g2l_row(i) = T1·C1 + T4
    g2l_col(i) = T2·C2 + T5
    g2l_pc(i) = T3·C3 + T6

where T1 = ⌊f_al(i)/(m·n_p)⌋, T2 = T3 = ⌊f_al(i)/(∆r_al·m·n_p)⌋, T4 = ⌊(f_al(i) mod m)/a_al⌋, T5 = ⌊(f_al(i) mod m)/g_al⌋, T6 = PTB_al(T5), C1 = ⌈m/a_al⌉, C2 = ⌈m/g_al⌉, and C3 = EC_al.

5.2 Rowwise enumeration and row, column, or pattern cyclic storage

Looking at (6), we see that for the enumeration of j0 it holds that m·k ≤ a_co·j0 + b_co < m·(k+1) for some k ∈ Z. This implies that for f_co(j0) = a_co·j0 + b_co, it holds that ⌊f_co(j0)/m⌋ = k. We rewrite the range of the variable j0 ∈ [j0,min, j0,max] as j0,min + u, where u ∈ [0, j0,max − j0,min]. Using the property that (x + ∆x) mod m = (x mod m) + ∆x if ⌊(x + ∆x)/m⌋ = ⌊x/m⌋ = k, it follows that T4 can be written as

    T4 = ⌊(f_co(j0,min) mod m)/a_al⌋ + a_ix·u

Figure 4: Rowwise enumeration combined with rowwise, columnwise, and pattern cyclic storage compression.

For columnwise storage the reasoning is completely analogous for T5, yielding a multiplier of a_co/g_al instead of a_ix.

From ⌊f_co(j0)/m⌋ = k, it also follows that both terms T1 and T2 are constant for any value of j0 within that row. Therefore, we may take j0,min to calculate the terms. Hence, it follows for rowwise enumeration:

    g2l^erow_srow(f_ix(j0)) = g2l_row(f_ix(j0,min)) + a_ix·u
    g2l^erow_scol(f_ix(j0)) = g2l_col(f_ix(j0,min)) + (a_co/g_al)·u

where g2l^erow_srow means rowwise enumeration combined with rowwise storage and g2l^erow_scol means rowwise enumeration combined with columnwise storage.

For pattern cyclic storage, it holds that due to the permutation table, the row elements are stored consecutively. Hence, once the starting point has been calculated, the next element in the row can be found at distance a_ix. Therefore,

    g2l^erow_spc(f_ix(j0)) = g2l_pc(f_ix(j0,min)) + a_ix·u    (19)

The principle is illustrated in Fig. 4. For both rowwise and pattern cyclic storage we have a distance of a_ix = 2, and for columnwise storage we have a distance of a_co/g_al = 6 between row elements.
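A small Python check of (19) on the example (illustrative; the P0 tables are from Section 4.3): one full g2l evaluation per row, then a fixed stride a_ix inside the row:

m, a_al, a_ix = 8, 3, 2
ptb, ec_al, beta = [0, 5, 3, 1, 6, 4, 2, 7], 8, 48   # P0 values, g_al = 1

def g2l_pc(i):                                       # full function, eq. (17)
    f = a_al * i
    return f // beta * ec_al + ptb[f % m]

# row j1 = 3 of the rowwise enumeration on P0 holds iterations j0 = 8, 9
base = g2l_pc(2 * 8)                                 # one full g2l at the row start
addrs = [base + a_ix * u for u in range(2)]          # then stride a_ix = 2, eq. (19)
assert addrs == [g2l_pc(2 * j0) for j0 in (8, 9)]    # matches the full formula
print(addrs)                                         # [8, 10]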

5.3 Columnwise enumeration and row, column, or pattern cyclic storage

For columnwise enumeration and columnwise storage, we rewrite the global index generation function gen_col(j0, j1) = i0·j1 + ∆i_co·j0 from (9) as gen_col(j0, j1) = λ + ∆i_co·u, where λ = i0·j1 + ∆i_co·j0,min denotes the starting point of the enumeration sequence. T2 = ⌊f_al(i)/(∆r_al·m·n_p)⌋ can be rewritten as T2 = ⌊f_al(i)/(∆i_al·a_al)⌋. Substituting λ + ∆i_co·u in T2 yields

    ⌊f_co(λ + ∆i_co·u)/(a_al·∆i_al)⌋ = ⌊f_co(λ)/(a_al·∆i_al)⌋ + a_ix·(∆i_co/∆i_al)·u

because a_ix·(∆i_co/∆i_al) always yields an integer.

Figure 5: Columnwise enumeration combined with rowwise, columnwise, and pattern cyclic storage compression.

It holds that ∆i_co = m·n_p/g_co. Hence f_co(λ + ∆i_co·u) = f_co(λ) + a_co·(m·n_p/g_co)·u. Because the term a_co·(m·n_p/g_co) is always a multiple of m, the term T5 will always yield a constant (as will T4 in the other case). As a result, the global-to-local function can be written as

    g2l^ecol_scol(α_col) = g2l_col(f_ix(λ)) + a_ix·(∆i_co/∆i_al)·C2·u

where α_col = f_ix(gen_col(j0, j1)). The derivation of columnwise enumeration with rowwise storage is analogous, yielding

    g2l^ecol_srow(α_col) = g2l_row(f_ix(λ)) + a_co·(∆i_co/(m·n_p))·C1·u

For pattern cyclic storage it holds that the distance between column elements is always a multiple of the basic cycle distance and is constant. Therefore, the linearization is similar to columnwise storage and yields

    g2l^ecol_spc(α_col) = g2l_pc(f_ix(λ)) + a_ix·(∆i_co/∆i_al)·C3·u

Hence, all three global-to-local functions are linear in u. In Fig. 5 the linear sequences are shown for the example on processor P0. For columnwise storage, we have a_ix·(∆i_co/∆i_al)·C2 = 8. Since C3 = EC_al = 8, for pattern cyclic storage the distance is also 8. For rowwise storage, we have a_co·(∆i_co/(m·n_p))·C1 = 9.

5.4 Pattern cyclic enumeration and row, column, or pattern cyclic storage

For pattern cyclic enumeration it holds that u = j0. The generation function of pattern cyclic enumeration (13) contains three terms. The last two terms GTB_co(u) + i_s are used to generate a local element access table LTB_co(u) to the stored local elements, by calculating a full global-to-local function for each value of GTB_co(u) + i_s and subtracting g2l(f_ix(i_s)). This can be done using any of the storage compression schemes.

Figure 6: Pattern cyclic enumeration combined with rowwise, columnwise, and pattern cyclic compression.

The first term j1·∆i_co leads to the same offset constants as in columnwise generation, except that it is now applied to the outer loop variable j1 instead of the inner loop variable j0. Following this, we obtain the following three equations:

    g2l^epc_scol(α_pc) = g2l_col(f_ix(i_s)) + γ + a_ix·(∆i_co/∆i_al)·C2·j1
    g2l^epc_srow(α_pc) = g2l_row(f_ix(i_s)) + γ + a_co·(∆i_co/(m·n_p))·C1·j1
    g2l^epc_spc(α_pc) = g2l_pc(f_ix(i_s)) + γ + a_ix·(∆i_co/∆i_al)·C3·j1

with α_pc = f_ix(gen_pc(j0, j1)) and γ = LTB_co(u).

Returning to the example: for the three storage schemes, the following tables are derived: LTB^srow_co : [0, 1, 2, 3] → [0, 2, 3, 7]; LTB^scol_co : [0, 1, 2, 3] → [0, 6, 2, 4]; and LTB^spc_co : [0, 1, 2, 3] → [0, 2, 3, 6]. In all cases i_s = 0. The constants are equal to the ones used in columnwise enumeration, i.e., 9 for rowwise storage and 8 for both columnwise and pattern cyclic storage. The results are shown in Fig. 6.

5.5 Other optimizations: outer loop tables

In row or column enumeration, each starting point of a row or column requires a full global-to-local function calculation. For large values of m or many rows, this number of function calculations can be quite substantial. If the inner loops are short, this contribution to the overhead is significant. In those cases, it might be profitable to calculate the starting points in advance and let the outer loop be generated from the table values. The starting points can be simply obtained from the rowwise or columnwise enumeration scheme, respectively. In practice, this method only makes sense if the total parallel loop is invoked a number of times with unchanged parameters. Otherwise, table construction time will be larger than directly performing the global-to-local calculations.

The number of starting points of the columnwise scheme is equal to EC_co. In fact, this is similar to the pattern cyclic table construction method. For rowwise enumeration the number of elements is equal to or less than EC_co, since a row may contain more than one element of a cycle.


scheme      in-order   in-order   g_al   g_co   stor.    enum.
            enum.      stor.                    tables   tables
erow/srow   yes        yes        no     no     no       no
erow/scol   yes        no         yes    no     no       no
erow/spc    yes        yes        yes    no     yes      no
ecol/srow   no         yes        no     yes    no       no
ecol/scol   no         no         yes    yes    no       no
ecol/spc    no         yes        yes    yes    yes      no
epc/srow    yes        yes        no     yes    no       yes
epc/scol    yes        no         yes    yes    no       yes
epc/spc     yes        yes        yes    yes    yes      yes

Table 2: Properties of enumeration and storage schemes.

6 Analysis and performance

Each combination of enumeration scheme and local storage compression scheme has different properties. These properties are summarized in Table 2. The columns of the table denote the various properties of the combination. The first column shows the enumeration and storage compression combination. The second column is in-order enumeration. By that we mean that the local sequence of generated global indices has a lexicographic ordering. The third column, in-order storage, denotes the same for local array elements. The fourth and fifth columns indicate whether gcd calculations are required. Finally, the sixth and seventh columns indicate when tables are needed.

The number of storage tables and/or g_al calculations is equal to the number of distributed dimensions of all distributed array objects in a program. Under some conditions storage tables may be shared. These tables and calculations have to be done only once, at array creation time or when redistribution or realignment statements are encountered. The number of enumeration tables and/or g_co calculations required depends on the number of index access functions on distributed array objects. Enumeration tables can be constructed once all access function parameters are known.

For local storage, rowwise and columnwise storage compression may lead to unused allocated storage. The amount of unused storage space is strongly dependent on the alignment function applied and the size of the array. However, it is shown in [12] that if rowwise compression is inefficient, columnwise is efficient, and vice versa.

To measure the index overhead in generating the local indices, we simulated a simple forall statement A(a*i+b) = A(a*i+b) + 1 using a one-dimensional distributed array of size 40000. We assume that the address calculation for A(a*i+b) is done only once, so that in effect the assignment results in an increment operation. This statement was parallelized, and the time spent in the inner loop of one processor was recorded for all local enumeration methods. We simulated two versions: one version treats each index as a single index, i.e., a full global-to-local function must be evaluated on each access; the second version is the optimized linear local access sequence. The overhead of the sequential version of the program has also been included.

The experiment was done on a number of different RISC processors: microSparc II, HP-PA 1.1, and MIPS R4400. The results are depicted in Figs. 7, 8, and 9 (log-log scale, execution time in seconds versus size of the inner loop in elements; each figure shows curves for the sequential version, RC, RC without g2l, RC with table, pattern cyclic, and pattern cyclic without g2l). In the figures, 'RC' means rowwise or columnwise enumeration. The variant 'without g2l' uses linear local sequence generation. The variant 'with table' comprises rowwise or columnwise enumeration with outer loop tables. For pattern cyclic enumeration we have used the third method of implementation described in [10], since the original scheme from [2] is inefficient. The overhead of table construction is not taken into account.

Figure 7: Enumeration overhead for HP-PA processor.

Figure 8: Enumeration overhead for MIPS processor.

Figure 9: Enumeration overhead for Sparc processor.

Several observations can be made from the performance figures. For linear access sequences and large inner loops, the pattern cyclic method is about 50% slower than the other methods described in this paper. On short inner loops (< 50), all methods give an increased overhead. For inner loops larger than 100, the loop overhead approaches pure sequential loop overhead. Full global-to-local conversions for arbitrary accesses are expensive in all methods (up to 30 times slower than sequential execution).

However, there is more to be said. The size of the inner loop cannot be freely chosen. In both the pattern cyclic and the rowwise enumeration method, the size of the inner loop is bounded by m. Hence for small values of m and relatively long local index sets, columnwise enumeration will outperform both other methods, as can be derived from the performance figures.

7 Related work

Various authors have addressed the problem of local set enumeration for cyclic(m) distributions. For cyclic(m) distributions with no alignment, a rowwise and a columnwise solution has been given in Paalvast et al. [3]. The rowwise solution allows array access functions to be monotone, and the columnwise solution requires the array access function to be linear. In Reeuwijk et al. [12] this result has been extended to include alignment.

Two related approaches have been presented by Stichnoth et al. [7] and Gupta et al. [8]. Essential in these approaches is the notion of virtual processors for solving the cyclic(m) case. Each block(m) or cyclic(1) solution for a regular section of an array is assigned to a virtual processor, yielding a so-called virtual-block or virtual-cyclic view, respectively. Stichnoth et al. [7] use a virtual-cyclic view and use the intersection of array slices for communication generation. Gupta et al. [8, 9] present closed form solutions for both the virtual-block and the virtual-cyclic view.

Chatterjee et al. describe a non-linear method to enumerate local elements [2] by using a finite state machine approach. Recent papers describe efficient methods [4, 10] to construct this finite state machine. Kennedy et al. also showed [11] that their method can be used without a table, using a demand-driven evaluation scheme, at the expense of some performance.

Ancourt et al. [5] use linear algebra techniques to construct enumeration and communication sets. They also present a symbolic solution for columnwise local index enumeration.

Chatterjee et al. [2] use a local storage compression scheme in which the array elements are stored in lexicographic order without holes. The global-to-local function is in fact implemented through the FSM. In Kaushik et al. [9] similar dense storage compression schemes are employed. Each scheme is specific for an enumeration method. Changing the enumeration method for an array object leads to a reallocation of the array. Also, no global-to-local formula is given, leading to table generation time overhead for translating the global index sets to the local index sets. Stichnoth et al. and Midkiff [7, 14] use a block compression method with the cycle number as second index. Ancourt et al. [5] derive a formula for a columnwise compression method. Their global-to-local function remains rather complicated, although they do remark that in some cases more efficient solutions can be obtained. Benkner [13] describes a columnwise solution with dense storage and outer loop tables. His global-to-local function is more complicated than the ones in this paper.

Run-time techniques are proposed by Mahéo and Pazat [6] and Hassen et al. [15]. These methods are based on paged array storage; they can provide a very efficient global-to-local function. This comes at the cost of additional storage overhead and page management.

8 Conclusions

In this paper, it has been shown that efficient local enumeration and local storage schemes can be derived. Moreover, if a local storage compression scheme is collapsible, it can be efficiently combined with any local enumeration scheme. The performance of rowwise and columnwise enumeration on short inner loops can be further improved by exploiting specific properties of alignment and array access functions.

References

[1] High Performance Fortran Forum, High Performance Fortran Language Specification, version 1.1, Rice University, Houston, Texas, November 1994.

[2] S. Chatterjee, J.R. Gilbert, F.J.E. Long, R. Schreiber, and S.-H. Teng, "Generating local addresses and communication sets for data-parallel programs," Journal of Parallel and Distributed Computing, vol. 26, no. 1, pp. 72-84, April 1995 (first presented at PPoPP'93).

[3] E.M. Paalvast, H.J. Sips, and A.J. van Gemund, "Automatic parallel program generation and optimization from data decompositions," Proc. 1991 Int. Conf. on Parallel Processing, August 1991, pp. II 124-131.

[4] A. Thirumalai and J. Ramanujam, "Fast address sequence generation for data-parallel programs using integer lattices," Proc. of the 8th Int. Workshop on Languages and Compilers for Parallel Computing, August 1995.

[5] C. Ancourt, F. Irigoin, F. Coelho, and R. Keryell, "A linear algebra framework for static HPF code distribution," Technical report no. A-278-CRI, Ecole des Mines, November 1995 (an early version was also presented at the Fourth Int. Workshop on Compilers for Parallel Computers, Delft, the Netherlands, Dec. 1993).

[6] Y. Mahéo and J.-L. Pazat, "Distributed array management for HPF compilers," High Performance Computing Symposium, Montreal, Canada, July 1995.

[7] J.M. Stichnoth, D. O'Hallaron, and T.R. Gross, "Generating communication for array statements: Design, implementation and evaluation," Journal of Parallel and Distributed Computing, vol. 21, pp. 150-159, 1994.

[8] S.K.S. Gupta, S.D. Kaushik, C.-H. Huang, and P. Sadayappan, "On compiling array expressions for efficient execution on distributed-memory machines," Tech. Report OSU-CISRC-4/94-TR19, Ohio State Univ., 1994.

[9] S.D. Kaushik, C.-H. Huang, and P. Sadayappan, "Compiling array statements for efficient execution on distributed memory machines: two-level mappings," Proc. of the 8th Int. Workshop on Languages and Compilers for Parallel Computing, August 1995.

[10] K. Kennedy, N. Nedeljkovic, and A. Sethi, "A linear time algorithm for computing the memory access sequence in data-parallel programs," Proc. Symp. on Principles and Practice of Parallel Programming, 1995.

[11] K. Kennedy, N. Nedeljkovic, and A. Sethi, "Efficient address generation for block-cyclic distributions," Proc. of the Intl. Conf. on Supercomputing, June 1995, pp. 180-184.

[12] C. van Reeuwijk, H.J. Sips, W. Denissen, and E.M. Paalvast, "An implementation framework for HPF distributed arrays on message passing computer systems," CP Technical Report series, TR9506, Delft University of Technology, November 1994 (to appear in IEEE Transactions on Parallel and Distributed Systems).

[13] S. Benkner, "Handling block-cyclic distributed arrays in Vienna Fortran 90," Proceedings International Conference on Parallel Architectures and Compilation Techniques, Limassol, Cyprus, June 1995.

[14] S. Midkiff, "Local iteration set computation for block-cyclic distributions," Proceedings of the 24th International Conference on Parallel Processing, August 1995.

[15] S. Ben Hassen and H. Bal, "Integrating task and data parallelism using shared objects," Proc. of the 10th International Conference on Supercomputing, 1996.
