
Hierarchical Programming and the DPP84



Dissertation

for the award of the degree of doctor at the Delft University of Technology, on the authority of the Rector Magnificus, Prof. dr. J.M. Dirken, to be defended in public before a committee appointed by the Board of Deans on Thursday 18 December at 14.00 hours

by

Anthonie Bastiaan Ruighaver, physics engineer, born 24 December 1951 in Bennebroek.


This dissertation has been approved by the promotor, Prof. dr. ir. L. Dekker.



Sigmund Freud

To Christel, Chiel and Lisa.


Summary

Large-scale parallel computers, designed for a static model of computation, are difficult to program using current parallel programming methods based either on implicit parallelism or on explicit parallelism using the asynchronous process model. Therefore, the majority of projects in the area of large-scale parallel processing accept the use of low-level programming and special-purpose architectures.

This thesis emphasizes the application of high-level programming concepts in parallel processing and their influence on the design of a parallel architecture suitable for a broad spectrum of applications. It describes the deficiencies of the above-mentioned programming methods and offers a solution based on a user-friendly synchronous programming model of alternating computation and communication phases.

This synchronous model, in use on the Delft Parallel Processor (DPP), has led to the definition of an abstract parallel machine with an unlimited number of Virtual Processing Elements, which are fully interconnected. A Hierarchical Programming System is designed, which enables the preservation of explicit global parallelism, necessary for an effective use of Virtual Processing Elements. The strictly hierarchical structure of the resulting parallel programs offers many advantages from the point of view of software engineering. The interactive features of the Hierarchical Programming System, based on the implementation of the abstract parallel machine in an Automatic Scheduling and Task Allocation (ASTA) Operating System, compensate for most of the disadvantages of using the static model of computation.

Finally, the principal design issues of the Delft Parallel Processor DPP84, the successor of the DPP81, are discussed. Apart from the aspect of cost-performance, achieved by using concepts of Reduced Instruction Set Computers, and the programmability of the DPP, achieved through full interconnection between PE's, the architecture of the Delft Parallel Processor DPP84 has been tuned for optimal execution of task systems based on the synchronous programming model.

Samenvatting

Large-scale parallel computer systems based on a static model of computation are difficult to program with the existing programming tools, which either make use of implicit parallelism or are based on explicit parallelism according to the asynchronous "process" model. Most projects in the area of large-scale parallel processing therefore accept low-level programming of a computer system with an architecture specialized for the specific application.

This thesis describes the use of an advanced programming method for parallel processing and its influence on the design of a parallel architecture suitable for a wide range of scientific applications. The deficiencies of the programming methods mentioned earlier are indicated, after which the possibilities of a user-friendly synchronous programming model are treated.

This synchronous model of computation alternating with communication, in use on the Delft Parallel Processor, has led to the definition of an abstract parallel machine with an unlimited number of Virtual Processing Elements, which are fully interconnected. A Hierarchical Programming System ensures that the explicit parallelism at each program level is preserved as much as possible, so that sufficient parallelism is available for an efficient implementation of Virtual Processing Elements. The strictly hierarchical structure of the programs designed in this way has many advantages from the point of view of software engineering. Interactive use of the Hierarchical Programming System, made possible by the implementation of the abstract parallel machine in an Automatic Scheduling and Task Allocation (ASTA) operating system, will compensate for most of the disadvantages inherent in the use of the static model of computation.

Finally, the most important design criteria of the Delft Parallel Processor DPP84, the successor of the DPP81, are treated. The principal aspects are the cost-performance obtained by using concepts from Reduced Instruction Set Computers, and the programmability of the DPP resulting from the full interconnection structure between the Processing Elements. In addition, the architecture of the Delft Parallel Processor DPP84 has been tuned for optimal execution of task systems based on the synchronous programming model.

Contents

SUMMARY
SAMENVATTING
CONTENTS

Chapter 1. Introduction
   1.1 High speed computation
   1.2 Non-Von Neumann architectures

Chapter 2. Principal structures in parallel architectures
   2.1 Pipelined vector processing
   2.2 Processor arrays
   2.3 Associative processors
   2.4 Shared memory multiprocessors
   2.5 Multicomputers

Chapter 3. Parallel Programming concepts
   3.1 General concepts
   3.2 Concurrent programming constructs
   3.3 Modula-2
   3.4 ADA
   3.5 OCCAM
   3.6 Parallel structuring

Chapter 4. Hierarchical Programming System
   4.1 Programming systems
   4.2 Hierarchy of modules
   4.3 Synchronous primitives programming model

Chapter 5. Software architecture
   5.1 Abstract parallel machine
   5.2 ASTA operating system

Chapter 6. Hardware architecture
   6.1 General concepts
   6.2 Reduced Instruction Set Computers
   6.3 Hierarchy of control
   6.4 Communication

Chapter 7. Delft Parallel Processor DPP84
   7.1 Structure
   7.2 Organization
   7.3 Implementation
   7.4 Performance

Chapter 8. Conclusions

REFERENCES
APPENDIX
CURRICULUM VITAE

Chapter 1

Introduction

1.1 High speed computation

Inadequate computational speed has been the major limitation throughout the history of electronic computing. Although many other factors might seem to limit the applicability of modern information processing technology, most of them become irrelevant as soon as abundant cheap computing power is available. Even the present-day software crisis may in part be blamed on the many cases where computational efficiency is still considered to be more important than flexibility and standardization. The increasing pace of technological progress in the last decade has now started a slight change in attitude. The widespread acceptance of the UNIX operating system, which does not offer effective use of computer resources, shows the benefits of also applying other criteria in the design of Information Processing Systems.

Most designers of current computers had to rely on the type of components available on the market. In the near future real top-down design of computer systems will become common, implying that the components are designed last. As a result, complexity will be reduced and performance improved. The most beneficial in this respect is the use of Very Large Scale Integration [BURG84]. The high density of VLSI enables the building of small-sized high-performance computers with fewer components. At the same time the shorter distances between gates increase the speed of VLSI circuits. Much of this speed is lost, however, when signals have to be transported off-chip. Hence the enormous speed up of the latest generation of microprocessors, with the complete processor and fast cache memory on one chip. Compared with early designs using vacuum tubes, the speed of the processor has increased four orders of magnitude [BAER84].

Until now computing has been dominated by the sequential nature of the Von Neumann architecture. This architecture, in its original form, is characterized by a single connection between memory and central processing unit (CPU). Hence, all algorithms are designed as a sequence of operations to be performed one at a time. Current Von Neumann type computers no longer perform all operations sequentially. Extensive use is made of parallel functional units and pipelining. Essentially, parallelism is just a system feature by which more hardware is used concurrently on the execution of a computing workload [INF077]. A computer system is only considered to be a parallel system when the description of the architecture includes some of this parallelism, to be used by the software.

The decreasing cost per unit of performance in processor technology enabled the introduction of functional parallelism at the system level. The differentiation between I/O processors and the main processor, first introduced in the Larc system [FERN81], has been very successful.

For compute-intensive applications the use of attached application processors also became popular. Especially peripheral array processors proved to be cost-effective [KARP81] and the literature on those machines and their applications is extensive [LOUI81]. In many recent architectures I/O processors and application processors (scalar or vector) are standard features, and designers of next generation computers cannot disregard the use of functional parallelism at the system level.

For many applications in scientific computing, however, faster technology and the use of functional parallelism will not lead to adequate speed up. The fact that we are soon reaching fundamental limits in the speed of components will result in a growing interest in large-scale parallel systems. Especially multi-microprocessor design is booming. Most ideas on parallel architectures are, however, relatively old. An excellent review is found in [ZAKH84]. The improved technology of today enables us not only to reevaluate them but also to combine them [REQU83].

As a result of this combination of architectural features, the classification of parallel processors is complex. The most used Flynn classification [FLYN72], which divides parallel architectures into Single Instruction Multiple Data stream (SIMD) and Multiple Instruction Multiple Data stream (MIMD), will soon be superseded. Next generation supercomputers will all have a MIMD structure and SIMD (section 2.2) will be nothing more than an internal computation mode for these machines. Special purpose SIMD architectures and pipelined arithmetic units may be part of the internal structure of a processor. With the introduction of vector instructions the sequential nature of programming for these SIMD processors is retained.

Until now, the Von Neumann stored-program computer model has had no serious competition. Next generation sequential computers will have a performance improvement of at least one order of magnitude, owing to both better technology and better design techniques. The market will therefore still be dominated by Von Neumann type microcomputers used in embedded systems and personal workstations. Their speed will be enough for most applications like text processing, communication with the outside world and small-scale scientific work.

Although many personal workstations will be connected to large Local Area Networks, it is not yet clear whether it is economical to organize them in such a way that idle units may be used to create a multicomputer system. This is the realm of Distributed Processing [MART81]. It will probably be more economical to use a parallel processor in the network as an application server. Furthermore, one should realize that a tightly coupled parallel processor is a more general purpose tool than a loosely coupled distributed system.

To enable easy migration of programs from sequential workstations to parallel application servers, it is necessary to develop abstract parallel machines [TROT85] suited to scientific applications and acceptable to users of sequential machines. When an abstract parallel machine offers enough advantages for (system) programmers to use as a higher level abstract machine on top of a sequential machine, then large-scale parallelism might no longer remain an exception in computer architecture.

The subject of this thesis is the specification of an abstract parallel machine and the design of a software/hardware architecture for the implementation of the abstract parallel machine. This work is greatly influenced by the experience gained with the Delft Parallel Processor DPP81 [RUIG82]. This MIMD structured system, developed at the Department of Applied Physics of the Delft University of Technology [BROK83, SIPS84, KERC85], has been operational since 1981 (both hardware and software).

The group of researchers around the Delft Parallel Processor has always been involved in system simulation. Therefore, most of its current applications are found in the area of parallel simulation. My activities regarding the DPP81 have mainly been the design and development of (system) software.

The remainder of this first chapter discusses non-Von Neumann architectures in general.

In Chapter 2, the most frequently used structures in control driven parallel hardware are outlined.

Chapter 3 focuses on the concepts used for the expression of parallelism in programs. After a section on concurrent programming constructs, the use of the three most popular languages in the programming of MIMD structured machines is briefly described: Modula-2, ADA and OCCAM.

Section 3.6 presents a user-friendly approach for the formulation of parallel constructs.

A description of the Hierarchical Programming System, given in chapter 4, forms the central part of this thesis. This integrated environment, which originates from the toolset of the DPP81 for use in Simulation and related disciplines, has been the driving force leading to the design of an abstract parallel machine, on top of which it is meant to be implemented.

The software aspects of the abstract parallel machine are discussed in chapter 5.

In chapter 6 some general aspects of the hardware architecture for next-generation MIMD computers are discussed. An implementation of a parallel architecture that reflects this description to a great extent is the DPP84. However, the design of the DPP84 started as a redesign of the DPP81 and is therefore heavily influenced by the need to remain compatible and especially the need to deliver a reliable commercial product within a limited time.

The principal aspects of the architecture of the DPP84, i.e. its structure, organization, implementation and performance, are discussed in chapter 7.

1.2 Non-Von Neumann architectures

The design of advanced non-von architectures reflects the search for alternative program organizations enabling efficient exploitation of the available parallelism. Hence, the main categories are classified according to the flow of the computations within a computational unit.


One distinguishes:

- control driven computation
- data driven computation
- demand driven computation

In control driven computation the flow of control is explicitly defined. There is no direct relation to the flow of data.

In contrast, data driven computation implies that a computation task may be executed as soon as all the corresponding input data are available.

In demand driven computation a computation task will be invoked when one of its output arguments is needed.

The fundamental issue in the search for new architectures is the controversy between the dynamic and static model of execution. Both data driven and demand driven computation lead to dynamic allocation of executable entities to the computing elements. This looks attractive, but it results in a certain overhead, either in hardware or in software, compared to control driven computation.

The data driven approach originates from the wish to exploit the natural parallelism as found in the compilation of sequential programs. It is based upon the direct execution of dataflow graphs [DAVI82] that are used in the performance optimization phase of a compiler. Hence, the research in dataflow architectures has been aimed at fine grain parallelism, i.e. parallelism at the level of multiplication and addition operators.

A pioneer in this field is Dennis [DENN80] at the Massachusetts Institute of Technology.

Static mapping of a dataflow graph on a control driven non-von architecture will in general lead to non-optimal use of the arithmetic processors. To enable dynamic mapping, a reference is attached to each produced result, pointing to the instructions needing it. Those instructions, for which all input operands are available, are scheduled for execution in an available arithmetic unit. In a data flow computer based on the general architecture of figure 1.1, the update unit places the addresses of activities ready for execution in a queue. Such an activity consists of an instruction together with its operands. The fetch unit retrieves these activities and sends them to an available functional unit.

In this way a dataflow computer tries to make maximal use of the available parallelism in a program. Even more appealing is the ability of more elaborate dataflow architectures to exploit the parallelism existing between successive iterations in a program loop, resulting in an even better utilization of processor resources.

Figure 1.1: Circular pipeline in data flow architecture.
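
The circular pipeline of figure 1.1 can be made concrete with a small simulation. The sketch below is only illustrative (it is not the organization of the DPP or of any particular dataflow machine, and all names and values are invented for the example): an activity store holds instructions with their operand slots, the update unit routes results to their consumers and enqueues activities whose operands are complete, and the fetch unit dispatches them to a functional unit.

# Illustrative sketch of the circular pipeline of figure 1.1 (hypothetical example).
from collections import deque

# Activity store: operator, operand slots, and the references attached to the
# produced result (the consuming activities and the slots they fill).
activities = {
    "a1": {"op": lambda x, y: x + y, "in": {"x": 2, "y": 3}, "dest": [("a3", "x")]},
    "a2": {"op": lambda x, y: x * y, "in": {"x": 4, "y": 5}, "dest": [("a3", "y")]},
    "a3": {"op": lambda x, y: x - y, "in": {"x": None, "y": None}, "dest": []},
}

def ready(name):
    return all(v is not None for v in activities[name]["in"].values())

queue = deque(n for n in activities if ready(n))    # the update unit's queue

while queue:
    name = queue.popleft()                          # fetch unit selects a ready activity
    act = activities[name]
    result = act["op"](**act["in"])                 # a functional unit executes it
    print(name, "->", result)
    for dest, slot in act["dest"]:                  # update unit distributes the result
        activities[dest]["in"][slot] = result
        if ready(dest):
            queue.append(dest)                      # all operands present: ready to fire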

However, the remark should be made that it is doubtful whether the theoretical gain in efficiency of a dataflow approach is enough to compensate for the increased overhead. The rapidly increasing speed of pipelined arithmetic units compared to the speed of memory and communication hardware has apparently prevented the implementation of dataflow concepts in current supercomputers. The use of VLSI technology has dramatically decreased the cost of arithmetic hardware while communication has become relatively expensive. Efficiency of communication and efficient use of memory will therefore be more important than the efficiency of arithmetic units. Other problems in the use of the dataflow paradigm can be found in an excellent criticism by Gajski et al. [GAJS82].

All in all, the programming problems around control driven parallel processors might be easier to solve than the technological problems around the introduction of a dataflow architecture as a cost-effective alternative. Still, dataflow machines have an intuitive appeal to many computer scientists. Maybe this is the reason why optimistic reports regarding their use in next-generation computers are found in many recent publications. Even the Japanese Fifth Generation Project pays much attention to the dataflow approach [TREL82a].

Lately, the combination of dataflow with other more traditional architectural structures has become a popular research topic [TREL82b]. The performance of vector supercomputers is heavily degraded by even a small amount of non-vector code [RUDS77]. Hence the proposal of a Piece-wise Data Flow Architecture [REQU83] that uses a small-scale dataflow processor to improve non-vector operations. It should be noticed, however, that even the dataflow approach cannot speed up the execution of purely sequential computations. In these cases, where each computation depends on the result of the preceding computation, the speed up will always be limited by the scalar speed.

The least explored category of non-von architectures is the one using the demand driven paradigm. Recently, a growing interest in Artificial Intelligence is also stimulating the design of parallel demand driven (reduction) machines such as ALICE [DARL81]. The problems in this area of parallel processing are not essentially different from those in the other categories. The question whether to use implicit [CONE81] or explicit [FRIE78] parallelism and the choice of communication [VEGD84] seem to be as difficult as in more conventional parallel architectures.

As in dataflow computation, the use of a dynamic computation model results in excessive use of communications or the need for ample parallelism compared to the number of processors. On the other hand, the use of a dynamic demand driven approach might perhaps be easy to combine with an otherwise statically operated control driven architecture. The availability of such a special operation mode might prove to be useful in those applications where abundant parallelism is available and the use of recursion offers unmistakable advantages for the programming.

In this thesis, however, let us restrict ourselves to the design of a software and hardware architecture for a machine based on the static model of computation.

Chapter 2

Principal structures in parallel architectures

2.1 Pipelined vector processing

Pipelining is the form of parallelism most used in computer architecture today. It is found at every level of computer design. In this section only the use of pipelining (figure 2.1) at the arithmetic level [TATE84] will be discussed. The first processors using pipelined floating point units, the CDC Star-100 and the TI ASC, were available on the market in the period between 1970 and 1976 [HWANG84a].

The bottleneck at that time was the software support. Efficient FORTRAN compilers, that could extract the available parallelism from computational loops, still had to be developed. Hence the development of shared resource vector processors [FLYN70], where the pipeline is shared by independent data streams. A commercial product of this type of MIMD processors is the HEP [SMIT78]. The recent improvement of vector compilers has superseded this approach.

The success of the pipelined approach is due to its economy. Simple pipelined units are only a little more complicated than non-pipelined units and still achieve a significant speed up. The more complicated multipurpose pipelined units have been less successful. Owing to the increased set-up time of these units, they are only efficient with long vectors. The ability to achieve sufficient speed up for short vectors is an important criterion in the acceptance of a vector processor as a general purpose tool.

Another problem with pipelined supercomputers is the design of memory and channels to feed the pipeline. The increased speed of a longer pipeline enforces more expensive solutions for these units and may prove not to be a cost-effective approach at all.

Finally, experience has shown that vector speed needs to be related to the scalar speed [BUCH84]. Processors with too high a vector speed compared with their scalar speed cannot achieve enough speed up in most computations, owing to the large percentage of scalar code in the programs. This fact is often referred to as Amdahl's Law [LUBE85]. In this light, it is mainly the large scalar speed of the second generation supercomputers, starting with the Cray 1 [BASK77a], that is responsible for their enormous success.
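
Amdahl's Law is not written out above, but in its usual formulation (standard in the literature, not specific to this thesis) it states that, when a fraction f of a computation must be executed at scalar speed and the remaining fraction 1 - f is sped up by a factor p,

    S = \frac{T_1}{T_p} = \frac{1}{f + (1 - f)/p} \le \frac{1}{f} .

For example, with f = 0.1 the overall speed up can never exceed 10, however fast the vector units are; hence the importance of a high scalar speed.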

Vector processing alone does not offer the solution to the ever increasing demand for computing power. The speed up achieved with one pipeline has not been enough and multiple pipelines are now common. Unfortunately, the performance of these new supercomputers (section 2.3) is somewhat limited, owing to inadequate communication capabilities. The use of shared memory or shared registers has proved to offer insufficient speed up for single jobs.

Figure 2.1: [timing diagram of an instruction pipeline: four instructions of four equal steps each, overlapped over clock cycles 1 to 7]

Note: Each instruction consists of 4 steps of equal time length.

2.2 Processor arrays

The progress in silicon technology was at first only advantageous for processor design. It enabled the use of more functional units inside the processor and gave rise to a family of SIMD architectures [SLOT81]. The main characteristic of these "parallel processors", such as Solomon, ILLIAC IV [BARN68] and more recently the Burroughs Scientific Processor [STOK77, KUCK82], is a centrally controlled array of simple Processing Elements (PE's). Interconnection between the PE's is normally restricted to nearest neighbours. A description of the many SIMD architectures can be found in [THUR76].

The pipelined arithmetic units in the current generation of supercomputers have since proved to be more successful than the use of SIMD structures. SIMD machines are only useful on very long vectors and are therefore more special purpose than pipelined vector machines. In addition, special control hardware is needed to work with vectors whose length is not a multiple of the array dimension [KOZD80].

One surviving commercial SIMD machine, the Distributed Array Processor (DAP) [FLAN77], can only compete with other supercomputers in a restricted application area. The DAP, manufactured by International Computers Limited in England, consists of a two-dimensional array of 64 * 64 processing elements. Each processing element is connected to its 4 immediate neighbours with 1 bit wide data paths (figure 2.2). Since the PE's are bit-organized, there is an enormous flexibility of precision and representation of data. Although the arithmetic is dependent on low level software (algorithms) [PARK82], the trade-off between precision and speed [PARK83] makes the DAP attractive in areas such as image processing.

A similar bit-oriented 4-neighbour connected architecture, especially designed to process satellite imagery, is the Massively Parallel Processor [BATC80].


With the further advances in VLSI technology the distribution of control became less expensive and the emphasis shifted to MIMD architectures. Recently, however, it is becoming feasible to integrate more nodes of a SIMD structure in one chip. Hence, a renewed interest in Multiprocessor Lattice Architectures [DEW82], such as the Wave Front Array Processor [KUNG82a] and the Configurable Highly Parallel Computer (CHiP) [SNYD82]. Another emerging VLSI architecture is the Systolic Array [KUNG82b], using hardwired (non-programmable) processing cells.

It should be realized, however, that the increasing density and die-size of VLSI chips will soon also enable the integration of MIMD structures. The role of the above-mentioned SIMD architectures will be limited to a few special-purpose applications.

Figure 2.2: [DAP structure: register set, row and column highways, broadcast to the Processing Element Array, instruction buffer]

Note: Memory of Processing Element Array also contains the program instructions used in the central control.

2.3 Associative processors

An associative processor is a parallel processor which uses an associative process for the activation of its Processing Elements. Although all large-scale parallel architectures have associative properties, there is also a class of special purpose architectures called Content Addressable Memories (CAM). They are well suited for information processing problems where storage locations have to be found by their content. Not only database searches belong to this application area, but also such operations as the determination of the maximum or minimum over a certain dataspace.

Apart from the use of small CAMs within a processor for the implementation of virtual memory techniques, associative processing has not yet found its way to the general market. Like other parallel architectures, user programming in a high level language poses a problem [INF077].

The most straightforward implementation of a CAM is organized in cells with mask and compare hardware in parallel with every bit of storage. Such a configuration uses much hardware but achieves maximum speed up. A problem is the transfer of data from secondary memory to the CAM. Unless the whole dataspace can be kept (semi-)permanently in the CAM, the speed of the machine is determined by the I/O speed. Hence the use of slower bit-serial cells such as in STARAN [MEIL81].

The topological structure of associative processors does not necessarily include connections with other cells. An example of such an ensemble of processors is PEPE, a special purpose processor used in defense applications [MART77].

As mentioned, all parallel architectures have associative properties. Therefore, it is hard to foresee the role of CAM in combination with other parallel architectures. Perhaps future versions of the Delft Parallel Processor will prove the value of CAM in a MIMD architecture, since work is being done to equip the PE with a special coprocessor based on CAM. In any case, it is extremely important for a designer of a parallel architecture to make sure that no bottlenecks are introduced, inhibiting the use of the implicit associative properties of his machine.

2.4 Shared memory multiprocessors

For computer architects memory has always been an expensive and scarce resource [KUCK77]. This has led to the use of extensive memory hierarchies of up to 5 layers [INF084]. At the top of the hierarchy are the data registers, then a cache followed by main memory. At the bottom are secondary memory devices, sometimes preceded by bulk memories or other cache systems. To make effective use of these hierarchies many software techniques were developed, with multiprogramming and virtual memory as the most successful ones. A logical step is also the sharing of this resource by a number of processors. The resulting multiprocessors are designed to increase the throughput of the computer system, with each processor running an independent job. Although only systems with a few processors proved to be efficient, the increased availability through the inherent redundancy in hardware of these multiprocessors is an important advantage [ANDE81].

In scientific processing the demand for interactive computation is increasing, while the price of the equipment is rapidly decreasing. Interactive use of programs implies a fast turnaround, and the execution time of an individual program is now more important than a fast throughput of a set of programs. When each processor in a multiprocessor runs an independent job, the turnaround time of these jobs increases, owing to memory conflicts [JOSE84]. Of course the processors could also cooperate on one job. If we overlook the partitioning problem for now, in general the speed up of this approach is limited (figure 2.3). The main bottleneck is access to main memory. To increase memory bandwidth, interleaved memory [HOCK81] is often used. However, when data have to be shared between some processors, the shared memory banks still need to be faster than those used in a single processor machine.

Since most general purpose supercomputers appearing on the market now, such as the Cray X-MP [CHEN84] and the ETA 10, combine multiple pipelines with shared memory or shared registers, the aspects discussed are especially relevant to these machines.

Figure 2.3: General trend of shared-memory multiprocessor speed up (speed up versus number of processors, compared with the ideal linear speed up).

2.5 Multicomputers

There has always been a clear distinction in the structure of a computer system between the processor and the storage devices. With the increase of density in VLSI technology and the interest in parallel processing, the time has come to reevaluate the position of the main memory. In a multicomputer system (figure 2.4) main memory is distributed together with the processors. The use of separate memories is necessary both to improve the memory bandwidth of the system and to ensure that at least part of the memory is in close proximity to the CPU to achieve high speed operation.


Many multicomputer systems, however, still use the concept of main memory as the central storage of programs and data. A system like CM* enables each node to access every part of main memory through a hierarchically distributed switching structure [GEHR82], equipped with special mapping processors. Hence, the only difference with a shared memory multiprocessor from the programmer's point of view is the access-time hierarchy of the memory banks: access of local variables is several times faster than access of global variables. The use of these distributed multiprocessor architectures will be limited to the execution of loosely coupled task systems.

Figure 2.4: General multicomputer architecture (processors with local memories connected by a global interconnection system).

With the arrival on the market of the Transputer and its programming system OCCAM (section 3.5), the emphasis is finally shifting to real multicomputer systems. The abandoning of the main memory concept, in both architecture and programming, opens the door to effective utilization of large-scale parallel processing.

The Transputer is a single chip design (figure 2.5), containing a processor, local memory and links for connecting to other Transputers. A 16 bit and a 32 bit Transputer are available now, while a Floating Point Transputer is expected soon. The processor of the Transputer is optimized for expression evaluation and process scheduling. It is a Reduced Instruction Set Computer (section 6.2) with a small 3-level hardware stack for the storage of intermediate results. Instead of the large register file normal for RISC designs, a workspace in local memory is used, which can be accessed in one processor cycle.

Figure 2.5: Transputer architecture (processor, on-chip memory, links, off-chip memory interface).

The coupling of Transputers is limited to point to point communication with 4 neighbours (10 Mbytes/s). This allows easy extension of a Transputer network, but limits the applicability of such a system. In practice, one single Transputer is not sufficient to be used as a Processing Element. Normally, two or more Transputers are needed to create a node with more links (figure 2.6). To increase the processing power a system designer may add additional arithmetic (vector) units, as is done in the Tesseract series of Floating Point Systems. As a result, it will also be necessary to design a new compiler for such a Processing Element. The accessibility of a PE should be increased, for instance by using the memory interface of the Transputer to link to a global communication system.

The programmer of multicomputers based on the Transputer will often be responsible for the allocation of tasks to each node, taking into account the limited communication capabilities. Less efficient, but very popular, will be the use of dynamic allocation of processes in a one or two dimensional pipeline of PE's (message passing architecture). The overhead caused by the dynamic allocation of processes and the large delay in the transfer of messages limit the applicability of these machines to workloads with much processing on small amounts of data.

Figure 2.6: [two Transputer node configurations, (a) and (b)]

Chapter 3

Parallel Programming concepts

3.1 General concepts

A language is a problem oriented programming tool and apart from the generally accepted structured programming techniques, every language has its own syntax rules adjusted to the specific problem area. As a result any successful language stands out above the others when it is used for the purpose for which it has been designed.

The best example is FORTRAN, designed for mathematical formula translation and first released to customers in 1957 [HUNT82]. Although revised a few times, FORTRAN is still the language most used in scientific computing. The research in languages for parallel processing has not yet delivered a successful candidate for taking over this role. Even in the area of vector processing, where many FORTRAN compilers have been extended with syntax constructs [PAUL82, HWAN84b] to ensure efficient use of the vector hardware, no standardization is yet to be found.

Until now, most research on the programming of parallel processors has been directed at the detection of dependences in programs to extract the implicit parallelism [RAMA69]. Despite the enormous effort in this area of research, especially at the University of Illinois by Kuck et al. [PADU80, KUCK84], this has not resulted in a real breakthrough for the programming of MIMD structured parallel processors. Only for the present generation of supercomputers, which pair their vector processing units with fast scalar processors, have the results been useful. Even for these machines, the execution of old FORTRAN programs gives only a limited performance when compared with newly programmed applications. This proves how important it is for the programmer to know the many pitfalls which inhibit efficient extraction of parallelism.

Because the traditional imperative class of languages allows all sorts of constructions from which parallelism is hard to detect, attention has been focused on the class of applicative or functional languages. These languages, like (pure) LISP and Backus' FP [BACK78], are based on the application of functions to values, eliminating references to variables, which may cause so-called side effects [ACKE82]. These side effects, for example the modification by a procedure of the value of variables in the calling program, are difficult to prevent in the present imperative languages.

Dataflow languages [TESL68] belong to the class of functional languages aimed at scientific computing and developed for use on dataflow computers. These languages offer no features for the expression of explicit parallelism but rely on other conventions, such as the single assignment rule [SYRE82], to diminish the dependencies between statements. In a language using this single assignment rule, a variable may appear on the left side of only one statement in a program unit.

However, the radically different programming style is not likely to be accepted on a large scale, and the claim that these languages achieve much better parallelism does not seem to hold in scientific applications [GAJS82].

Whenever a programmer has to rely on tools to detect implicit parallelism, he has no explicit control over the amount of sequential code left in his program. This sequential residue in the code will often decide the efficiency of execution on a parallel architecture, not the amount of parallelism. Amdahl's law, as discussed before, is especially valid for large-scale parallel machines, and any program to be implemented on such a machine should be practically free of sequential code.

To clarify the importance of this fact, a simple example will be discussed using a set of 100 independent (parallel executable) tasks to be executed on a machine with 100 identical processors. The execution time of each task may vary between 1 and 2 units of time and, to simplify the example even more, the worst case of one task executing in twice the time of the others will be used (figure 3.1a).


Figure 3.1: Two simple task systems.

It is easy to see that the utilization of the PE's will be approximately 50%. It should also be clear that, when it is possible for the machine or programmer to balance the load, the same speed up could be achieved with 51 PE's. The execution of this task system may be compared with that of a second task system in figure 3.1b. In this new system we have 100 tasks of equal length followed by a piece of sequential code. The amount of sequential code is extremely small, only 1% of the total sequential execution time. However, this small amount of code limits the utilization of the PE's to 50% and halves the maximum speed up.
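
As a back-of-the-envelope check (assuming, as figure 3.1 suggests, that every short task takes one unit of time; the figures below are merely this example worked out): for the first task system the total amount of work is

    T_1 = 99 \cdot 1 + 2 = 101 ,

while 100 PE's finish after T_{100} = 2, so the speed up is S = 101/2 \approx 50 and the utilization is U = 101/(100 \cdot 2) \approx 50\%. With 51 PE's (the long task alone on one PE, the 99 short tasks two per PE on the remaining 50) the completion time is still 2, giving the same speed up at U = 101/(51 \cdot 2) \approx 99\%. For the second task system the sequential piece amounts to roughly 0.01 \cdot T_1 \approx 1 unit, so 100 PE's need 1 + 1 = 2 units: again a speed up of about 50, half the ideal speed up of 100.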

Parallelism is a natural concept, and there is no reason not to allow a programmer to use explicit parallelism in the description of his program. Early implementations of explicit parallelism, based on the Fork-Join concept [TREL79, SCHW86], have not been successful. The same is true for the different Doall and Forall constructs [PADU80] used to express the parallelism in a loop construct. The basic problem is the lack of scope control in these constructs. The variables are not local to the different parallel branches of the construct and the use of common data structures is not inhibited. As a result the program output may depend on the scheduling of the computations and on the relative speed of the processors used. As in languages for concurrent programming [BRIN79], additional constructs are needed to control (lock and unlock) the access to shared resources. Before we discuss the programming constructs involved in a section on concurrent programming, a discussion of some other concepts, used in the latest generation of programming languages, is in order.

The relatively new discipline of software engineering has initiated some important developments in directions like data abstraction and modularization [WULF80]. Programming languages like FORTRAN emphasize computational abstraction, e.g. the use of function and procedure mechanisms. The procedure provides structure to a program and also creates an environment to control the access of local and global data. The implementation of this last aspect [BISH80] is of primary importance for the efficiency of a program.


In modern high level languages computational abstraction has been succeeded by data abstraction and program abstraction [BUZZ85]. The popularity of the rather poorly defined [FREE83] term 'object oriented' suggests a large-scale acceptance of these concepts. In general, an object oriented system is based on the application of functions to objects. The goal is to protect these objects from being accessed or modified except by a set of fixed procedures, which are grouped together with the objects in a module. Often the lack of additional concepts [GOGU86] for the control of the complexity of the software results in the unfortunate use of design methods emphasizing the construction of modules with small and well protected interfaces. Fortunately, the program abstraction concepts available in such a language enable the experienced programmer to compensate somewhat for the inflexibility of such a module interface.

The use of program abstraction in programming will have an important influence on the development of programming tools for a parallel processor. Especially the possibility to create instances of a procedure (generic concept [BEID86]) enables explicit control of the structuring of programs and can already be found in most concurrent languages used for parallel processing. However, the use of program abstraction in an environment relying on the extraction of implicit parallelism, may have less positive consequences.

3.2 Concurrent programming constructs

The growing complexity of operating systems and embedded software has been stimulating the use of concurrent programming tools. Implemented on a sequential processor, concurrent programs can exploit the ability of the peripheral devices to operate in parallel with the processor. Hence, the emphasis in concurrent processing has largely been on the building of operating systems where the sharing of resources is a major issue [BRIN77].


Therefore, the concepts used should ensure mutual exclusion, i.e. the exclusive access to a resource. The earliest mechanism for concurrent programming, the semaphore, is only a low-level synchronization tool and its use has therefore led to many problems. An overview of the more advanced mechanisms, such as the use of critical regions and monitors, can be found in [GEHA84]. Most recent concurrent programming concepts are based on the cooperation of sequential processes, communicating by first synchronizing and then exchanging information [HOAR78]. The implementation of this principle in the language ADA is called a rendezvous.
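
The need for mutual exclusion can be illustrated with a small sketch (illustrative only, written in present-day Python rather than in any of the languages discussed here; the worker function and the counts are invented for the example). Two processes increment a shared counter; without a lock the read-modify-write sequences interleave, updates are lost and the result depends on the scheduling.

# Illustrative sketch of why mutual exclusion is needed (hypothetical example).
import threading, time

counter = 0
lock = threading.Lock()

def worker(use_lock, n=1000):
    global counter
    for _ in range(n):
        if use_lock:
            with lock:                 # critical region: exclusive access
                tmp = counter
                time.sleep(0)          # an interruption here is now harmless
                counter = tmp + 1
        else:
            tmp = counter              # read ...
            time.sleep(0)              # ... invite a process switch ...
            counter = tmp + 1          # ... and write back a possibly stale value

def run(use_lock):
    global counter
    counter = 0
    threads = [threading.Thread(target=worker, args=(use_lock,)) for _ in range(2)]
    for t in threads: t.start()
    for t in threads: t.join()
    return counter

print("without mutual exclusion:", run(False))   # usually far less than 2000
print("with mutual exclusion:   ", run(True))    # always 2000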

The improved efficiency of concurrent programs on a sequential processor does not necessarily lead to an effective implementation on computers offering genuine parallelism. The allocation of processes to processors is a complex problem and whether an effective solution is possible depends greatly on the structure of the task system. In addition, the use of these concepts in the interaction between processes and for the exclusive access of shared data results in a significant overhead limiting the applicability. For instance, to achieve large scale parallelism in those applications that involve vector and matrix operations, one should not use concurrent processes. Programmers working on the Cray X-MP should choose the microtasking facility, instead of using multitasking [LARS84, MEUR85].

While the use of concurrent programming on parallel machines is stimulated by the current generation of (system) programmers, its introduction in the area of scientific computations may not be successful at all. Concurrent programming is generally considered to be more difficult than sequential programming. This is a result of the dynamic nature of processes; a process in execution may be active or inactive, depending on the communication with other processes or on the availability of resources. Hence, it is difficult to judge the efficiency when a set of processes is executed. In most concurrent languages there is even a possibility of deadlock, i.e. two or more processes that wait on each other forever.
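
A deadlock of this kind is easy to reproduce; the sketch below is purely illustrative (again in Python, with invented resource names), showing two processes that each hold one resource while waiting for the other's, the classical circular wait.

# Illustrative sketch of a deadlock (hypothetical example).
import threading, time

r1, r2 = threading.Lock(), threading.Lock()

def p1():
    with r1:
        time.sleep(0.1)
        with r2:            # waits forever: p2 holds r2 and waits for r1
            pass

def p2():
    with r2:
        time.sleep(0.1)
        with r1:            # waits forever: p1 holds r1 and waits for r2
            pass

t1 = threading.Thread(target=p1, daemon=True)
t2 = threading.Thread(target=p2, daemon=True)
t1.start(); t2.start()
t1.join(timeout=1.0); t2.join(timeout=1.0)
print("deadlocked:", t1.is_alive() and t2.is_alive())   # prints True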


As will be shown in the following sections on three popular languages for concurrent programming, the use of these languages does not offer a solution for the efficient (automatic) allocation of processes to the available computing elements.

3.3 Modula-2

The language Modula-2, designed by Wirth [WIRT85], is considered by many personal computer users to be the successor of PASCAL. Wirth did not intend Modula-2 to be used in the same application area as PASCAL. It was meant for the programming of real-time systems that also have concurrent processes.

The module concept, and the information hiding it provides, proved so useful in more general applications that Modula-2 can no longer be considered as just a language for system programmers. The success of Modula-2 in sequential programming is not matched by an equal success in concurrent programming. Where the other two languages discussed in this chapter are based on the use of processes, Modula-2 uses the lower level coroutine concept.

A coroutine explicitly transfers control to another coroutine [STAN82], thereby preventing the large overhead common in systems using process schedulers. Coroutines have the same properties as procedures, with the exception that they may return control to their calling programs before they are completely executed. With the next call they will resume execution from the point of suspension. Hence, coroutines have no master/slave relationship, but the flow of control is still dependent on transfer statements within each coroutine. Coroutines are therefore not suited for the expression of parallelism. To be able to use parallelism the user needs to implement both a process concept and a process scheduler.
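
The explicit transfer of control between coroutines can be sketched with Python generators (an illustration only, not Modula-2; the producer/consumer names are invented): each routine suspends itself and is later resumed from its point of suspension by an explicit transfer from the other routine, without any scheduler.

# Illustrative sketch of coroutine-style explicit transfer of control.
def consumer():
    while True:
        item = yield                      # suspend here until control is transferred back
        print("consumer resumes with", item)

def producer(cons):
    for item in ("a", "b", "c"):
        print("producer transfers", item)
        cons.send(item)                   # explicit transfer of control to the consumer

cons = consumer()
next(cons)                                # run the consumer up to its first suspension point
producer(cons)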

The absence of an implicit concept in the language for interprocess communication and synchronization is responsible for a promiscuous use of one-of-a-kind constructs tailored to some functional requirement, as has been the case in the early years of concurrent processing [SPIE74]. This makes the porting of the resulting programs to different parallel architectures very difficult.

Hence, the conclusion may be drawn that Modula-2 is only suitable for multiprogramming and its use for multiprocessing should be discouraged.

This leaves us with the use of Modula-2 in an environment relying on implicit parallelism. The detection of implicit parallelism within a procedure is not more difficult in Modula-2 than in FORTRAN, as long as no transfer statements are used. The presence of these low level control flow aspects may inhibit efficient retrieval of implicit parallelism and should therefore be excluded in a programming environment for parallel processing.

3.4 ADA

ADA has been developed under supervision of the Department of Defense in the USA as the only language to be used in defense (embedded) systems [ICHB83]. Like Modula-2, the basic ADA syntax is related to Pascal, but with a larger emphasis on various degrees of information hiding [BREN81]. An excellent introduction for experienced programmers can be found in [HABE82].

On the subprogram level ADA provides the programmer with two different concepts: the Package and the Task. The Package concept is similar to the module concept of Modula-2, in that it allows the grouping of a set of related procedures and the hiding of implementation details.

The Task is the ADA concept for the creation of concurrent processes. It also consists of a specification and a body. The Task specification contains the entry declarations, regulating the communication between the processes. These entries are used by other tasks to initiate a rendezvous, in a way similar to the call of a procedure. The body of the task contains an accept statement for each entry. The actual rendezvous takes place when the initiating task has executed the call and the called task reaches the accept statement. This accept statement contains a declaration with formal parameters and a body, much like a standard procedure. During the rendezvous only the body of the accept statement is executed and the calling task waits for it to finish. New demands for a rendezvous, either at the same entry or at another entry, are queued.
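
The rendezvous semantics just described can be mimicked in a few lines (an illustration only, in Python threads rather than ADA; the Entry class, the server/client names and the squaring body are invented for the example): the caller blocks until the called task reaches its accept, the accept body is executed by the called task, and only then does the caller continue; further calls simply wait in the entry's queue.

# Illustrative sketch of ADA-style rendezvous semantics (hypothetical example).
import threading, queue

class Entry:
    def __init__(self):
        self.calls = queue.Queue()             # callers waiting for a rendezvous
    def call(self, arg):                       # executed by the calling task
        done, box = threading.Event(), {}
        self.calls.put((arg, box, done))
        done.wait()                            # caller waits during the rendezvous
        return box["result"]
    def accept(self, body):                    # executed by the called task
        arg, box, done = self.calls.get()      # wait for (or take) a queued call
        box["result"] = body(arg)              # only the accept body is executed
        done.set()                             # rendezvous over: the caller resumes

compute = Entry()

def server():                                  # the called task
    for _ in range(2):
        compute.accept(lambda x: x * x)        # accept statement for entry 'compute'

def client(x):                                 # an initiating task
    print("client", x, "received", compute.call(x))

threading.Thread(target=server).start()
for x in (3, 4):
    threading.Thread(target=client, args=(x,)).start()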

Unfortunately, the possibility to execute sequential code during the rendezvous, though useful in systems programming, is a disadvantage when ADA programs are executed on a large-scale parallel processor.

Furthermore, ADA has no concepts for the explicit allocation of processes to specific processors nor for the definition of how to use communication hardware. Efficient static scheduling of ADA processes is in general impossible. Hence, ADA is most suited for concurrent programming of multiprocessors with a shared memory for program and data. Each idle processor may then look in shared memory for inactive processes ready to become active again.

3.5 OCCAM

A new programming language especially designed for multicomputers, OCCAM [TAYL82], uses two fundamental elements in the design of a program: Processes and Channels. The Channel is a one-way communication link, connecting two processes and thereby automatically taking care of synchronization.

OCCAM relies on the hardware, e.g. the Transputer, to make the implementation of processes cheap enough to enable the replacement of functions, normally accomplished by procedure calls, by concurrent processes. Hence, a process, responsible for the execution of a prescribed activity, can have formal parameters like a function or a procedure used in a sequential language.

Compared with ADA, the main advantage of OCCAM is the simplicity of the language. In OCCAM everything is a process and only a few, but strong, concepts are available. This makes OCCAM relatively easy to learn and improves the implementation of OCCAM programs on a parallel architecture.

The concept of an OCCAM channel is more restricted than the communication between ADA tasks. Because the channel is a dedicated link between two processes, it is easily mapped on point to point communication hardware. The effective utilization of parallelism between processes is also better enforced in OCCAM, since no sequential code is executed in the rendezvous between processes. In the ADA implementation large pieces of code may be present in the critical region part of the accept statement.

The programming of a Transputer network is not much different from the method used on the DPP81 and, for now, also on the DPP84. The user has to write a program for each node and explicitly define the communication between these programs. The only difference is that Inmos uses a formally designed language (OCCAM), while on the DPP the PE programs have to be written in a stack-oriented MACRO language [RUIG85]. It is obvious that the OCCAM toolset is more professional.

However, OCCAM, like C, is a language for system programming. The user has to take care of many details concerning the implementation on a multi-Transputer architecture. Hence, OCCAM does not enforce a clear parallel structure of the program and as a result allocation of processes and channels is a separate programming activity.

In OCCAM it is necessary for the programmer to use specific statements to enforce both placement of processes (Placed Par) and placement of channels. The resulting program level is called the harness. Because global variables are allowed, actual partitioning of the OCCAM program into suitable modules to be allocated to the processors is by no means a trivial task.

Both the use of global variables and the lack of enforcement of the structuring of a set of processes are also bottlenecks for the automatic generation of a harness. Heuristic approaches for task assignment [EFE82] may find these constraints to inhibit acceptable solutions.

3.6 Parallel structuring

In the preceding sections, the problems in using concurrent programming constructs for the programming of multicomputers have been discussed. The term concurrent programming was used as a synonym for an asynchronous programming model [MIKL84] of processes.

The synchronous approach in this section results in a more user-friendly programming concept for the expression of parallelism. First it is necessary to abandon the concept of memory as a resource, i.e. the sharing of variables. The control of access to shared data inevitably leads to a dynamic process concept. Because processes may be inactive (waiting on each other), constructs used for the explicit definition of parallel processes do not necessarily express the run-time parallelism available.

A parallel construct should be used to declare the independence of the computational entities described in the body of the construct. Hence, the program:

for i:=1 until N do parallel
   X1 := X2+C ;
   X2 := X1+D ;
end for

is equivalent to the program:

for i:=1 until N do
begin
   NEW(X1) := X2+C ;
   NEW(X2) := X1+D ;
end


The keyword NEW, as used in data flow languages [ACKE82], indicates the assignment of a new value to the variable after all statements in the loop have been executed. For the advantages of using the NEW construct in this loop, I refer to [SIPS84]. This NEW construct is, however, only useful in combination with the single assignment rule and even then dependencies may exist between statements within the loop.

Like the NEW concept, the construct for parallelism used above results in a synchronous compute-communicate-compute cycle, but at the same time it also enforces structure on a program. Of course, this imposes restrictions on one's freedom of programming, which is true for any high level programming language when compared to assembler. Whether these restrictions limit the use of a language depends completely on the other concepts available in the language and how they compensate for those restrictions.
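
As an illustration of this compute-communicate cycle (a sketch only, in present-day Python, not the notation or implementation of the DPP; the state dictionary and the constants are invented), the loop with the two parallel assignments given above can be evaluated as follows: all right-hand sides of one parallel phase are computed from the old values, and the new values become visible only when the phase is complete.

# Illustrative sketch of the synchronous compute-communicate cycle
# for:  for i:=1 until N do parallel  X1 := X2+C ;  X2 := X1+D ;  end for
def parallel_phase(state, assignments):
    new = {var: expr(state) for var, expr in assignments}   # compute phase: old values only
    state.update(new)                                        # communicate phase: publish results
    return state

state = {"X1": 1.0, "X2": 2.0}
C, D, N = 0.5, 0.25, 3

for i in range(N):
    parallel_phase(state, [
        ("X1", lambda s: s["X2"] + C),
        ("X2", lambda s: s["X1"] + D),
    ])
    print(i + 1, state)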

The application of this concept for explicit parallelism on some engineering problems can be found in [DEKK83b].

The remainder of this section will emphasize the structure of parallel programs. Note that the scope of the parallel statement is confined within the iterative construct, or is ended, as will be demonstrated, by the next parallel statement.

For instance, execution of the following program:

for i:=1 until N do
begin
   NEW(X1) := X4+C ;
   X3 := X2+X1 ;
   NEW(X2) := X3+D ;
   X4 := X1+E ;
end

is understood more quickly when written as follows:

for i:=1 until N do parallel
   X3 := X2+X1 ;
   X4 := X1+E ;
   X1 := X4+C ;
parallel
   X2 := X3+D ;
end for

The repeated use of parallel statements might be confusing at first. The body of the loop consists of a sequence of parallel constructs. The use of a new parallel statement ends the scope of the preceding parallel statement.

No sequential statement should be available. A program unit is either sequential or a sequence of parallel constructs. Hence, it should not be possible within one program unit to mix sequential code with parallelism, nor to build a hierarchy of parallel constructs. In this way the structure of each program unit stays simple and, when more complex structures are needed, the use of modular programming techniques is stimulated.

As discussed, current languages, even the object-oriented ones, do not sufficiently support modular programming. Hence, the quality of software still depends mostly on the methodology used in the decomposition of the problem [PARN72].

Since most concepts that contribute to a high quality structure of a modular program, such as cohesion and coupling of modules [BERG81], are not enforced by the language, they depend on the programming style. This lack of support for program partitioning limits the use of explicit parallelism in the creation of sufficient global parallelism.

Hence, the main objective of a programming language for parallel processors should be the use of explicit parallelism in the description of the structure of a program and the enforcement of hierarchical modularity. The lack of structural enforcement in languages currently used in parallel processing is the result of both the use of shared or global [WULF73] variables and the explicit programming of interprocess communication (dependencies).

The Hierarchical Programming System described in the next chapter, is based on the higher level approach of program structuring in a spatial and time order hierarchy, made possible by generic use of modules. This approach has already proven its value in the area of Simulation [DEKK84a].

Chapter 4

Hierarchical Programming System

4.1 Programming systems

The growing complexity and cost of software is stimulating the development of software engineering environments [HEND85].

Originally, most research on programming systems was aimed at programming in the small, e.g. the development of syntax-directed editors and other tools [WATE85] to aid the programmer in the development and documentation of applications. These tools improve both the speed of programming and the maintainability of the resulting code.

In the last decade there has also been a large effort to develop support for programming in the large [RAMA86]. The aim of these environments, such as the ADA Programming Support Environment (APSE) [BRAU81], is to support large teams of programmers in the development of software for an embedded system.

The use of the name Hierarchical Programming System might give the impression that it is related to the above mentioned programming environments.

However, the emphasis of these traditional programming environments is mostly on the compilation of programs. Unfortunately, the use of compilation techniques does not provide enough possibilities for interaction, especially in application areas such as Simulation and Computer Aided Design. For an engineer using a computer it is often not sufficient to change only the coefficients or input data of a program.

On current systems, changing a program introduces time delays owing to recompilation and linking, which interrupts the user's train of thought.

The use of Direct Executable Languages, like Microdare, Desire and Desktop developed at the University of Arizona [KORN85], offers a solution for small applications. These systems use a combination of an interpreter and a minicompiler to achieve a direct execution of the program and are therefore based upon the idea that the main involvement of a user with his computer is on the level of user programming.

Next generation computer systems should allow a user to interact with his machine on a higher level. Their programming system should be based on integrated environments, such as Interlisp [TEIT81], which encourage the user to experiment and to reuse already tested, and therefore reliable, programs in the construction of his own application. If a user is not content with the behaviour of his program, he can go down one level to see how the building blocks of that level (or even lower levels) are used, and improve his program: true stepwise refinement.

The aim of the development of a Hierarchical Programming System is to extend the modular programming style, made popular by Modula-2 and ADA, to an integrated environment suited for large-scale parallel processors running large-scale applications.

The emphasis of a high level programming system should not be on the implementation of algorithms; such a system should provide the means to understand and manipulate complex systems and components [WINO79].

The only way to control complexity is to pay more attention to the relations between the parts of the system [PRYW86] than to the analysis of these parts themselves.

To enable the user-friendly modification of applications in this way, it is necessary to enforce a clear structure on a global level through a hierarchical ordering of modules (section 4.2).

In addition to the global parallelism provided in this way, the presence of structure [BERG81] enables a clean separation between the user interface and the body of the application, with all the advantages [BRAN83] this provides.

The user interface should thus be separately incorporated in the programming system (programming by exception) and the user only needs to supply the hooks for this interface in each specific application.

The Hierarchical Programming System is an integrated environment for:

- interactive programming
- explicitly stating the hierarchy of tasks (in space as well as in time [DEKK83a]), using
  - either directly loadable tasks from a task data base
  - or defining a new task by using a high level language.

Basic tasks in this programming system are procedures, executed within one PE: program primitives. Their use in parallel processing dictates the use of the call by value mechanism [STAN82] for parameter passing between program primitives, i.e. only a value is transferred to the primitive.

The flexibility of Unix, derived from input and output redirection at the process level, will be achieved in the Hierarchical Programming System at the procedural level by a redirection of parameter passing. The code for transferring the value of the output variable of one program primitive to the input parameter of another program primitive is produced by the operating system (chapter 5).
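As an illustration only, a minimal C sketch of this idea (the names primitive_A, primitive_B and run_connected are invented, and the glue routine merely stands in for the transfer code that the operating system would generate): the primitives receive and return only values, and the glue redirects the output value of one primitive to the input parameter of the other.

    #include <stdio.h>

    /* Two program primitives; each receives and returns only values. */
    static double primitive_A(double a) { return a * 2.0; }  /* produces b */
    static double primitive_B(double b) { return b + 1.0; }  /* produces c */

    /* Stand-in for the transfer code the operating system would generate:
       it copies the output value of primitive_A into the input parameter
       of primitive_B, so the primitives themselves stay independent.     */
    static double run_connected(double a) {
        double b = primitive_A(a);  /* value produced by A             */
        return primitive_B(b);      /* ...redirected to the input of B */
    }

    int main(void) {
        printf("%f\n", run_connected(3.0));  /* prints 7.000000 */
        return 0;
    }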

The expected effect of these developments on programming is most easily demonstrated by looking at the impact of the UNIX programming environment [KERN81]. The redirection of program input and output, together with the connection of programs through pipes, resulted in the use of small and specialized programs (filters). The command interpreter used to initiate complex structures of concurrently running tasks in UNIX, the Shell, is itself a programming language. Hence, users are allowed to combine several programs into a new program.

The actual user interface of the Hierarchical Programming System may depend on the application area. For use in simulation a prototype for Successive Model Decomposition (SMD tool) [DEKK84a] has been created [GELD84]. The detailed execution of program primitives can be described by way of sequential languages, with FORTRAN and C as likely candidates.

To increase the portability of applications from this programming system to other environments, the building of a program generator has been foreseen.

4.2 Hierarchy of modules

A Hierarchical Programming System encourages the hierarchical partitioning of tasks in a top-down fashion, to achieve an improved functional performance. The term hierarchical is here used in a strict sense. A program will only have a true hierarchical structure, when the modules used on a certain level are free of side effects, i.e. execution of one module does not change any variable in another module. Hence, all connections between these modules are either on this level or on higher levels of the structure.

The resulting structure is shown in figure 4.1 and has the form of a simple tree.

Note: The modules at a certain level are completely encapsulated by the next higher level.

Figure 4.1 Hierarchical Structure of Program Primitives.

For reasons of efficiency, the hierarchical structure of the program should not be retained in the run-time code. Only the bottom primitives need to be accessible during the execution of the task system. The hierarchical structure on top should only be visible to the (user) interface.

It should be noted that the significant overhead (figure 4.2), resulting from the use of computational abstraction in programming languages that use the call by value mechanism, can be avoided by the implementation of a pure hierarchical environment.

To achieve this efficiency the enforcement of a pure hierarchical structure should be supplemented by a hierarchical scope concept (figure 4.3). Input parameters of a primitive should be made accessible (by explicit naming) at all higher levels.

Note: Procedure A calls B and C. Procedure B consumes data object a and produces b. Procedure C consumes data object b and produces c.

Figure 4.2: Data flow in a conventional procedural language.

There are two different implementation features of this hierarchical scope concept. First, one may assign or change the default value at higher levels. When more than one default value is assigned, the one at the highest level is valid. Assignments of default values at lower levels should then be removed before execution of the task system.

The other possibility limits the scope of a parameter to the current level. Hence, the interface of a module will be changed when this feature is used during the maintenance of that module. Of course, this can be easily checked when the interface is separately defined, as is common in modern languages.
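A small illustrative sketch in C of the first feature mentioned above (the table and names are invented, not the actual mechanism of the programming system): when defaults for the same input parameter have been assigned at several levels, the assignment at the highest level is the one that is used.

    #include <stdio.h>
    #include <string.h>

    /* One default assignment: which input parameter, at which level, what value. */
    typedef struct { const char *param; int level; double value; } Default;

    /* Resolve an input parameter by taking the default assigned at the highest level. */
    static double resolve(const Default *defs, int n, const char *param, double fallback) {
        int best = -1;
        for (int i = 0; i < n; i++)
            if (strcmp(defs[i].param, param) == 0 &&
                (best < 0 || defs[i].level > defs[best].level))
                best = i;
        return best >= 0 ? defs[best].value : fallback;
    }

    int main(void) {
        Default defs[] = { {"C", 1, 0.5},    /* default assigned at a low level      */
                           {"C", 3, 0.9} };  /* overriding default at a higher level */
        printf("%f\n", resolve(defs, 2, "C", 0.0));  /* prints 0.900000 */
        return 0;
    }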

Note: Accessibility of objects depends on their place in the hierarchy. In compound primitive E all data objects are visible. In compound primitive D, which has write permission for data objects a and c and read permission for b and d, data objects a, b, c and d are visible, but e and f are not, although e and f are visible in compound primitive E. Primitive A has input a and output b, primitive B has input c and output d, and primitive C has input e and output f. When compound primitive D connects input c of primitive B with output b of primitive A, the scope of c is restricted and c can no longer be accessed by compound primitive E.

Figure 4.3: Hierarchical scope.

With this restriction in mind, it is possible to declare the input parameter of a primitive connected (equivalent) with an output parameter of another primitive. Hence, whenever the output value is updated the input parameter is changed accordingly. Data are transferred directly from producer to consumer (figure 4.4). The value of an output variable is only copied to a higher level when it is actually needed at that level.

Note: Compare the data flow (b) directly from B to C with the traditional data flow in figure 4.2.

Figure 4.4: Compound primitive A is built from primitives B and C.
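A hedged C sketch of this producer-to-consumer connection (the Slot structure and names are invented): the output parameter of the producing primitive and the connected input parameter of the consuming primitive refer to the same storage, so the value is transferred directly, without a copy through the enclosing level.

    #include <stdio.h>

    /* One shared slot stands for output b of primitive B, which is
       connected (equivalent) to the input of primitive C.          */
    typedef struct { double value; } Slot;

    static void primitive_B(double a, Slot *b_out)       { b_out->value = a * 2.0; }
    static void primitive_C(const Slot *b_in, double *c) { *c = b_in->value + 1.0; }

    int main(void) {
        Slot b;                 /* storage shared by producer and consumer */
        double c;
        primitive_B(3.0, &b);   /* producer writes the slot                */
        primitive_C(&b, &c);    /* consumer reads the same slot            */
        printf("%f\n", c);      /* prints 7.000000                         */
        return 0;
    }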

Since it includes all the input parameters of its lower level modules, a high level module may have an extremely large interface. To ensure the readability of a program it is necessary to hide some of these details. This is most efficiently done by enabling higher program levels to use data abstraction on the formal parameters of the lower level.

4.3 Synchronous primitives programming model

To enable a user/programmer to find out the consequences of his interaction with a programming system, it is necessary to present him with an easy-to-understand and easy-to-use model of parallel computation. This model must use a sufficient level of abstraction to hide the many details concerning the efficient implementation on different parallel architectures. In particular, it should not be necessary for the programmer to supply information about the mapping of parallelism on a specific architecture.
