
Run-time Adaptable VLIW Processors

Resources, Performance, Power Consumption, and Reliability Trade-offs


Resources, Performance, Power Consumption, and Reliability Trade-offs

DISSERTATION

for the degree of doctor
at Delft University of Technology,

by authority of the Rector Magnificus, Prof. ir. K.C.A.M. Luyben, Chairman of the Board for Doctorates,

to be defended in public
on Tuesday, 27 August 2013 at 15:00

by

Fakhar ANJAM

Master of Science in Information Technology

Pakistan Institute of Engineering and Applied Sciences (PIEAS), Islamabad


Copromotor:

Dr. ir. J.S.S.M. Wong

Composition of the doctoral committee:

Rector Magnificus, chairman
Prof. dr. K.L.M. Bertels, Technische Universiteit Delft, promotor
Dr. ir. J.S.S.M. Wong, Technische Universiteit Delft, copromotor
Prof. dr. E. Charbon, Technische Universiteit Delft
Prof. dr. L. Carro, Universidade Federal do Rio Grande do Sul, Brazil
Prof. Dr.-Ing. H. Blume, Leibniz Universität Hannover, Germany
Prof. Dr.-Ing. M. Hübner, Ruhr-Universität Bochum, Germany
Prof. dr. ir. G.N. Gaydadjiev, Chalmers University of Technology, Sweden
Prof. dr. G.J.T. Leus, Technische Universiteit Delft, reserve member

This thesis has been completed in partial fulfillment of the requirements of the Delft University of Technology (Delft, The Netherlands) for the award of the PhD degree. The research described in this thesis was supported in part by: (1) the CE Lab, Delft University of Technology, and (2) the Higher Education Commission (HEC) of Pakistan.

Published and distributed by: Fakhar Anjam Email: imfakhar@gmail.com

ISBN: 978-94-6186-191-7

Keywords: Computer Architecture, Parallel Execution, Softcore Processors, VLIW Processors, Run-time Reconfiguration, Fault Tolerance, Customization, Parametrization, FPGAs, Trade-offs

Cover page designed by Hanike (www.hanike.nl). Copyright © 2013 Fakhar Anjam

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without permission of the author.


In this dissertation, we propose to combine programmability with reconfigurability by implementing an adaptable programmable VLIW processor in reconfigurable hardware. The approach allows applications to be developed at a high level (the C language level), while at the same time, the processor organization can be adapted to the specific requirements (both static and dynamic) of different applications.

Our proposed customizable VLIW processor, called ρ-VEX, can be adapted at design-time as well as at run-time. Its instruction set architecture (ISA) is based on the VEX ISA, and a toolchain (parametrized C compiler and simulator) is publicly available from Hewlett Packard (HP) for architectural exploration and code generation. The design-time parameters include the processor's issue-width, the type of different functional units (FUs) and their latencies, the type and size of the multiported register files, the degree of pipelining, the size of the instruction and data memories, the type of interrupt and exception systems, the selection of default custom operations, and datapath sharing. If the behavior of applications is not known at design-time, or an application has different phases with distinct requirements, a fixed processor may not perform efficiently for all the applications/phases. To this end, we propose a run-time reconfigurable processor that can adapt its organization dynamically during execution. The run-time parameters include the processor's issue-width, the type and number of different FUs, and the register file size. Additionally, we propose configurable fault tolerance techniques for the ρ-VEX processor. The designer can choose to include or exclude the fault tolerance in the processor at design-time. When the fault tolerance is included, it can be made permanently enabled or enabled/disabled at run-time. All these options enable users to trade off between hardware area/resources, performance, power/energy consumption, and reliability. The processor is available as open source.


In dit proefschrift stellen we voor om programmeerbaarheid te combineren met reconfigureerbaarheid door het implementeren van een aanpasbare programmeerbare VLIW processor in herconfigureerbare hardware. De aanpak staat het ontwikkelen van toepassingen op hoog niveau (C programmeertaalniveau) toe, terwijl op hetzelfde moment de processor organisatie kan worden aangepast aan de specifieke eisen (zowel statisch als dynamisch) van verschillende toepassingen.

Onze voorgestelde aanpasbare VLIW processor, genaamd ρ-VEX, kan tijdens design-time evenals tijdens run-time aangepast worden. De instructie set architectuur (ISA) is gebaseerd op de VEX ISA en een toolchain (geparametriseerde C compiler en simulator) is publiek beschikbaar gesteld door Hewlett Packard (HP) voor architectuur exploratie en code generatie. De design-time parameters omvatten de processor issue-breedte, de aard van verschillende functionele eenheden (FU's) en hun latencies, het type en grootte van multiported register files, de mate van pipelining, de grootte van instructie en data geheugens, het type interrupt en exceptie systemen, selectie van standaard aangepaste bewerkingen, en het delen van het datapad. Indien het gedrag van applicaties niet bekend is tijdens design-time of wanneer een applicatie verschillende fases kent met verschillende eisen, kan het zijn dat een vaste processor niet efficiënt is in het uitvoeren van alle applicaties/fasen. Daartoe stellen we een run-time herconfigureerbare processor voor die zijn organisatie tijdens het berekenen dynamisch kan aanpassen. De run-time parameters omvatten de processor issue-breedte, het type en aantal verschillende FU's, en de registerbestandsgrootte. Daarnaast stellen we voor de ρ-VEX processor herconfigureerbare fouttolerantie technieken voor. De ontwerper kan kiezen voor wel of geen fouttolerantie in de processor tijdens design-time. Wanneer fouttolerantie is inbegrepen, kan deze permanent ingeschakeld worden of ingeschakeld/uitgeschakeld tijdens run-time. Al deze opties geven de gebruikers de mogelijkheid om een afweging te maken tussen hardware area/resources, prestatie, stroom/energie verbruik en betrouwbaarheid. De processor is als open-source beschikbaar.


1. All hardware and software should be reconfigurable.

2. Hardwired multiported memories are a must for the efficient implementation of parallel hardware in FPGA.

3. Software comes from heaven when you have good hardware. (Ken Olsen)

4. The distinction between VLIW and superscalar processors is vanishing.

5. Normal life starts after the PhD study.

6. A good idea means nothing by itself; a good implementation is equally important.

7. Will is more important than competence to achieve something.

8. You are not doing research when you know what you are doing.

9. "Freedom of expression" should not be considered as unlimited.

10. Without improving the primary education system in Pakistan, spending billions in higher education is of little use.

11. Tolerance is the only thing the Pakistani nation needs nowadays.

12. A good way to learn new things is to be unlucky.

These propositions are regarded as opposable and defendable, and have been approved as such by the promotor Prof. dr. K.L.M. Bertels.


1. Alle hardware en software zou herconfigureerbaar moeten zijn.

2. Hardwired multiported geheugens zijn een vereiste voor het efficiënt implementeren van parallelle hardware op FPGA.

3. Software komt van de hemel wanneer je goede hardware hebt. (Ken Olsen)

4. Het verschil tussen VLIW en superscalar processoren is aan het verdwijnen.

5. Het normale leven start na de PhD studie.

6. Een goed idee betekent opzichzelfstaand niets, een goede implementatie is even belangrijk.

7. Wil hebben is belangrijker dan competentie om iets te bereiken.

8. Je bent geen onderzoek aan het doen als je weet wat je aan het doen bent.

9. "Vrijheid van meningsuiting" moet niet als onbeperkt worden beschouwd.

10. Zonder het verbeteren van het primair onderwijs in Pakistan is het uitgeven van miljarden in hoger onderwijs van weinig nut.

11. Vandaag de dag is tolerantie het enige dat de Pakistaanse natie nodig heeft.

12. Een goede manier om nieuwe dingen te leren is om een pechvogel te zijn.

Deze stellingen worden opponeerbaar en verdedigbaar geacht en zijn als zodanig goedgekeurd door de promotor Prof. dr. K.L.M. Bertels.


Here comes the end of my formal student life. That was fun in itself. Taking this opportunity, I would like to express my gratitude to all those who contributed directly or indirectly to the work reported in this thesis.

First of all, I would like to thank my supervisor Stephan Wong, who provided me the opportunity to perform research in the Computer Engineering (CE) Lab. His guidance and consistent involvement in all phases of my PhD research project were truly remarkable. We had many brainstorming sessions and long technical meetings that helped me a lot in my work. Special thanks go to the promotor of my thesis, Koen Bertels. He always offered his help throughout my stay at the university. I am also very grateful to all the faculty members of CE who provided me with help and guidance from time to time.

The members of my PhD committee also deserve appreciation. I thank them for devoting some of their time to read my thesis, providing me with their valuable comments, and traveling to Delft for the public defense of this dissertation. Special thanks go to Luigi Carro for his discussion and collaboration throughout the research project.

The Higher Education Commission (HEC) of Pakistan partially sponsored the research reported in this thesis. I would like to thank all the staff at HEC, who were always available whenever I needed their help. I am also very grateful to my former boss Saif shb and colleagues Atif shb, Sajid shb and Yaseen shb for their encouragement and help. In the Netherlands, NUFFIC and CICAT deserve appreciation. I am very grateful to Loes, Charlene and all other staff at NUFFIC for providing their support. Franca from CICAT deserves special thanks for taking care of my visa-related and financial issues.

Appreciation goes to Roel Seedorf, Anthony, Arash, Roel Meuws, Catalin, Dimitri, Yi Lu, Thomas, Zaidi and all other colleagues at the CE Lab for the long discussions we had. I thank them for providing a friendly and research-conducive environment. Special thanks go to Motaqi for translating the propositions and summary of this thesis. Mota! You are a great person, always


the social events that are an integral part of life at the CE Lab. Thanks to the organizing members.

I was lucky to have so many Pakistanis around during my stay in Delft. With them I never felt away from home. These include Mehfooz, Hamayun, Laiq, Mazhar, Nadeem, Husnul Amin, Yahya, Hamid, Usama, Umer Ijaz, Saleem, Zubair, Tariq, Hisham, Rafi, Sharif, Umer Naeem, Adeel, Hanan, Fahim, Shah, Rajab, Tabish and all others whose names are not mentioned. Special thanks go to Cheema, Atif, Bilal, Dev, Faisal Kareem, Sandilo, Faisal and Seyab for helping me out in my early days. We had very good company living in Poptahof, gossiping and playing cards whole nights. Great appreciation goes to Imran for his delicious cooking. The social events and get-togethers that we had will always be remembered. Special thanks go to all the friends who arranged to play cricket every weekend.

Back in Pakistan, there were many people who encouraged me and prayed for my success. Thank you all. My family deserves great appreciation. Although my mother has been very sick in my absence, she always prayed for me. I hope she gets well soon. My brothers and sisters, cousins and all other family members have supported me well in their own capacity. Whenever I spoke to them on the phone, I felt more energetic. Uneeza, Abru, Manahil, Dua, Fateh, Momin, Aman and Mateen! You kept me alive whenever I spoke to you and visited you. Bushra deserves very special recognition for being very supportive. After she came into my life, fortune has favored me. She always stood by me, and that is why I have never let her feel alone. I thank her for her understanding and cooperation whenever I got busy and had little time for her. We have had a very memorable time together. My love and prayers are always with her. I will not forget to mention my father here. He is the source of all my inspirations. It is due to his vision that I have completed my PhD thesis today. Finally, I offer my thanks to Dutch society in general. People are very friendly and supportive. I really enjoyed my stay in the free and open environment. Going on long biking trips was fun. Staying in Delft was a fantastic time. I will always remember it.

Fakhar Anjam
Delft, Eid-ul-Fitr, August 08, 2013


Summary . . . i

Propositions . . . . iii

Acknowledgments . . . . v

Table of Contents . . . viii

List of Figures . . . . xii

List of Tables . . . xvi

List of Acronyms and Symbols . . . xviii

1 Introduction . . . . 1

1.1 Background . . . 2

1.1.1 General-purpose and Embedded Processors . . . 2

1.1.2 Processor Design Architectures . . . 2

1.1.3 Different Forms of Processor Parallelism . . . 4

1.1.4 Architectures to Exploit ILP . . . 5

1.1.5 Programmability and Reconfigurability Together . . . 6

1.2 Scope . . . 8

1.3 Open Questions . . . 10

1.4 Methodology . . . 12

1.5 Dissertation Organization . . . 13

2 Background . . . . 15

2.1 Adaptable VLIW Processor . . . 16

2.1.1 Motivations . . . 16

2.1.2 The VEX System . . . 19

2.1.3 The Initial Design of ρ-VEX VLIW Processor . . . 20

2.2 Related Work . . . 23

2.2.1 Configurable RISC Softcore Processors . . . 23


2.2.4 Our Proposal . . . 31

2.3 Summary . . . 32

3 Design-time Configurable Processor . . . 33

3.1 Design-time Configurable ρ-VEX VLIW Processor . . . 34

3.2 Multiported Register Files . . . 37

3.2.1 Register Files with FPGA’s Configurable Resources . . 38

3.2.2 Register Files with FPGA’s Embedded BRAMs . . . . 40

3.2.3 Evaluation of the Register File Designs . . . 44

3.3 Support for Interruptability . . . 45

3.3.1 Interrupt Handling System . . . 46

3.3.2 Implementation Styles for the Interrupt Controller . . . 48

3.3.3 Interrupt Latency and Response Time . . . 49

3.3.4 Exceptions Handling System . . . 51

3.3.5 Implementation Results . . . 52

3.4 Instruction Encoding Scheme . . . 53

3.4.1 Design of the New Encoding Scheme . . . 53

3.4.2 Borrowing Scheme and Instruction Mapping . . . 54

3.5 ISA Extension Support . . . 55

3.5.1 Binary Code Generation for Custom Operations . . . . 56

3.5.2 Methodology to Extend the ISA . . . 58

3.5.3 Design-time Selectable Custom Operations . . . 59

3.6 Datapath Sharing . . . 62

3.6.1 Dual-processor System . . . 62

3.6.2 Datapath-shared Dual-processor System . . . 63

3.6.3 Implementation Results . . . 64

3.6.4 Related Work . . . 65

3.7 Summary . . . 66

4 Run-time Reconfigurable Processor . . . . 67

4.1 Run-time Reconfigurable/Adaptable Processor . . . 68

4.1.1 Reconfiguration Flows . . . 69

4.1.2 Design of the Run-time Reconfigurable Processors . . 70

4.1.3 Memory System . . . 75

4.1.4 Mechanism for Issue-width Adjustment . . . 77

4.1.5 Implementation Results . . . 77


4.2.1 Case Study for 4-issue ρ-VEX Processor . . . 81

4.3 Run-time Task Migration . . . 82

4.3.1 Design of the Task Migration Scheme . . . 83

4.3.2 Implementation Results . . . 85

4.3.3 Related Work . . . 87

4.4 Simultaneous Reconfiguration of Issue-width and Instruction Cache . . . 88

4.4.1 Related Work . . . 89

4.4.2 Characteristics of the Reconfigurable Processor . . . . 90

4.4.3 Characteristics of the Reconfigurable Instruction Cache . . . 91

4.4.4 Energy Estimation . . . 92

4.5 Summary . . . 93

5 Configurable Fault Tolerance . . . . 95

5.1 Introduction and Motivations . . . 96

5.2 Related Work . . . 97

5.3 The Base ρ-VEX Processor . . . 98

5.4 The Fault-Tolerant ρ-VEX Processor . . . 98

5.4.1 Instruction Memory . . . 99

5.4.2 Data Memory . . . 99

5.4.3 GR Register File . . . 100

5.4.4 TMR Approach for all Flip-Flops . . . 102

5.4.5 Working of the Configurable Fault-Tolerant System . . 102

5.4.6 Fault Coverage and Test Methodology . . . 104

5.5 Implementation Results and Discussion . . . 106

5.5.1 Hardware Resources/Area and Critical Path Delay . . 106

5.5.2 Dynamic Power Consumption . . . 109

5.6 Summary . . . 111

6 Results and Analysis . . . 113

6.1 2-4-issue Processor . . . 114

6.2 2-4-8-issue Processor . . . 116

6.2.1 Dynamic Power Consumption . . . 118

6.3 Power Consumption for Stand-alone ρ-VEX Processors . . . 119

6.4 Run-time Task Migration Support . . . 120

6.4.1 Dynamic Power Consumption . . . 123


6.5.1 Experimental Setup and Benchmark Applications . . . 124

6.5.2 Results and Analysis . . . 125

6.6 Multiport Data Memory/Cache Analysis . . . 134

6.6.1 Local Data Memory . . . 134

6.6.2 Data Cache . . . 136

6.7 Summary . . . 137

7 Conclusions . . . 139

7.1 Summary . . . 139

7.2 Main Contributions . . . 143

7.3 Future Research Directions . . . 144

Bibliography . . . 147

List of Publications . . . 160

Curriculum Vitae . . . 163


2.1 4-issue non-pipelined ρ-VEX VLIW Processor. . . . 21

3.1 Methodology to generate an instance of the ρ-VEX processor. . . . 36

3.2 Hardware results for different versions of the 64×32-bit GR register files with different ports. In addition to the mentioned resources, version 3 of the 2W4R, 4W8R, and 8W16R register files also utilize 8, 32, and 128 RAMB18s, respectively. Similarly, version 4 of the 2W4R, 4W8R, and 8W16R register files utilize 2, 8, and 32 RAMB18s, respectively. . . . 39

3.3 Implementation results for multi-issue pipelined ρ-VEX processors with different versions of the GR register files. In addition to the mentioned resources, version 3 of the 2-issue, 4-issue, and 8-issue processors also utilize 8, 32, and 128 RAMB18s, respectively. Similarly, version 4 of the 2-issue, 4-issue, and 8-issue processors utilize 2, 8, and 32 RAMB18s, respectively. The 2-issue, 4-issue, and 8-issue processors also utilize 4, 4, and 8 DSP modules, respectively. . . . 41

3.4 A single-pumped 4W8R ports BRAM-based register file. . . . 42

3.5 A 4-issue ρ-VEX processor with the interrupter. . . . 47

3.6 Dataflow and FSM in the interrupter. . . . 48

3.7 Implementation results for the 4 types of interrupt system with 4-issue ρ-VEX processor (3 with ρ-VEX type 'a' and 1 with type 'b') for a Virtex-6 FPGA. Each ρ-VEX processor (with/without interrupts) also utilizes 4 DSP48E1 modules. . . . 52

3.8 Prototypes for the _asm() intrinsics [1]. . . . 58

3.9 The _asm() usage example for implementing a division (DIV) function and its VEX assembly code for a 2-issue ρ-VEX processor. . . . 58

... Table 3.7 for a 4-issue ρ-VEX processor with 4 ALU, 2 MUL, and 1 MEM units for the Virtex-6 FPGA. The processor also requires 32 RAMB18s and 4 DSP48E1 modules. . . . 61

3.12 VLIW dual-processor systems. . . . 63

3.13 Implementation results (slices) for the base 4-issue non-pipelined ρ-VEX processor's modules for the Virtex-II Pro FPGA. The complete processor requires 14561 slices and 14 MULT18X18s. The register file is 64×32-bit. . . . 64

3.14 Implementation results for the dual-processor system (shared and non-shared) for a Virtex-II Pro FPGA. Apart from the slices, the datapath-shared and non-shared dual-processor systems also require 14 and 28 MULT18X18s, respectively. The BRAM-based design also utilizes 64 RAMB18s. . . . 65

4.1 Execution cycles for matrix multiplication, SHA, and Qsort applications. . . . 69

4.2 Execution units in different issue-slots. . . . 70

4.3 General view of the run-time reconfigurable issue-slots processor. . . . 71

4.4 256×32-bit 8W16R ports register file for the 2-4-8-issue processor. . . . 74

4.5 Instruction and data memories for the 2-4-8-issue processor. . . . 76

4.6 Design and hardware resource utilization for the 2-4-issue reconfigurable processor for the Xilinx Virtex-II Pro XC2VP30-7FF896 FPGA. . . . 78

4.7 Virtex-II Pro FPGA's slice utilization for 64×32-bit 4W8R ports register file and 4-issue non-pipelined ρ-VEX processor. . . . 81

4.8 Design and hardware resource utilization for the dynamically reconfigurable ρ-VEX processor. Apart from the slices, the static region also utilizes 14 MULT18X18s, and some BRAMs for instruction and data memories. . . . 82

4.9 A task migration example. . . . 83

4.11 Mechanism for task migration in the 2-4-8-issue adaptable processor. . . . 86

4.12 Instructions per cycle (IPC) for different applications [2] [3]. . . . 89

5.1 Two approaches used for TMR. . . . 103

5.2 Implementation results for the ρ-VEX processors for the Xilinx Virtex-6 FPGA. In addition to the mentioned resources, the 2-issue, 4-issue, and 8-issue cores utilize 4, 4, and 8 DSP48E1 modules, 4, 16, and 64 RAMB36s (GR register file version 3), and 1, 4, and 32 RAMB36s (GR register file version 4), respectively. . . . 107

5.3 Synthesis results for the ρ-VEX processors for 90 nm technology. . . . 108

5.4 Dynamic power consumption per MHz for the ρ-VEX processors. . . . 110

5.5 Percent dynamic power reduction for the D4 designs compared to D2. . . . 111

6.1 Speedup for the 2-4-issue processor normalized to 4-issue core. . . . 115

6.2 Speedup for the 2-4-8-issue processor normalized to 2-issue core. . . . 117

6.3 Execution cycles normalized to the four 2-issue cores for the Rijndael encryption/decryption algorithms. . . . 118

6.4 Dynamic power consumption for the 2-4-8-issue processor. . . . 119

6.5 Dynamic power consumption for the stand-alone ρ-VEX processor with different issue-widths and different types of register files. . . . 120

6.6 Execution cycles normalized to a 2-issue core with 1 load/store (LS) unit. . . . 122

6.7 Dynamic power consumption for the 2-4-8-issue processor with task migration support. . . . 123

... issue and 1W8KB16B I-cache. . . . 126

6.9 Impact of simultaneous reconfiguration of issue-width and I-cache; execution cycles, energy, and EDP for 2-issue, 4-issue, and 8-issue cores with varying I-cache normalized to own issue-width with the base I-cache in each set. . . . 127

6.10 I-cache configurations for which execution cycles remain the same but energy consumption and EDP vary. . . . 128

6.11 Percentage variation in energy, execution cycles, and EDP for 4-issue core compared to 2-issue core with different I-caches. . . . 129

6.12 Percentage variation in energy, cycles, and EDP for 4-issue and 8-issue cores compared to 2-issue core with different I-caches for the Rijndael encode. . . . 130

6.13 Execution cycles, energy consumption, and EDP for the 4-issue and 8-issue cores normalized to 2-issue core (all with their best I-caches). . . . 132

6.14 2R2W ports data memory configuration implemented with BRAMs. . . . 135

6.15 Number of BRAMs (Xilinx RAMB18s) required to implement 1-way data cache memory (data store + tag store) with multiple read/write ports. . . . 136


1.1 Relative characteristics of ASICs, RISC (single-issue), CISC, VLIW, and Superscalar processors. . . . 3

1.2 Differences between superscalar and VLIW processors [4]. . . . 6

3.1 Implementation types for GR register files. . . . 38

3.2 Implementation results for 64×32-bit 4W8R ports register file with register renaming and 4-issue ρ-VEX VLIW processor. . . . 44

3.3 Implementation version, interrupt response time, and the worst-case interrupt latencies for the four types of interrupt system for the ρ-VEX processor. . . . 50

3.4 The old and the new encoding schemes. IMM is flag for immediate types. Short IMM and long IMM are the values of the short and long immediates, respectively. S_F means Syllable_Follow custom operation. . . . 54

3.5 Position of FUs, borrowing scheme for long IMM, and instruction mapping for the 2-issue and 4-issue ρ-VEX processors. Here, AU, MU, MM, CT, S, and L mean ALU, MUL, MEM, CTRL, short, and long, respectively. . . . 56

3.6 Positions of FUs, borrowing scheme for long IMM, and instruction mapping for the 8-issue and 2-4-8-issue ρ-VEX processors. Here, AU, MU, MM, CT, S, and L mean ALU, MUL, MEM, CTRL, short, and long, respectively. . . . 57

3.7 List of design-time available custom operations. . . . 61

4.1 Distribution of registers and ports for the 256×32-bit 8W16R ports GR register file for the 2-4-8-issue processor. . . . 75

4.2 Implementation results for the 2-4-8-issue processor for the Virtex-6 XC6VLX240T-1FF1156 FPGA. . . . 79

... XC6VLX240T-1FF1156 FPGA. . . . 87

4.4 Typical instruction cache parameters for some famous VLIW processors. . . . 89

5.1 Implementation types for GR register files. . . . 101

6.1 Number of BRAMs required for M Kbytes of data memory. . . . 135


ALU Arithmetic Logic Unit
ASIC Application Specific Integrated Circuit
BRAM Block Random Access Memory
DLP Data Level Parallelism
FF Flip-Flop
FPGA Field Programmable Gate Array
FSM Finite State Machine
FU Functional Unit
GPP General Purpose Processor
HDL Hardware Description Language
ILP Instruction Level Parallelism
IPC Instructions Per Cycle
ISA Instruction Set Architecture
ISE Integrated Software Environment
ISR Interrupt Service Routine
LUT Look-Up Table
MT Multi-threading
PE Processing Element
RFI Return From Interrupt
RISC Reduced Instruction Set Computer
SEU Single Event Upset
SMT Simultaneous Multi-threading
SIMD Single Instruction Multiple Data
TLP Task Level Parallelism
TMR Triple Modular Redundancy
UART Universal Asynchronous Receiver Transmitter
VEX VLIW Example
VHDL VHSIC Hardware Description Language
VHSIC Very High Speed Integrated Circuits
VLIW Very Long Instruction Word


1 Introduction

In the current-day world, fixed processors (which cannot change their hardware functionality after fabrication) are the mainstream and are made programmable in order to adapt to a large number of applications. As a consequence, they perform adequately over a wide range of applications, but not efficiently in terms of performance or energy consumption. Application-specific integrated circuits (ASICs) are designed according to the specific requirements of an application; therefore, they are the most efficient implementation and consume very low power. The major problem with an ASIC is that it cannot be adapted to a different application and has a longer and quite expensive development cycle. Reconfigurable hardware, such as field-programmable gate arrays (FPGAs), can modify its hardware structure. Hence, efficient systems can be implemented in FPGAs due to the flexibility they offer. In general, FPGAs are programmed using hardware description languages (HDLs), which require everyday programmers to have intricate knowledge of hardware. Even the use of language translation tools may require rewriting of code, leading to longer development times. Now, given reconfigurable hardware, can we combine the flexibility of programmable processors with the reconfigurability of FPGAs? Can we design reconfigurable programmable processors that can adapt their functionality to the applications? Can we make designs that can even adapt themselves during run-time? In this dissertation, we try to answer such questions.

The remainder of the chapter is organized as follows. Section 1.1 presents some basic concepts required to understand the questions raised in the dissertation. The scope of the dissertation is discussed in Section 1.2. Some open research questions are formulated in Section 1.3, which are answered later in the dissertation. Section 1.4 presents the steps that are followed in order to answer the research questions raised in the chapter. Finally, Section 1.5 provides the organization and structure of the dissertation.


1.1 Background

In this section, we provide some background knowledge on programmable processors. We present different processor design architectures, describe different forms of processor parallelism, and then discuss processor architectures that exploit instruction-level parallelism (ILP). Later on, we highlight the benefits of combining programmability and reconfigurability in a single hardware.

1.1.1 General-purpose and Embedded Processors

General-purpose processors (GPPs) are designed without considering the requirements of a specific application or task; rather, they are designed to perform adequately over a large number of application domains. Their instruction set is general-purpose rather than specialized for a particular task; therefore, they are not very efficient in terms of performance, power, cost, area, etc., across some or all application domains. In addition, they have support for many different kinds of peripherals. Different software can be run on them, and hence they can be used for different purposes. Mostly, they can be found in today's PCs, tablets, and servers.

Embedded systems include a number of components, where each smaller component provides a service to the larger embedding system. An embedded processor (EP) could be one of the components of the embedded system. EPs are utilized in a large number of chips found in, for example, cellular phones, TVs, automotives, biomedical equipment, game consoles, microwaves, and many other consumer electronic appliances. Generally, these processors are smaller in size and are customized for a particular application or a domain of applications. They can perform the specific tasks more efficiently compared to a general-purpose processor. The requirements for embedded processing, which are equally important for general-purpose processing, include performance, power consumption, area, cost, cooling, reliability, dependability, etc.

1.1.2 Processor Design Architectures

A processor (GPP or EP) can be designed according to different architectures/philosophies such as the reduced instruction set computer (RISC), complex instruction set computer (CISC), very long instruction word (VLIW), or superscalar. A RISC processor has a simple and fundamental operation set that operates on simple data kept in registers. The only memory-related operations are load and store operations. All the operations can be executed in a single clock cycle. Code size is large and the compiler has more work to do. Normally, RISC processors can issue a single operation every clock cycle. CISC uses complex operations in addition to the simple ones. A complex operation could be a new operation or a combination of a few fundamental operations. A string move operation, in which a stream of characters stored at one location in memory is moved to another location, is an example of a CISC operation. The execution of an operation may take more than one clock cycle. The assembly code resembles the high-level code. The compiler has less work to do and the code size is smaller compared to a RISC processor. The Intel x86 is an example of the CISC architecture.

VLIW and superscalar processors include multiple parallel execution units to exploit instruction level parallelism (ILP). Both of these processors can issue multiple operations in a single clock cycle to increase the performance. The major difference between a VLIW and a superscalar processor is that a VLIW processor relies on a compiler to exploit ILP, while a superscalar processor relies on run-time hardware to exploit ILP. Generally, both of these processors have a RISC-like instruction set, but superscalar processors with a complex instruction set have also been developed; examples include the in-order superscalar original Pentium and the out-of-order superscalar Cyrix 6x86. Table 1.1 presents some characteristics of ASICs, RISC (single-issue), CISC, VLIW, and superscalar processors. Each design philosophy has its own advantages and disadvantages.

Table 1.1: Relative characteristics of ASICs, RISC (single-issue), CISC, VLIW, and Superscalar processors.

Type                  ASICs          RISC           CISC           VLIW        Superscalar
Hardware Complexity   Medium/High    Medium         Higher         High        Highest
Hardware Area         Small/Medium   Medium         High           High        Highest
Power Consumption     Small          Medium         Medium/High    High        Highest
Performance           Highest        Small/Medium   Medium/High    High        High
Compiler Complexity   No Compiler    Medium         Small/Medium   Highest     Medium/High
Programmable          No             Yes            Yes            Yes         Yes
Code-compatible       No             Yes            Yes            No/Small    Yes


1.1.3 Different Forms of Processor Parallelism

In the domain of processors, parallelism refers to the opportunities in a program to find independent operations and perform them separately in parallel instead of performing them sequentially. There are different forms of processor parallelism, which can be thought of as independent of each other. In this section, we briefly discuss the most widely used among them.

ILP: Instruction level parallelism refers to the existence of independent operations in a program which can be executed together in a single clock cycle. Finding independent operations in a program or a stream of operations is the job of the compiler in the case of a VLIW processor, or of run-time control hardware in the case of a superscalar processor. ILP can be combined with any other type of parallelism to further enhance the performance.
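
To make this concrete, the small C fragment below (an illustration of ours, not an example from the thesis) contains four operations with no data dependences between them; a VLIW compiler can pack all four into one long instruction for a 4-issue core, while a single-issue RISC processor needs at least four cycles for them.

    /* Illustrative only: four mutually independent operations.
     * A 4-issue VLIW compiler can schedule them into a single long
     * instruction (one cycle); the final adds depend on these results
     * and therefore go into later instructions. */
    int ilp_example(int a, int b, int c, int d)
    {
        int w = a + b;   /* ALU operation 1 */
        int x = a - c;   /* ALU operation 2 */
        int y = b & d;   /* ALU operation 3 */
        int z = c | d;   /* ALU operation 4 */
        return (w + x) + (y + z);
    }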

DLP: Data Level Parallelism refers to distributing the data across different parallel computing nodes and processing it in parallel. In this case, multiple processing nodes each receive a part of the total data and all execute the same operation on this data. The individual results are then finally combined into a single result. Single instruction multiple data (SIMD) is a form of DLP. SIMD operations operate on the standard registers, but treat them as smaller sub-registers. For example, four 8-bit operations can be performed in a single 32-bit operation in 1 clock cycle, which would otherwise require 4 clock cycles.
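
The sub-register idea can be sketched in portable C. The routine below (our illustration, not code from the thesis or the VEX toolchain) packs four 8-bit lanes into a 32-bit word and adds them lane-wise with a handful of 32-bit operations, the well-known software-SIMD (SWAR) trick; a real SIMD unit performs the same four additions as one operation.

    #include <stdint.h>

    /* Add four packed 8-bit lanes without letting a carry spill from one
     * lane into the next: add the low 7 bits of every lane, then restore
     * each lane's most significant bit separately. */
    static uint32_t add_4x8(uint32_t a, uint32_t b)
    {
        uint32_t low7 = (a & 0x7F7F7F7Fu) + (b & 0x7F7F7F7Fu);
        uint32_t msb  = (a ^ b) & 0x80808080u;
        return low7 ^ msb;
    }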

TLP: Task Level Parallelism refers to executing multiple threads of an application on the different processors of a multiprocessor system. A multiprocessor system consists of multiple similar (homogeneous) or different (heterogeneous) processing elements. A program is split into multiple, relatively independent small sub-programs which are executed at the same time on different processors to achieve parallelism. The individual processors may or may not be able to exploit ILP. Programming and compiling for multiprocessors are becoming very complex due to the large number of cores available in today's multiprocessor systems.

MT: Multi-threading refers to a technique where different programs or parts of a program (called threads) are executed one by one on a single hardware unit to make progress on multiple programs or parts of programs. Threads are very light-weight (in terms of state) and pose less serious problems when they are switched. Different policies can be implemented for sharing the single hardware, such as round-robin, priority-based, FIFO-based, etc. The shared hardware may also be able to exploit ILP in the individual threads.
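
As a small illustration of such a sharing policy (a sketch of ours, not a description of any particular hardware), a round-robin selector simply hands the shared pipeline to the next ready thread in circular order:

    #define N_THREADS 4
    static int last_thread = 0;

    /* Return the index of the next ready thread in round-robin order,
     * or -1 when no thread is ready. ready[i] != 0 means thread i can run. */
    int next_thread(const int ready[N_THREADS])
    {
        for (int i = 1; i <= N_THREADS; i++) {
            int cand = (last_thread + i) % N_THREADS;
            if (ready[cand]) {
                last_thread = cand;
                return cand;
            }
        }
        return -1;
    }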


SMT: Simultaneous Multi-threading is a special type of multi-threading available in superscalar processors. A superscalar processor which does not have support for SMT can issue multiple instructions from a single thread every clock cycle. In the case of SMT, the superscalar processor can issue instructions from multiple threads every clock cycle, thus exploiting parallelism available across multiple threads. An example of a processor system which utilizes the SMT technique is the graphics processing unit (GPU).

1.1.4 Architectures to Exploit ILP

VLIW and superscalar processors can be used to increase the performance beyond normal RISC architectures. While RISC architectures only take advantage of temporal parallelism (by using pipelining), VLIW and superscalar architectures can additionally take advantage of spatial parallelism by using multiple functional units (FUs) to execute several operations simultaneously. ILP is determined by considering data dependence in a program and resource availability in hardware. In a superscalar processor, special control hardware determines the data dependence and resource availability at run-time and then enables the dynamic scheduling of operations. On the other hand, for a VLIW processor, a compiler determines the data dependence and resource availability and statically schedules the operations. In a superscalar processor, the number of issued operations is determined dynamically by the hardware, while the number of issued operations in a VLIW processor is determined statically by the compiler. The window of execution is limited in a superscalar processor, which limits the capacity to detect the potentially parallel operations. In the case of a VLIW processor, the problem of a limited execution window does not exist. The compiler of a VLIW processor can potentially analyze the whole program in order to detect parallel operations, hence increasing the opportunities for finding parallelism. Compared to a VLIW processor, the hardware of a superscalar processor is very complex, larger in size, consumes more power, requires larger design efforts, and hence becomes costly. According to [5], for the same technology and issue-width, the scheduling logic of a superscalar processor alone consumes more power than the entire VLIW processor. That is why a superscalar processor is less attractive for small embedded applications which require small and energy-efficient devices. The hardware of a VLIW processor is relatively simple, and can be easily and quickly adapted from product to product at the expense of a complex compiler.

VLIW processors are designed such that the hardware details are more exposed to the compiler and ILP is made visible in the machine-level program. ILP cannot be seen in the program that is offered to a superscalar processor; rather, the hardware can arrange parallelism at run-time even though it is not exposed in the code itself. One of the advantages of a superscalar processor is that code compiled for a single-issue scalar RISC processor can be executed on a superscalar processor with the same instruction set architecture (ISA). Hence, different superscalar implementations of the same ISA are object-code compatible. That is why superscalar processors are mostly utilized for general-purpose desktops and servers. Because the ILP is exposed in the program itself, to execute the same application on a VLIW processor, the original source code has to be recompiled for a new implementation/organization of the processor with the same ISA. Table 1.2 presents the major differences between a superscalar and a VLIW processor as described in [4].

1.1.5 Programmability and Reconfigurability Together

ASICs are designed to match exactly the requirements of the target applications. They have the highest level of performance and consume very low power. When an application changes, for example, when a new standard or protocol appears or certain features need to be enhanced, an ASIC has to be redesigned for the new application. Normally, the development cycle is very long. A few tape-outs are required in order to fully test the complete application and all its requirements, thereby increasing the development time and cost.

Table 1.2: Differences between superscalar and VLIW processors [4].

Type                    Superscalar                            VLIW
Instruction Stream      Instructions are issued from a         Instructions are issued from a
                        sequential stream of scalar            sequential stream of multiple
                        operations                             operations
Instruction Issue       Issued instructions are dynamically    Issued instructions are statically
and Scheduling          scheduled by the hardware              scheduled by the compiler
Issue Width             The hardware determines the number     The compiler determines the number
                        of issued instructions dynamically     of issued instructions statically
Instruction Ordering    Dynamic scheduling allows in-order     Static scheduling allows only
                        and out-of-order issue                 in-order issue


Programmability is an important feature and it enhances the productivity of a processing element. It is also referred to as flexibility, i.e., how flexible a processing element is in adapting to a new application. Processors, whether general-purpose or embedded, are made programmable in order to provide maximum flexibility. A processor is designed with a basic instruction set, which it needs to support in hardware. Mostly, programmable processors are made fixed and cannot change their organizations after fabrication. A high-level compiler translates an application written in a high-level language (such as C) to the machine language of a processor. Hence, when an application changes, it is only a matter of compiling the new application and the hardware remains the same. This avoids the lengthy development cycles and high costs. The major deficiencies of programmable processors include lower performance and higher power consumption compared to a dedicated ASIC.

FPGAs provide design-time as well as run-time configurability. They need to be programmed in HDLs such as VHDL or Verilog. Any kind of digital processing system can be quickly implemented with FPGAs. Initially, FPGAs were small in area/size, slow in speed, and mostly used for prototyping. With the advancement in technology, FPGAs have improved both in area and speed and have become very cheap. Modern FPGAs provide mechanisms to dynamically reconfigure some portions while others are still operational. Compared to ASICs, FPGA-based designs require a very short development time, hence minimizing the overall cost. Unlike ASICs, the development of FPGA-based designs can be immediately started, quickly implemented, and shipped to the users. They can be updated in the field by downloading a new bitstream. Feedback from the early design shipments can be used to optimize the final product.

FPGA development requires the knowledge of digital circuits and somewhat low-level HDLs. Most high-level language developers/programmers do not have knowledge of hardware and HDLs. Hence, it is difficult for these developers to design for FPGAs. Nowadays, different language conversion tools are available which convert programs written in a subset of a higher-level language to HDLs. For example, Handel-C [6] is a subset of the C language, and the Celoxica DK design tools [7] can convert a program written in Handel-C to a VHDL description, which can then be synthesized for an FPGA or ASIC. The problem with Handel-C-type languages is that they are not exactly the same as their higher-level language counterparts. Hence, programs written in a high-level language first need to be converted manually to these languages, which increases the development time and cost. Additionally, these commercial tools are very costly.


Reconfigurability can also be used in conjunction with programmability. A programmable processor (e.g., a VLIW processor) can be implemented in an FPGA and made reconfigurable. VLIW processors have a simple hardware design, consume low power, and can provide high performance. Different parameters of the processor, such as the issue-width, the number and type of execution units, the type and size of the register file, the degree of pipelining, the size of instruction and data memories, cache parameters, fault tolerance, peripherals implementation, etc., can be made configurable and selectable at design-time. Hence, a processor optimized in terms of performance, area, power/energy consumption, and reliability can be quickly implemented for each application. Additionally, the processor can also be made run-time reconfigurable, where, after the implementation in hardware, certain parameters of the processor can be configured in order to target performance vs. power consumption trade-offs.
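
To make the notion of design-time parameters concrete, the record below groups the kind of options just listed. The field names and example values are our own illustrative assumptions; the actual ρ-VEX design exposes comparable options as VHDL generics and packages rather than as a C structure.

    /* Hypothetical design-time configuration of a VLIW softcore. */
    struct vliw_config {
        int issue_width;      /* number of issue slots, e.g. 2, 4, or 8    */
        int num_alu;          /* arithmetic/logic units                    */
        int num_mul;          /* multiplier units                          */
        int num_mem;          /* load/store units                          */
        int regfile_words;    /* general-purpose registers, e.g. 64x32-bit */
        int pipeline_stages;  /* degree of pipelining                      */
        int imem_kbytes;      /* instruction memory size                   */
        int dmem_kbytes;      /* data memory size                          */
        int with_interrupts;  /* include interrupt/exception support       */
        int with_fault_tol;   /* include configurable fault tolerance      */
    };

    /* Two example instances: a small low-power core and a wide core. */
    static const struct vliw_config small_core = { 2, 2, 1, 1, 64, 4, 32, 32, 1, 0 };
    static const struct vliw_config wide_core  = { 8, 8, 4, 2, 64, 4, 64, 64, 1, 1 };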

1.2 Scope

We foresee that combining programmability with reconfigurability by implementing a reconfigurable programmable VLIW processor in an FPGA will have several advantages, such as high design flexibility and rapid application development. This approach allows applications to be developed in a high-level language, such as C, while at the same time, the processor organization can be adapted to the specific requirements of different applications both at design-time and at run-time. This dissertation proposes a scheme to combine programmability and reconfigurability, which can be more precisely stated as:

We investigate an approach in (but not limited to) embedded processor design that combines programmability and reconfigurability by implementing a programmable processor on reconfigurable hardware, where the processor can reconfigure its organization for performance, area, power/energy consumption, and reliability trade-offs.

Consequently, our approach will distinguish itself from other approaches by the following points:

• reconfigurable programmable VLIW processor: In order to merge programmability with reconfigurability, we propose a programmable VLIW processor that can be configured/tuned at design-time and/or at run-time. Statically-scheduled VLIW processors offer improved performance, reduced area footprint, and reduced power consumption compared to a superscalar processor.

• parametrized design and toolchain: The design of the proposed VLIW processor is very simple, made parametrized, and can be easily adapted for different applications. The parametrization of the design eliminates the lengthy manual development cycles or the costly C-to-VHDL tools. The availability of the free parametrized compiler-simulator toolchain [1] provides quick design space exploration and code generation.

• design-time and run-time configurability: With the proposed scheme, highly-optimized implementations can be generated for individual applications. Additionally, processors can be implemented which can adapt themselves at run-time for performance vs. power consumption trade-offs for different applications or different parts of an application.

• use as stand-alone processor or co-processor: The VLIW processor can be used as a stand-alone processor or can be coupled as a co-processor with another processing module (e.g., as in the MOLEN paradigm [8]) for off-loading compute-intensive kernels.

• configurable fault tolerance: In order to mitigate single event upset (SEU) errors, a configurable level of fault tolerance can be implemented. Fault tolerance can be included or excluded at design-time, and enabled or disabled at run-time.

The following assumptions further define the scope of the research described in this dissertation:

• We mainly focus on hardware design and its optimization for performance, hardware area, power/energy consumption, and reliability.

• Both the development toolchain and the processor design are made parametrized. The parametrized compiler can generate optimized code for our configurable VLIW processor. In this thesis, we only consider certain defined values for the different types of parameters.

• We consider FPGAs as the reconfigurable hardware in this thesis. In some cases (Chapter 5 and Chapter 6), we also present implementation results for a standard ASIC technology to show trends in power/energy consumption and hardware area.

• Support for partial reconfiguration is available in some modern FPGAs. We expect that advances in technology will further simplify partially reconfigurable designs and reduce the reconfiguration times. Furthermore, the proposed design scheme does not necessarily depend on partial reconfiguration. Run-time reconfiguration can also be achieved with virtual reconfiguration schemes, i.e., by re-arranging (turning ON/OFF or multiplexing) the available resources at run-time.

• Custom or user-defined operations can be added to the hardware design and the compiler can generate binary code for them. Currently, the hardware design for user-defined operations has to be developed manually.

1.3 Open Questions

In this thesis, we present one possible approach to merge programmability with reconfigurability. The approach provides the opportunity to trade off between performance, area, power/energy consumption, and reliability for different applications, and hence, optimized solutions can be generated. For a successful merging, the following open questions have to be addressed:

• Can FPGAs be programmed without knowing HDLs?

As mentioned earlier, FPGA development requires the knowledge of digital circuits and HDLs. C-to-VHDL tools can be utilized to convert programs written in C to a VHDL description, which can then be implemented in FPGAs. The problems with these tools are that they are mostly commercial, costly, and not very efficient. In most cases, code re-writing is needed when utilizing these tools, which again restricts their usability. Providing a parametrized/customizable design, where changing certain parameters results in different implementations, is one way of avoiding the need for users to learn HDLs. In this thesis, we will investigate how efficient FPGA designs (programmable processors) can be implemented without knowing much about hardware design and HDLs.

• Can we design flexible and reconfigurable processors which can adapt their functionality to the requirements of applications?

Most of the available embedded programmable processors are fixed in implementation and cannot change their hardware after fabrication. Many different applications exist which require different characteristics of the processing elements for efficient execution. A single fixed implementation cannot perform well for all applications across different dimensions such as performance, power/energy consumption, area, code size, etc. In this thesis, we will investigate how we can design flexible processors which can be easily adapted to match the requirements of different applications.

• Can we make these designs dynamic so that they can adapt themselves during run-time?

With design-time configurability, optimized instance-specific implementations can be generated. However, when the number of applications to be executed is large or an application consists of several sub-applications, generating, implementing, and maintaining a large number of hardware configurations, each tuned to a particular application, becomes difficult or even impossible. In this thesis, we will investigate how we can create hardware designs that provide sufficient performance and reduced power/energy consumption for a large number of applications by reconfiguring their organizations at run-time to match the requirements of the applications.
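
One way to picture such run-time adaptation is a small control loop that watches how much ILP the current program phase actually achieves and requests a wider or narrower organization accordingly. The sketch below is purely conceptual; the performance counter and the reconfiguration hook are assumed interfaces, not part of the hardware described in this thesis.

    /* Conceptual phase monitor (all interfaces are assumed). */
    extern double read_ipc(void);             /* IPC over the last interval     */
    extern void   request_issue_width(int w); /* ask the fabric for a new width */

    void adapt_issue_width(int current_width)
    {
        double ipc = read_ipc();

        /* Saturating the current width suggests more ILP is available;
         * using well under half of it suggests a narrower core is enough. */
        if (ipc > 0.8 * current_width && current_width < 8)
            request_issue_width(current_width * 2);
        else if (ipc < 0.4 * current_width && current_width > 2)
            request_issue_width(current_width / 2);
    }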

• Can we develop simple techniques for core-morphing and run-time code migration among different cores?

Multi-core systems have multiple cores which can be used in different configurations. To exploit thread level parallelism, multiple threads of an application or multiple independent applications can be run on the individual cores. Some multi-core systems allow combining certain cores together to exploit ILP. Similarly, power can be reduced by turning off the unused cores. In this thesis, we will investigate how multiple cores in a multi-core processor can be combined/split at run-time and how a task running on a core can be migrated to a different core for performance improvement or power reduction.
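
Conceptually, migrating a running task comes down to stopping it at an instruction boundary, capturing its architectural state (registers and program counter), and resuming from that state on the target core. The sketch below only illustrates this idea with assumed interfaces; it is not the actual ρ-VEX task migration mechanism, which is presented in Chapter 4.

    /* Schematic task migration between two cores (assumed interfaces). */
    struct arch_state {
        unsigned int gr[64];   /* general-purpose registers      */
        unsigned int br;       /* branch-condition bits (packed) */
        unsigned int pc;       /* address to resume from         */
    };

    extern void halt_core(int core_id);
    extern void save_state(int core_id, struct arch_state *s);
    extern void restore_state(int core_id, const struct arch_state *s);
    extern void start_core(int core_id);

    void migrate_task(int src_core, int dst_core)
    {
        struct arch_state s;

        halt_core(src_core);          /* stop at an instruction boundary */
        save_state(src_core, &s);     /* capture registers and PC        */
        restore_state(dst_core, &s);  /* load the state into the target  */
        start_core(dst_core);         /* continue execution there        */
    }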

• What is the impact on performance and energy consumption when both the instruction cache and the processor's issue-width are simultaneously reconfigured?

The memory system plays an important role in the performance and power consumption of a processor system. When the processor is reconfigured (e.g., the issue-width is changed), the memory (caches) may also need to be reconfigured for improved performance or reduced power consumption. In this thesis, for a run-time adaptable processor, we will investigate the effect of simultaneous reconfiguration of the issue-width and instruction caches on the performance, dynamic energy consumption, and energy-delay product (EDP) for different benchmark applications.
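
For reference, the energy-delay product used in such comparisons is simply the dynamic energy of a run multiplied by its execution time, EDP = E × T. For example, a configuration that consumes 10 mJ over a 50 ms run has an EDP of 0.5 mJ·s, so a configuration that halves the energy but triples the run time is still worse in terms of EDP (the numbers are illustrative only, not thesis results).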

• Are the implemented designs easily extendable?

User-defined operations can increase the performance and/or reduce the power consumption of a processor. Before implementing a custom operation, a simple method of profiling and simulation to measure its performance is necessary. Because processors are implemented using HDLs, adding a custom operation requires knowledge of hardware and HDLs. Providing a library of different design-time selectable custom operations and a simple methodology to implement additional custom operations increases productivity. In this thesis, we will investigate how custom operations can be profiled and simulated at a higher level (C language), added to a processor hardware design, and how binary code can be generated for them.
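
As an illustration of that profiling step, a candidate custom operation can first be written as an ordinary C function and measured with the compiler/simulator toolchain; only if it proves worthwhile is it bound to a new opcode and a matching execution unit (Chapter 3). The kernel below is a made-up example of ours, a packed sum of absolute differences, not one of the custom operations defined in this thesis.

    #include <stdint.h>

    /* Candidate custom operation, emulated in plain C for profiling:
     * sum of absolute differences over four packed 8-bit lanes. */
    static uint32_t sad4_emulated(uint32_t a, uint32_t b)
    {
        uint32_t sum = 0;
        for (int i = 0; i < 4; i++) {
            int da = (int)((a >> (8 * i)) & 0xFFu);
            int db = (int)((b >> (8 * i)) & 0xFFu);
            sum += (uint32_t)(da > db ? da - db : db - da);
        }
        return sum;
    }
    /* If profiling shows this kernel to dominate the run time, its call
     * sites are replaced by a custom-operation intrinsic and a matching
     * functional unit is added to the processor's HDL description. */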

• Can we implement fault tolerance techniques that are design-time as well as run-time configurable?

In general, hardware-based fault tolerance techniques utilize additional hardware to detect and correct faults. This results in increased area, increased power consumption, and reduced performance. In order to optimize these characteristics, a processor should be able to include/exclude or enable/disable fault tolerance when required. In this thesis, we will investigate how we can develop hardware-based configurable fault tolerance techniques for our configurable processor for hardware area, performance, and power consumption trade-offs.
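
One building block of such hardware-based protection, used in this thesis for flip-flops (Chapter 5), is triple modular redundancy (TMR): keep three copies of each protected value and let a majority vote mask a single upset. The C expression below only illustrates the voting logic bit-wise; in the processor itself the same function is implemented in VHDL.

    #include <stdint.h>

    /* Bit-wise majority vote over three redundant copies: a result bit is 1
     * when at least two of the three copies agree on 1, so a single upset
     * in any one copy is out-voted. */
    static inline uint32_t tmr_vote(uint32_t a, uint32_t b, uint32_t c)
    {
        return (a & b) | (b & c) | (a & c);
    }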

1.4 Methodology

In this section, we propose the different steps needed to combine programmability with reconfigurability to achieve a trade-off between hardware resources, performance, power/energy consumption, and reliability. These steps are:

• Investigate and propose a parametrizable/customizable design of a programmable VLIW processor that can be configured at design-time to match the specific requirements of each application. Implementing such a processor in reconfigurable hardware, such as FPGAs, means that applications can still be written in a high-level language, while taking advantage of the reconfigurability provided by an FPGA. Multiple parameters and their implementation via different mechanisms allow a trade-off between hardware resources, performance, and power/energy consumption. Utilizing a parametrized/customizable design avoids the use of any C-to-VHDL tool, and provides high design flexibility and rapid application development.

• Investigate and propose the parameters of the proposed VLIW processor that can be reconfigured at run-time to match the specific requirements of a running application. Parameters such as the issue-width, the number and type of different FUs, the register file size, etc., affect the performance, hardware area requirement, and power/energy consumption of an application. We will investigate and propose run-time techniques that allow running tasks to migrate from one core to another core in order to improve performance or power consumption characteristics at run-time. Additionally, we will investigate the effect of simultaneous reconfiguration of the issue-width and instruction cache on the behavior of different applications.

• Investigate and propose configurable fault tolerance techniques for the proposed VLIW processor in order to mitigate SEU errors. We will investigate and propose hardware-based techniques which allow fault tolerance in a processor to be included/excluded at design-time and/or enabled/disabled at run-time in order to trade off between hardware resources, performance, power consumption, and reliability.

1.5 Dissertation Organization

The remainder of this dissertation is organized in several chapters. In the following, we present a brief summary of each chapter.

Chapter 2 – Background

Chapter 2 presents the background and motivations for the adaptable VLIW processor system needed for combining programmability and reconfigurability. The chapter highlights the VEX system, which includes the VEX ISA, the VEX C compiler, and the VEX simulator. An earlier design of a VLIW processor is presented and its limitations are listed, which are addressed later in the thesis. Finally, the chapter presents some previous work related to the state-of-the-art in reconfigurable processors.


Chapter 3 – Design-time Configurable Processor

Chapter 3 presents the design and implementation of a parametrized and configurable VLIW processor based on the VEX ISA. The parameters include the processor's issue-width, the type and number of different FUs, the type and size of register files, etc. These parameters can be configured/customized at design-time before implementing the processor in hardware.

Chapter 4 – Run-time Reconfigurable Processor

When the characteristics of an application are not known at design-time, an efficient processor organization may not be selected for it, resulting in reduced performance and/or increased power consumption. In Chapter 4, we extend the processor design presented in Chapter 3 to make it run-time reconfigurable in order to meet the requirements of the running application(s).

Chapter 5 – Configurable Fault Tolerance

Chapter 5 presents hardware-based configurable fault tolerance techniques for our configurable processor. At design-time, users can choose between the standard non-fault-tolerant design, a fault-tolerant design where the fault tolerance is permanently enabled, and a fault-tolerant design where the fault tolerance can be enabled and disabled at run-time. These options enable a user to trade off between hardware resources, performance, power consumption, and reliability characteristics.

Chapter 6 – Experimental Results

Chapter 6 evaluates the effectiveness of our (re)configurable processors presented in the previous chapters. The hardware area/resources and the critical path delay (maximum clock frequency) were already evaluated in those chapters. In this chapter, metrics such as performance (execution cycles, IPC), power/energy consumption, and EDP are utilized for different configurations of the proposed processors and different benchmark applications.

Chapter 7 – Conclusions

Chapter 7 summarizes the work presented in this dissertation and describes the main contributions of the research. Finally, several open issues and future work directions are listed.


2 Background

In Chapter 1, we discussed the advantages and disadvantages of VLIW and superscalar processors in detail. Both processors have multiple parallel execution units to exploit ILP. In the case of a VLIW processor, the compiler is responsible for finding independent operations in a program and issuing them together in a single clock cycle. For a superscalar processor, hardware determines operation dependence and resource availability at run-time. Therefore, the design of a VLIW processor is simpler compared to that of a superscalar processor at the expense of a complex compiler. Because a superscalar processor requires a larger die size and consumes more power, it is not suitable for embedded systems, which require the area and power consumption to be as small as possible. Building a production-quality, high-performance optimizing VLIW compiler requires a large effort; therefore, when considering the space of possible VLIW processor designs, it is always recommended to start with an available ISA and compiler, not the available hardware. Based on this, we started our research by utilizing an available compiler toolchain rather than building a new one. In this chapter, we provide some background information for the work carried out in this dissertation.

The remainder of the chapter is organized as follows. Section 2.1 presents the motivations for an adaptable VLIW processor, discusses the VEX system, introduces an initial design of the ρ-VEX processor, and lists its limitations. Some previous work related to the state-of-the-art in softcore and configurable/fixed processors is presented in Section 2.2. Finally, Section 2.3 concludes the chapter with a summary.


2.1 Adaptable VLIW Processor

An adaptable processor can adapt its organization according to the requirements of an application. This adaptability can be achieved at design-time, i.e., before an application starts execution, or even at run-time while the application is running on the processor. In this thesis, we present an adaptable VLIW processor and highlight its benefits. The processor is based on the VEX ISA [4], and a toolchain [1] (C compiler and simulator) is freely available for architectural exploration and code generation. The processor combines programmability and reconfigurability to achieve high flexibility and high performance at the same time. It provides opportunities to compare performance, hardware resources, power/energy consumption, and reliability trade-offs.

2.1.1 Motivations

As discussed in Chapter 1, our proposal for combining programmability and reconfigurability requires an adaptable/reconfigurable VLIW processor. Instead of the other design philosophies mentioned in Section 1.1.2, we chose a VLIW processor as the starting point because of the following advantages:

• increased performance: Compared to a single-issue RISC processor, a VLIW processor can provide improved performance by exploiting ILP. While RISC architectures can only benefit from temporal parallelism by utilizing pipelining, VLIW architectures can additionally benefit from spatial parallelism by utilizing multiple FUs concurrently (a short illustrative fragment follows this list). A VLIW processor can potentially provide more performance than a same-issue superscalar processor due to the larger room for compiler optimizations.

• reduced power consumption: Because a superscalar processor utilizes complex control hardware for run-time scheduling of instructions, it consumes more power than a VLIW processor. According to an estimate in [5], the scheduling logic of a superscalar processor alone consumes more power than an entire VLIW processor of the same issue-width.

• simple hardware: The compiler takes care of all the dependencies and scheduling in the case of a VLIW processor, while run-time hardware does the same job for a superscalar processor. Therefore, the hardware of a VLIW processor is very simple and straightforward at the expense of a complex compiler, and hence, it can achieve higher clock frequencies to further improve the performance.

• availability of existing tools: The compiler for a VLIW processor is very complex and requires significant effort and time to develop from scratch. Fortunately, for the VEX ISA, a toolchain is freely available from HP. The VEX toolchain [1] includes a parametrized C compiler and simulator which can be used for design space exploration and code generation for different implementations of the VEX processor. Other open-source compilation frameworks, such as Trimaran [9], could also be adapted easily.

• no need for language translations: As stated earlier, designing for FPGAs requires knowledge of hardware and HDLs. Most high-level language programmers do not have this knowledge. High-level-to-HDL translation tools are used instead, which place some restrictions on the high-level languages, and in most cases code rewriting is required when using such tools. With a VLIW processor and its toolchain, programs can still be written in high-level languages (such as C), while taking advantage of the reconfigurability provided by an FPGA.
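To make the spatial-parallelism argument above concrete, consider the following minimal C fragment. It is an illustrative sketch written for this discussion (the function and data are not taken from the VEX distribution or from the thesis benchmarks): the four statements in the loop body are mutually independent, so a VLIW compiler may, resources permitting, place them in a single wide instruction, whereas a single-issue RISC core has to execute them one after another.

#include <stdint.h>

/* Illustrative only: the four operations in the loop body are mutually
 * independent, so a 4-issue VLIW compiler may schedule them into one
 * instruction (spatial parallelism), while a single-issue RISC core
 * needs at least four cycles per iteration for the same work. */
void independent_ops(int32_t *a, int32_t *b, int32_t *c, int32_t *d,
                     const int32_t *x, int32_t k, int n)
{
    for (int i = 0; i < n; i++) {
        a[i] = x[i] + k;    /* independent of the three operations below */
        b[i] = x[i] - k;
        c[i] = x[i] * k;
        d[i] = x[i] ^ k;
    }
}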

Apart from these basic advantages of a VLIW processor, the following are the reconfigurability-specific benefits:

• static reconfigurability: Static reconfigurability means that the processor can be customized for a particular application before it is implemented in hardware. With the help of the simulator, the processor parameters most suited for the targeted application(s) can be evaluated and determined. Hence, optimized designs can be implemented for each application.

• dynamic reconfigurability: Dynamic reconfigurability allows the processor to adapt its organization after it is implemented in hardware. When multiple applications need to be run, or the application's precise characteristics are not known at design-time, a single implementation cannot be optimized for them. In this case, the processor can be designed such that it can change some of its parameters (e.g., issue-width, number of registers and different execution units, cache size, etc.) at run-time to match the specific requirements of the running application(s); a minimal sketch of how such a run-time switch might be exposed to software is given after this list.
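The sketch below is entirely hypothetical: the register address, the encoding, and the function name are not taken from the ρ-VEX documentation. It only illustrates the general idea that system software could request a different issue-width through a memory-mapped configuration register at run-time.

#include <stdint.h>

/* Hypothetical control interface: address and encoding are assumptions,
 * not part of the rho-VEX design described in this thesis. */
#define VLIW_CFG_REG  ((volatile uint32_t *)0x80000000u)  /* assumed address */
#define CFG_ISSUE_2   0x0u
#define CFG_ISSUE_4   0x1u
#define CFG_ISSUE_8   0x2u

static inline void request_issue_width(uint32_t cfg)
{
    /* A real design would only honor the request at a safe point, e.g.,
     * after the pipeline has been drained or between tasks. */
    *VLIW_CFG_REG = cfg;
}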

The fixed nature of traditional VLIW architectures has certain intrinsic disadvantages which prevented them from becoming mainstream processors. These disadvantages can be mitigated by implementing a VLIW processor on reconfigurable hardware. In the following, we highlight the most important problems that arise from the fixed design of a VLIW processor and their solutions:

• different instruction lengths: As stated earlier, different applications have different levels of parallelism, and require different instruction word widths for efficient execution. A fixed processor may not exploit different levels of parallelism very efficiently. This problem can be dealt with by implementing a parametrized and reconfigurable VLIW processor. Different instruction decoders can be instantiated/configured to provide different instruction word widths by either reconfiguring the issue-slots or sharing the unused issue-slots among other cores.

• high number of NOPs: A fixed VLIW processor may not match the parallelism available in an application, and hence, a large number of NOPs may be scheduled. This scenario results in under-utilization of the available hardware resources (an illustrative sketch follows this list). A parametrized/reconfigurable processor can adapt its organization/issue-slots to match the requirements of the application and avoid this under-utilization.

• unavailable FUs per issue-slot: NOPs are also scheduled when issue-slots do not have the required FUs, thus increasing the under-utilization. With a reconfigurable implementation, the required FUs can be added on a per-application basis or even per phase of an application.

• backward compatibility: Code recompilation is needed when a new version of a VLIW processor is released. The reason could be a new organization of the FUs or a different set of added instructions. Backward compatibility can be relaxed by providing dedicated organizational features in the reconfigurable hardware for particular already-compiled code. Similarly, rarely used instructions can be instantiated when needed to support legacy code.
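As a counterpart to the earlier fragment with independent operations, the illustrative sketch below (again not taken from the thesis benchmarks) shows a loop whose operations form a single dependence chain. On a fixed wide-issue VLIW, most issue-slots in every instruction would be filled with NOPs for such code, which is exactly the under-utilization that a reconfigurable issue-width avoids.

#include <stdint.h>

/* Illustrative only: every operation depends on the previous one through
 * 'acc', so there is almost no ILP to exploit. On a fixed 8-issue VLIW
 * most slots in every instruction would carry NOPs; a reconfigurable core
 * could instead switch to a narrower issue-width (or lend its idle
 * issue-slots to another core) for this kind of code. */
uint32_t dependent_chain(const uint32_t *x, int n)
{
    uint32_t acc = 1;
    for (int i = 0; i < n; i++) {
        acc = acc * x[i] + 3;   /* depends on the previous value of acc */
        acc ^= acc >> 5;        /* depends on the line above */
    }
    return acc;
}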

Having stated how a parametrized and reconfigurable VLIW design can overcome the traditional shortcomings of a VLIW processor, in the following, we present the two most likely usage scenarios for such a processor:

1. stand-alone processor: In this scenario, complete applications are compiled and they (or their threads) run on the VLIW processor. The processor can be configured at design-time to suit a particular application. Additionally, it can be reconfigured at run-time to suit multiple applications or multiple code portions of an application.


2. application-specific co-processor: In this scenario, only the compute-intensive kernels are compiled for the VLIW processor, while the remaining part of the application runs on another type of processing element. Hence, there is no need for code rewriting, complex tools such as C-to-VHDL translators, or manual design of accelerators, as in the case of the MOLEN processor [8]. A minimal kernel sketch is given below.
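The sketch below shows the kind of kernel that would be mapped onto the co-processor in the second scenario. The FIR filter itself is illustrative and not taken from the thesis, and the host-to-co-processor hand-off is deliberately omitted because it is platform-specific; the point is that the kernel remains plain C that the VEX compiler can translate for the VLIW core.

#include <stdint.h>

/* Illustrative compute-intensive kernel: compiled with the VEX C compiler
 * and executed on the VLIW co-processor, while the rest of the application
 * runs on the other processing element. No code rewriting or C-to-VHDL
 * translation is needed for this part. The caller must provide
 * n + taps - 1 input samples in x. */
void fir_kernel(const int16_t *x, const int16_t *h, int32_t *y, int n, int taps)
{
    for (int i = 0; i < n; i++) {
        int32_t acc = 0;
        for (int t = 0; t < taps; t++)
            acc += (int32_t)x[i + t] * h[t];
        y[i] = acc;
    }
}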

2.1.2 The VEX System

The VEX (VLIW Example) system was developed by Hewlett-Packard (HP). It includes three basic components: (1) the VEX ISA, (2) the VEX C compiler, and (3) the VEX simulation system. A VEX software toolchain including the compiler and simulator is made freely available by HP [1].

The VEX Instruction Set Architecture

The VEX ISA is a scalable and customizable 32-bit clustered VLIW ISA [4]. It is modeled on the ISA of the HP/ST Lx (ST200) family of successful VLIW embedded processors [10]. The VEX ISA is scalable because different parameters of the processor, such as the number of clusters, the issue-width per cluster, the number and type of different FUs and their latencies, and the number of read/write ports and the size of the register file, can be changed. The ISA is customizable because special-purpose instructions can be defined in a structured way. It includes many features for compiler flexibility and optimization.

The VEX C Compiler

The VEX C compiler [1] is derived from the Lx/ST200 C compiler, which itself is derived from the Multiflow C compiler [11], and includes high-level optimization algorithms based on trace scheduling [12]. It has the robustness of an industrial compiler, has a command line interface, and is available as closed source (in binary form). Because the VEX ISA is scalable and customizable, the compiler also supports this scalability and customizability. A flexible machine model determines the target architecture, which is provided as input to the compiler in the form of a machine model configuration (fmm) file. Hence, architectural exploration of the VEX ISA is possible with the compiler and simulator without the need to recompile the compiler. To add a custom operation, the application code is annotated with pragmas. Different compiler pragmas and optimization options are available for performance improvement [4]. Applications can be compiled with profiling flags, and GNU gprof can also be utilized to visualize the profiled data.
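As an illustration of the kind of custom operation that the compiler's pragma/intrinsic mechanism targets, the sketch below defines a sum-of-absolute-differences operation in plain C. The function names are hypothetical, and the actual VEX annotation syntax, which is documented in the VEX manual [4], is not reproduced here; the fallback implementation merely keeps the sketch self-contained and compilable.

#include <stdint.h>

/* Hypothetical stand-in for a VEX custom operation. In the real toolchain a
 * custom operation is introduced through the compiler's intrinsic/pragma
 * mechanism (see [4]); here a plain C fallback is used instead. When such an
 * operation is mapped onto a dedicated FU in the (re)configurable datapath,
 * each call would become a single VLIW operation instead of this loop. */
static inline uint32_t custom_sad(uint32_t a, uint32_t b)
{
    /* Sum of absolute differences over the four packed bytes. */
    uint32_t sum = 0;
    for (int i = 0; i < 4; i++) {
        int32_t d = (int32_t)((a >> (8 * i)) & 0xFFu)
                  - (int32_t)((b >> (8 * i)) & 0xFFu);
        sum += (uint32_t)(d < 0 ? -d : d);
    }
    return sum;
}

/* Kernel that benefits from the custom operation. */
uint32_t sad_block(const uint32_t *p, const uint32_t *q, int n)
{
    uint32_t total = 0;
    for (int i = 0; i < n; i++)
        total += custom_sad(p[i], q[i]);
    return total;
}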
