Maxwell 11.2, March 2008

Scalable Architectures

The Future is Multicore

"The future is multicore" sounds like a slogan from an Intel commercial. Indeed, a few years ago Intel introduced the Core Duo microprocessor and, more recently, quad-core technology. But is it all pure marketing? Was Intel tired of advertising with clock frequencies? Did they need another buzzword because 'Hyperthreading' became too common? The answer is a definite no! Intel and other processor vendors were forced to put several processors, or cores, on a chip. In other words, they were forced to go multicore.

Authors: Ben Juurlink and Cor Meenderinck

In the past, microprocessor performance almost doubled every 18 months. In the period 1985-2000, performance increased by more than a factor of 1500. This tremendous improvement was due to technology advances, but also to architectural and organizational enhancements. For example, pipelining allows higher clock frequencies because less work needs to be done in a cycle. This technique was used to the max in the Intel Pentium 4 processor, which has more than 20 pipeline stages. Two of these stages are used just to transfer signals across the chip. Another technique that had a significant impact on performance is the exploitation of instruction-level parallelism (ILP). Instead of executing a single instruction per cycle, processors can issue and execute multiple instructions per cycle. This technique requires special hardware to detect independent instructions that can be executed in parallel.
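To see what "independent instructions" means in practice, here is a minimal sketch (our illustration, not from the article): the four additions in the first function share no operands, so a wide superscalar core can issue them in the same cycle, whereas the second function forms a dependency chain that forces one addition per cycle.

```c
#include <stdio.h>

/* ILP illustration: four mutually independent additions that
 * out-of-order hardware can, in principle, execute in parallel. */
int sum8_parallel(const int *v) {
    int s0 = v[0] + v[4];   /* these four adds share no operands, */
    int s1 = v[1] + v[5];   /* so a 4-wide superscalar core can   */
    int s2 = v[2] + v[6];   /* issue them in a single cycle       */
    int s3 = v[3] + v[7];
    return (s0 + s1) + (s2 + s3);  /* reduction: two more levels  */
}

/* The same sum as a dependency chain: each add needs the previous
 * result, so at most one add can complete per cycle.             */
int sum8_serial(const int *v) {
    int s = v[0];
    for (int i = 1; i < 8; i++)
        s = s + v[i];
    return s;
}

int main(void) {
    int v[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };
    printf("%d %d\n", sum8_parallel(v), sum8_serial(v));  /* 36 36 */
    return 0;
}
```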

A few years ago, however, this evolutionary path of ever higher clock frequencies, ever deeper pipelines, ever higher issue widths, and so on, ended. Instead, microprocessor vendors started putting several processors, or cores, on a chip. We already mentioned the Intel Core Duo and quad-core technology. AMD followed shortly with their dual and quad cores. Sun introduced the Niagara family for server workloads, with eight cores per chip, each capable of running four threads in parallel. IBM launched the Power5 and Power6 with two cores each. And of course there is the IBM/Sony/Toshiba Cell processor used in the Sony PlayStation 3 console, which has nine cores.

Power wall

So we have entered the era of multicore, but what caused this shift? The reason is power, in particular power density. In his keynote speech at the International Symposium on Microarchitecture (MICRO-32) in 1999, Fred Pollack of Intel showed a power density diagram of the Intel processors. Like the number of transistors on a chip, which according to Moore's law doubles every 18 months, power density grows exponentially with decreasing technology nodes. The Pentium II in 0.5µm technology surpassed the power density of a hot plate. Extrapolation shows that beyond the 0.1µm node the power density would reach that of a nuclear reactor, and eventually that of a rocket nozzle! In contrast with Moore's law, nobody expects this trend to continue. We will not be walking around with nuclear reactors in our pockets. The power wall is rigid, and this graph merely indicates the seriousness of the power problem.

[Figure: log-scale power density (watts/cm²) versus technology node (1.5µm down to 0.07µm) for the i386 through Pentium 4, with reference levels for a hot plate, a nuclear reactor, and a rocket nozzle.]

Figure 1: Power density trend. Source: Fred Pollack, Intel. Keynote speech MICRO-32, 1999.

As power has become one of the main bottlenecks for performance, the design constraints for computer architects have changed accordingly. Because dynamic power grows quadratically with the supply voltage and linearly with the frequency, and because the frequency is more or less proportional to the supply voltage, it is more power efficient to place two cores on a chip running at 1 GHz than to use one core running at 2 GHz. The new metric is therefore performance per watt. Furthermore, techniques that increase ILP, such as superscalar and out-of-order execution and hyper-pipelining, have become extremely expensive as they are very power inefficient. The way to improve performance while staying within the power budget is to trade ILP for thread-level parallelism (TLP). That is, some ILP is sacrificed by omitting the power-inefficient techniques mentioned above. Instead, the available transistors are used to increase TLP by placing multiple cores on a chip. Duplicating cores like this produces a symmetric, or homogeneous, multicore consisting of identical cores.
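To make the two-cores-at-half-the-frequency argument concrete, here is the back-of-the-envelope calculation the paragraph implies (a sketch; the proportionality of frequency to supply voltage is the idealization stated above):

```latex
% Dynamic power with switched capacitance C:
%   P = C V^2 f.  Assuming f is proportional to V gives P ~ f^3.
\[
  P_{\mathrm{dyn}} = C\,V^{2} f, \qquad f \propto V
  \;\Longrightarrow\; P_{\mathrm{dyn}} \propto f^{3}.
\]
% One core at 2 GHz burns 2^3 = 8 units; two cores at 1 GHz burn
% 2 * 1^3 = 2 units -- the same aggregate clock throughput for
% roughly a quarter of the power (if the workload parallelizes).
\[
  \frac{P_{\text{one core @ 2 GHz}}}{P_{\text{two cores @ 1 GHz}}}
  = \frac{2^{3}}{2 \cdot 1^{3}} = 4.
\]
```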

From multi to many

So, how many cores can we expect on a chip in the not-so-distant future? Of course it is difficult to predict, especially the future, but we can make a good effort if we base our prediction on Moore's law.

For example, we can do the following case study: take an existing processor core whose size is known, scale it to future technology nodes, and create a hypothetical homogeneous chip multiprocessor from those scaled cores. This produces a reasonable estimate of what a future multicore might look like. Let's take the Alpha 21264 processor as a starting point. This core dates from 1998, is modestly sized, and represents what we expect to be an average core of the future. The 21264 is built in 0.35µm technology and has a die area of 314 mm². Scaling is done using the data of the International Technology Roadmap for Semiconductors (ITRS), which provides detailed predictions for all technology parameters. For example, in 2015 the technology size is expected to be 25 nm, and in that node the 21264 would be 1.6 mm² large (better: 1.6 mm² small). According to the roadmap the die area does not grow, so in 2015 a total of 196 cores would fit on a die. Figure 2 shows the number of cores per die as a function of time: the number of cores grows exponentially. Researchers generally believe that 100 to 1000 cores per die are attainable. Such a system is no longer called a multicore but a manycore.
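A minimal sketch of this scaling exercise (the 314 mm² die budget and the 25 nm endpoint follow the article; the generic node sequence and the assumption that core area scales with the square of the feature size are our simplifications, not the ITRS year-by-year data):

```c
#include <stdio.h>

/* Shrink the Alpha 21264 (314 mm^2 at 0.35 um) to smaller nodes
 * and count how many copies fit on a fixed 314 mm^2 die, assuming
 * core area scales with the square of the feature size.          */
int main(void) {
    const double die_mm2 = 314.0;   /* die budget stays constant  */
    const double base_nm = 350.0;   /* 0.35 um baseline           */
    const double nodes_nm[] = { 350, 180, 130, 90, 65, 45, 32, 25 };

    for (int i = 0; i < 8; i++) {
        double shrink = nodes_nm[i] / base_nm;
        double core_mm2 = die_mm2 * shrink * shrink;
        printf("%4.0f nm: core = %7.2f mm^2, cores/die = %4.0f\n",
               nodes_nm[i], core_mm2, die_mm2 / core_mm2);
        /* at 25 nm: core = 1.60 mm^2, cores/die = 196, matching
         * the figures quoted in the article                      */
    }
    return 0;
}
```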

From homogeneous to heterogeneous

Although moving to homogeneous multicores postpones it a bit, in the future they too will hit the power wall. So what can be done then? One way is to go heterogeneous: use different cores that are particularly suited for a certain application or application domain. The Cell processor is already an example of such a heterogeneous multicore; it consists of one Power Processing Element (PPE) and eight Synergistic Processor Elements (SPEs). The first is a general-purpose processor and acts as the control processor. The latter are domain-specific cores specialized for the pixel processing needed in games. An SPE is a so-called Single-Instruction Multiple-Data (SIMD) architecture. That means its instructions do not process single (scalar) values, as instructions of conventional processors do, but multiple values at once. In essence, each SPE instruction is a short vector instruction. This is very efficient from a power perspective, because less power is spent on fetching, decoding, and issuing instructions. In the SARC project (see sidebar) we are investigating such heterogeneous multicores of the future.
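As an illustration of the SIMD idea, here is a sketch using x86 SSE intrinsics rather than the SPE's own instruction set (the principle is the same): one instruction below performs four additions at once.

```c
#include <stdio.h>
#include <xmmintrin.h>  /* x86 SSE intrinsics */

/* SIMD illustration: the SPE has its own vector instruction set,
 * but the idea matches this SSE sketch -- a single instruction
 * operates on four packed floats simultaneously.                 */
int main(void) {
    float a[4] = {  1.0f,  2.0f,  3.0f,  4.0f };
    float b[4] = { 10.0f, 20.0f, 30.0f, 40.0f };
    float c[4];

    __m128 va = _mm_loadu_ps(a);      /* load 4 floats            */
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_add_ps(va, vb);   /* ONE instruction, 4 adds  */
    _mm_storeu_ps(c, vc);

    printf("%.0f %.0f %.0f %.0f\n", c[0], c[1], c[2], c[3]);
    return 0;
}
```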

Software challenges

From an architectural point of view manycores are rocking, but software developers might feel a whole lot different about it.

SARC PROJECT

In the SARC project (www.sarc-ip.org), which is funded by the European Union, a consortium of European research groups, including the Computer Engineering section of TU Delft, is studying heterogeneous multicores. The objective of this project is to investigate a scalable, power-efficient, and programmable computer architecture, together with the necessary programming models, compiler support, on- and off-chip networks, and design space exploration tools. The SARC project provides work for many PhD and MSc students. If you're an MSc student interested in joining the project, please contact one of the authors at b.h.h.juurlink@tudelft.nl or c.h.meenderinck@tudelft.nl.

Figure 2: Case study of a future chip multiprocessor.


Programming a single core is already quite complex; programming a manycore is an even bigger challenge, one that in the past was reserved for specialists. In existing multicores TLP is mainly exploited by running a different application on each processor, which works because there are only a few cores. For manycores this strategy will not work. To take advantage of a manycore, applications must be highly parallelized, i.e., decomposed into threads that can be processed simultaneously. Although some knowledge is available from the supercomputing field, a lot of work remains to be done in this area.
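A minimal sketch of what such decomposition looks like at the code level (POSIX threads; the work-splitting scheme is our illustration, not anything prescribed by the article or the SARC project): each thread sums a disjoint slice of an array, and the slices are independent, so they can run on different cores at the same time.

```c
#include <pthread.h>
#include <stdio.h>

#define N_THREADS 4
#define N 1000000

/* TLP illustration: one application decomposed into threads that
 * process disjoint, independent slices of the data.              */
static double data[N];
static double partial[N_THREADS];

static void *worker(void *arg) {
    long id = (long)arg;
    long lo = id * (N / N_THREADS), hi = lo + N / N_THREADS;
    double s = 0.0;
    for (long i = lo; i < hi; i++)
        s += data[i];
    partial[id] = s;   /* each thread writes its own slot */
    return NULL;
}

int main(void) {
    pthread_t t[N_THREADS];
    for (long i = 0; i < N; i++) data[i] = 1.0;
    for (long i = 0; i < N_THREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    double total = 0.0;
    for (long i = 0; i < N_THREADS; i++) {
        pthread_join(t[i], NULL);
        total += partial[i];
    }
    printf("sum = %.0f\n", total);  /* prints: sum = 1000000 */
    return 0;
}
```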

To get at least an idea of the number of threads that can be found in applications, we analyzed the parallelism available in H.264 video decoding. Video processing is expected to be an important future workload, from low-power mobile devices up to high-performance servers. The first will be used to watch video streams, while the latter might be used to browse your video archive to find all videos with Aunt Betty in them. H.264 is the latest video compression standard and offers the best picture quality and compression ratio. H.264 videos consist of a sequence of pictures called frames, and each frame is subdivided into separate areas called macroblocks. H.264 decoding is difficult to parallelize due to the many dependencies created at encoding time. Steps in H.264 such as motion compensation, intra prediction, and the deblocking filter all create dependencies between the macroblocks within a frame, as well as between macroblocks in different frames. However, researchers of the Computer Engineering section of TU Delft have found a novel strategy to exploit parallelism within video frames concurrently with parallelism among frames. For full HD movies, a maximum parallelism of 5000 up to 9000 threads was found. This is more than sufficient for the manycore of the year 2020, and much more than what is needed for real-time decoding. But having many more threads than cores allows optimization of the application, for example by creating coarser-grained threads: collapsing some of the fine-grained threads amortizes the overhead of thread creation.
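To illustrate where the intra-frame parallelism comes from, here is a sketch of the classic macroblock wavefront: block (x, y) depends on its left, top-left, top, and top-right neighbours, so all blocks on the same diagonal x + 2y are mutually independent. This counting scheme is our illustration of the wavefront idea, not the TU Delft algorithm itself (which additionally overlaps the wavefronts of many frames to reach thousands of threads).

```c
#include <stdio.h>

/* Macroblock (x, y) depends on (x-1, y), (x-1, y-1), (x, y-1),
 * and (x+1, y-1); all blocks with equal x + 2*y are independent.
 * Count how many blocks are ready per step for a full-HD frame.  */
int main(void) {
    const int W = 120, H = 68;      /* 1920x1088 in 16x16 blocks  */
    int max_parallel = 0;
    for (int step = 0; step <= (W - 1) + 2 * (H - 1); step++) {
        int ready = 0;
        for (int y = 0; y < H; y++) {
            int x = step - 2 * y;   /* block on this diagonal?    */
            if (x >= 0 && x < W) ready++;
        }
        if (ready > max_parallel) max_parallel = ready;
    }
    printf("max macroblocks in flight per frame: %d\n", max_parallel);
    /* prints 60; decoding many frames' wavefronts concurrently
     * multiplies this figure into the thousands                  */
    return 0;
}
```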

Challenges

Computer architecture has entered a whole new era with brand new challenges. Instead of increasing clock frequencies and boosting ILP, we have entered the era of multicores. This raises the questions of how to design such multicores so that they can be exploited efficiently, and how to map applications onto them. The Computer Engineering section has taken up this challenge and is currently answering many of those questions.


Figure 3: Sun’s Niagara 2 (a.k.a. UltraSPARC T2) die.
