
FINDING REPRESENTATIVE WORKLOADS FOR COMPUTER SYSTEM DESIGN

DISSERTATION

for the degree of doctor at the Technische Universiteit Delft, by authority of the Rector Magnificus, prof. dr. ir. J.T. Fokkema, chairman of the Board for Doctorates, to be defended in public on Tuesday 18 December 2007 at 17:30

by

Jan Lodewijk Bonebakker, astronomer (sterrenkundige)


Promotors:
Prof. dr. H.G. Sol
Prof. dr. ir. A. Verbraeck

Composition of the doctoral committee:
Rector Magnificus, chairman
Prof. dr. H.G. Sol, Technische Universiteit Delft, promotor
Prof. dr. ir. A. Verbraeck, Technische Universiteit Delft, promotor
Prof. dr. D.J. Lilja, University of Minnesota, USA
Prof. dr. ir. K. De Bosschere, Universiteit Gent, Belgium
Prof. dr. P.T. de Zeeuw, Universiteit Leiden
Prof. dr. ir. H.J. Sips, Technische Universiteit Delft
Prof. dr. ir. W.G. Vree, Technische Universiteit Delft

Copyright © 2007 by Lodewijk Bonebakker

All rights reserved worldwide. No part of this thesis may be copied or sold without written permission of the author.

Trademark notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation, without intent to infringe.

The data and the data collection tools presented in this thesis are the intellectual property of Sun Microsystems, Inc. The data are presented solely for the purpose of this thesis and may not be used without written permission from both the author and Sun Microsystems, Inc.

ISBN/EAN: 978-90-5638-187-5
Cover: Lucas Bonebakker

Printing: Grafisch Bedrijf Ponsen & Looijen b.v. Wageningen, The Netherlands


CONTENTS

1. The importance of workloads in computer system design . . . . 1

1.1 Introduction . . . 3

1.2 Designing computer systems and processors . . . 3

1.2.1 The iterative design process . . . 4

1.2.2 Workload characterization . . . 6

1.2.3 Performance evaluation . . . 7

1.2.4 Benchmarks . . . 8

1.3 Overview of processor and computer system design . . . 9

1.4 Processor design cases . . . 12

1.4.1 Intel Pentium 4 . . . 12

1.4.2 Sun Microsystems UltraSPARC III . . . 13

1.4.3 Intel Itanium I . . . 15

1.5 Computer system design considerations . . . 16

1.6 Workload and benchmark considerations . . . 17

1.7 Increasing complexity and increasing diversity . . . 18

1.8 Representing real workload characteristics in the design process . . . . 19

1.9 Research questions . . . 21
1.10 Research approach . . . 22
1.10.1 Philosophy . . . 22
1.10.2 Strategy . . . 23
1.10.3 Instruments . . . 24
1.11 Research outline . . . 24

Part I Context of representative workload selection 27

2. Current approaches for workload selection in processor and computer system design . . . 29

2.1 Selecting workloads for commercial computer system design . . . 31

2.2 Standardized performance evaluation . . . 32

2.2.1 SPEC CPU2000 . . . 33

2.2.2 TPC-C . . . 35


2.3 Selecting the correct benchmarks for the design problem . . . 36

2.4 Requirements for optimal benchmark sets . . . 37

2.5 Finding benchmarks in practice . . . 42

2.5.1 Predicting application performance . . . 42

2.5.2 Representing application characteristics . . . 44

2.6 Reducing redundancy in benchmark sets . . . 45

2.6.1 Reducing simulation time as motivation . . . 45

2.6.2 Approaches for removing benchmark set redundancy . . . 47

2.6.3 Evaluating benchmark quality . . . 50

2.6.4 Summarizing sub-setting techniques . . . 50

2.7 Summary of benchmark selection . . . 52

3. Towards an unbiased approach for selecting representative workloads . . . 53

3.1 Approach blueprint . . . 55

3.2 Sources of representative metrics . . . 56

3.3 Requirements . . . 62

3.4 Reflecting upon the research hypotheses . . . 65

4. Constructing the approach . . . 67

4.1 Formulating our implementation approach . . . 69

4.1.1 Evaluating the quality of the representative set . . . 69

4.1.2 Evaluating workload clustering . . . 71

4.1.3 Workload clustering . . . 71

4.1.4 Dimensionality reduction . . . 73

4.1.5 Normalization of metric data . . . 74

4.1.6 Workload characterization . . . 75

4.2 Collecting workload characterization data . . . 75

4.3 Reducing workload characterization data . . . 78

4.4 Selecting representative metrics for spanning the workload space . . . . 79

4.5 Reducing workload space dimensionality . . . 79

4.5.1 Normalization . . . 80

4.5.2 Removing correlated metrics . . . 81

4.5.3 Principal Component Analysis . . . 82

4.5.4 Independent Component Analysis . . . 82

4.5.5 Generalized additive models . . . 85

4.5.6 Other dimensionality reduction techniques . . . 88

4.6 Partitioning the workload space with clustering algorithms . . . 88

4.6.1 K-means clustering . . . 89

4.6.2 K-means clustering and the Bayesian Information Criterion . . 90

4.6.3 Model based clustering . . . 91

4.6.4 MCLUST for model based cluster analysis . . . 92


4.8 Selecting the representative workloads . . . 94

4.9 Quantifying metric sampling error on computer system workloads . . . 94

5. Testing the methodology on a benchmark set . . . 101

5.1 SPEC CPU2000 similarity in simulation . . . 103

5.2 Characterizing SPEC CPU2000 using processor hardware counters . . . 104

5.2.1 Collecting component benchmark hardware counter data . . . . 104

5.2.2 Reduction, PCA and clustering . . . 106

5.2.3 Clustering results . . . 107

5.3 Comparing clustering . . . 110

5.3.1 Similarity score . . . 113

5.3.2 Monte-Carlo simulation . . . 113

5.3.3 Similarity score and probability results . . . 115

5.4 On the µA-dependent and µA-independent debate . . . 116

5.5 Reflecting on the differences in similarity . . . 121

Part II Selecting a representative set from collected workloads 125

6. Collecting and analyzing workload characterization data . . . 127

6.1 Workload characterization . . . 129

6.1.1 WCSTAT . . . 131

6.1.2 Measurement impact . . . 131

6.1.3 Origins of the workload data . . . 132

6.2 Data cleaning . . . 132

6.3 Workload validation . . . 133

6.3.1 Data errors . . . 133

6.3.2 Workload error checking . . . 134

6.3.3 Workload stability analysis . . . 135

6.3.4 Accepted workload list . . . 140

6.3.5 Reflecting on workload validation . . . 140

6.4 Data reduction . . . 141

6.5 Metric selection . . . 142

6.6 System standardization . . . 144

6.6.1 The impact of system utilization . . . 144

6.6.2 Compensating for different system configurations: system normalization . . . 145

6.7 Workload categorization . . . 147

6.8 Reflection on the methodology . . . 147

6.8.1 Bias in system metric selection . . . 147


6.8.3 Describing the final workload characterization data set . . . 151

7. Grouping together similar workloads . . . 155

7.1 From data to clusters . . . 158

7.2 Metric normalization . . . 159

7.3 Dimensionality reduction . . . 160

7.4 Clustering . . . 160

7.5 Cluster comparison . . . 161

7.5.1 Measuring cluster locality . . . 162

7.5.2 Comparing clusterings . . . 164

7.6 Describing workload similarity and clustering in the dataset . . . 166

7.6.1 Locality in the workload set . . . 166

7.6.2 Locality in the clustering solution . . . 166

7.6.3 Differences between the clusters . . . 169

7.6.4 Reflecting on workload similarity and clustering . . . 170

7.7 Does a different dataset change the workload . . . 174

7.8 A workload is the combination of application and dataset . . . 175

7.9 Non-determinism as source of variation . . . 177

7.10 Method stability and practical considerations . . . 180

8. Finding representative workloads in the measured dataset . . . 183

8.1 Requirements for representative workloads . . . 185

8.2 Guiding workload selection . . . 187

8.3 Selecting the candidate workloads . . . 188

8.4 Representative workload selection . . . 191

8.5 Validating the representative set . . . 194

8.5.1 Insufficient size . . . 195

8.5.2 Excessive redundancy . . . 195

8.5.3 Workload outliers . . . 195

8.5.4 Non-uniform distribution . . . 196

8.5.5 Quantitative validation of representativeness . . . 198

8.6 Looking back at representative workload selection . . . 200

8.7 Addressing common knowledge with our dataset . . . 202

8.7.1 Representativeness of SPEC CPU2000 . . . 202

8.7.2 Representativeness of all benchmarks . . . 204

8.7.3 Diversity in database workloads . . . 206

8.8 Summary . . . 208

Part III Evaluating representative workload selection 209

9. Evaluating workload similarity . . . 211


9.1 Evaluation against requirements . . . 214

9.1.1 Requirement 1: Ease of measurement . . . 214

9.1.2 Requirement 2: Efficient processing . . . 215

9.1.3 Requirement 3: Cheap and efficient data collection . . . 216

9.1.4 Requirement 4: Standardized data collection . . . 216

9.1.5 Requirement 5: Non-interfering data collection . . . 216

9.1.6 Requirement 6: Unbiased workload metric selection . . . 217

9.1.7 Requirement 7: Quantitatively unbiased workload similarity . . 218

9.1.8 Requirement 8: Expedient determination of workload similarity . . . 218
9.1.9 Summary of evaluation against the requirements . . . 219

9.2 Evaluating the prescriptive model consequences . . . 219

9.3 Evaluation against the hypotheses . . . 221

9.4 Evaluation against the research questions . . . 222

9.5 Reflecting on our approach . . . 225

9.5.1 Using computer system metrics and hardware counters . . . 225

9.5.2 Using workload characterization data . . . 226

9.5.3 Grouping together similar workloads . . . 226

9.5.4 Representative workload selection . . . 227

9.6 Future work and recommendations . . . 227

Appendix 231

A. Tables . . . 233

A.1 Opteron performance counters . . . 234

A.2 UltraSPARC IIIi performance counters . . . 235

A.3 Workflow legend . . . 236

A.4 Workload clustering result . . . 237

A.5 Measured metrics . . . 241

B. Additive Model for the Instruction Count . . . 247

B.1 Constructing the model . . . 247

B.1.1 Construction of B-splines . . . 252


1. THE IMPORTANCE OF WORKLOADS IN COMPUTER SYSTEM DESIGN

ABSTRACT

[Figure: the deductive-hypothetic research strategy mapped onto the thesis chapters: define the question (Chapter 1 - The importance of workloads in computer system design), gather information and resources (Chapter 2 - Current approaches for workload selection in processor and computer system design), form hypothesis (Chapter 3 - Towards an unbiased approach for selecting representative workloads), prescriptive model (Chapter 4 - Constructing the approach), early test (Chapter 5 - Testing the methodology on a benchmark set), collect data (Chapter 6 - Collecting and analyzing workload characterization data), analyze data (Chapter 7 - Grouping together similar workloads), interpret data and formulate conclusions (Chapter 8 - Finding representative workloads in the measured dataset), present conclusions (Chapter 9 - Evaluating workload similarity).]


1.1 Introduction

A computer system is the end result of a long, complex, multi-year, multi-dimensional development process. The resultant computer system represents the best possible solution to a design challenge given restrictions in time, technology and resources. Kunkel et al. (2000) explain that many decisions in the computer system, specifically the processor, were made years before the computer system or the processor even existed. This thesis looks at the problem of providing accurate information to computer system and processor designers during the design and implementation stages. It develops and evaluates a methodology that enables processor and computer system designers to take into account information from current computer system usage when considering and evaluating new computer system and processor designs. Incorrect information can lead to incorrect design trade-offs. By supplying and updating relevant information during the design and implementation stages, designers can keep their design current with the marketplace. The technological challenges involved in computer system design will likely increase the time between initial design specification and market release. This necessitates flexible approaches to computer system and processor design which can adapt to market changes. Computer system designers have no influence on the diversity of applications and usage of computer systems; they can only react to trends in the marketplace.

The contribution of this work is to provide a clear methodology by which processor and computer system designers can use information on actual computer system usage during the design process. The expected benefits are that many of the trade-offs can be judged against usage characteristics present in the marketplace. This methodology would help prevent poor design choices that might not be discovered using the limited set of benchmarks common to most current design and evaluation studies.

This first chapter provides a description of the design process and defines the terms workload and benchmark. It establishes why and how workloads and benchmarks are used in the computer system design process. We discuss three processor design cases that illustrate the need for better workload characterization information. The limitations of the design process are discussed, leading to the research question. After the research question, the underlying research philosophy, research approach and research instruments are introduced.

1.2 Designing computer systems and processors


[...] Rowe et al., 1996). The systems engineering discipline has associated with it a language, i.e., a collection of terms with defined meaning, used to express the nuances of system design and its approaches. Sage (1995) defines systems engineering as the art and science of producing a product, based on phased efforts, that satisfies user needs. The system is functional, reliable, trustworthy, of high quality, and has been developed within cost and time constraints through the use of an appropriate set of methods and tools. In this thesis we follow the terminology of systems engineering to describe the development process.

Modeling and simulation are the primary tools for performance analysis and prediction of proposed computer system designs (Austin et al., 2002; Yi et al., 2005). Simulators are the dominant tool for evaluating computer architecture, offering a balance of cost, timeliness and flexibility (Yi et al., 2006). Model complexity and slow execution speed necessitate that system models are developed hierarchically (Yi and Lilja, 2006). Representations of computer system behavior are extensively used to provide input to the models. The system design process uses iterative refinement of the model (Schaffer, 1996), at each step evaluating its principal parameters, e.g., cost, performance, functionality and physical constraints (Kunkel et al., 2000). In short, computer system design is the application of an iterative design process in a hierarchical modeling environment (Schaffer, 1996; Coe et al., 1998). Next, we expand on this iterative design process.

1.2.1 The iterative design process

The iterative design process uses the predictions and results from computer system models to determine the design changes and optimizations for the next model iteration. The modeling environment consists of performance and component models of the computer system at varying levels of detail. These models vary from very simple and rough performance estimates to highly complex and detailed simulations (Kunkel et al., 2000). The modeling hierarchy abstracts details away to component models lower in the hierarchy. Conversely, abstractions based on predictions from component models are used at the higher levels in the modeling hierarchy (Zurcher and Randell, 1968; Kumar and Davidson, 1980; Mudge et al., 1991; Stanley and Mudge, 1995; Kunkel et al., 2000; Hennessy and Patterson, 2003; Larman and Basili, 2003). Processor, memory system, or whole computer system designs are iteratively refined until the performance criteria have been optimized for the given constraints (Flynn, 1995; Schaffer, 1996; Coe et al., 1998). Although the actual process varies between companies and between the types of computer systems or components under design, there is a set of common steps (Zurcher and Randell, 1968; Kunkel et al., 2000; Larman and Basili, 2003), illustrated in Figure 1.1:


[Figure content: strategic workloads and technological developments feed the initial planning of an iterative incremental development cycle (planning, requirements, analysis & design, implementation, testing/verification, evaluation) around a model/prototype in a hierarchical, multi-level simulation environment, evaluated on parameters such as functionality, performance, cost and physical requirements, leading to the final design.]

Fig. 1.1: Iterative design strategy investigated.

2. Project dominant workloads: which workloads or applications are most likely going to benefit from the intended architecture, or, which workloads are the target of the intended architecture. These workloads need to be thoroughly characterized and performance data collected.

3. Create a hierarchical, multi-level simulation environment: this environment will be the principal means of investigating design trade-offs and performance opti-mization.

4. Run an iterative incremental development cycle using the simulation environment to gradually increase the level of detail and complexity in the models to the point that the models present a fair representation of the desired computer system. The steps in the development cycle echo those of Boehm (1986): planning, analysis & design, deployment, testing and evaluation.

5. Review relative to goals. After several development cycles, the overall design effort is reviewed relative to the goals; if necessary, either the goals or the development effort is adapted.

6. In parallel with the iterative incremental development cycle: do the verification of the proposed computer system logic.


Projecting the dominant workloads, as listed in item 2, requires the ability to quantitatively evaluate many workloads and place them in the context of the marketplace (Kunkel et al., 2000). During the iterative design cycle simulation is the instrument of choice. The time and cost involved in building computer system prototypes makes even the most complex simulation cost competitive (Kunkel et al., 2000; Eeckhout et al., 2002; Alameldeen et al., 2003).

While the modeling environment with its simulators facilitates performance evaluation of future computer systems, it is essential to understand what that performance is relative to. As mentioned in item 2, workloads are the essential ingredient providing the performance context.

1.2.2 Workload characterization

Ferrari (1978) defines a workload as the work performed by a computer system over a period of time. All inputs into the computer system, e.g., all demands that require execution time, are part of the workload definition. During execution a workload utilizes the resources of the computer system: processor, processor cache, system bus, memory, disks, etc. The resource utilization of a workload depends on the design of the underlying computer system and the implementation of the workload. Different workloads can require different computer system resources (Ferrari, 1972; Agrawala and Mohr, 1975; Agrawala et al., 1976; Menascé, 2003).

Workload characterization describes the resource utilization of a workload on an existing computer system. Workload specific characteristics determine the resources required of a computer system to achieve a desired level of performance. Most workloads can be described independent of the computer system. Specifically, workloads that are used for comparing different computer systems should be defined independent of the computer system implementation (Gustafson and Snell, 1995). However, describing a workload independent of the computer system with sufficient detail for computer systems comparison can require a significant investment in time and resources. Workloads can be described by their intended work (e.g., the number of served web-page requests), or they can be described by their impact on the computer system (e.g., the utilization of the processor). The goal of workload characterization is to create a workload description that can be used for selection, improvement and design studies. These areas share common techniques of workload characterization and the distinction is based more on the purpose of the workload characterization than its execution (Ferrari, 1978):

Selection studies use workload characterization to assist in the selection of computer systems or components.

Improvement studies use workload characterization to determine the changes to an existing computer system that improve its performance. A workload model, measured on the unmodified system, can be used to evaluate the performance of the modified system. The requirement for the workload model is that it is modification independent, thus the workload model must be transportable.

Design studies require workload characterization to do performance analysis. Without workload characterization, computer systems and components cannot be quantitatively improved. The characterization and analysis of workloads is essential to provide the designers of computer systems a goal to work towards. For design studies the specification and selection of workloads to use is probably the most difficult and significant of the three studies.

Workloads provide the information on resource utilization required in the hierarchical modeling environment. Different workloads must be considered when designing a computer system because computer system performance depends both on the workload characteristics and its design (Kunkel et al., 2000). Workload characterizations from real computer systems provide quantitative information on such resource utilization. Design changes can lead to changes in the utilization of computer system components. To evaluate these changes an implementation independent characterization of the workload is required (Ferrari, 1978).
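As a minimal illustration of what such an implementation independent characterization could look like in practice, the Python sketch below records a workload as a vector of resource-utilization metrics averaged over measurement intervals. The metric names and values are hypothetical placeholders, not the metric set or the WCSTAT tool used later in this thesis.

    from dataclasses import dataclass, fields

    @dataclass
    class WorkloadSample:
        """One measurement interval of a running workload (hypothetical metrics)."""
        cpu_utilization: float        # fraction of time the processors were busy
        instructions_per_cycle: float # retired instructions per processor cycle
        l2_miss_rate: float           # L2 cache misses per instruction
        memory_bandwidth_mb_s: float  # memory traffic in MB/s
        disk_iops: float              # disk I/O operations per second

    def characterize(samples):
        """Average per-interval samples into a single characterization vector."""
        return {f.name: sum(getattr(s, f.name) for s in samples) / len(samples)
                for f in fields(WorkloadSample)}

    # Example: two samples of a hypothetical database workload.
    samples = [WorkloadSample(0.72, 0.9, 0.012, 1800.0, 5200.0),
               WorkloadSample(0.68, 0.8, 0.015, 1700.0, 4900.0)]
    print(characterize(samples))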

For design studies, computer systems designers want to select relevant workloads. Relevant workloads are assumed to contain important workload characteristics. The relevance of these workloads can be determined by their frequency among customers, the value of the computer systems required, or their importance for marketing reasons (Kunkel et al., 2000). The primary goal for computer system architects is to increase the performance of a computer system for relevant workloads. Computer system architects must include information from many different workloads to make appropriate design trade-offs in general-purpose computer systems. A good computer system design will not let a single resource limit performance for relevant real workloads. Resources are balanced to achieve the best possible performance on relevant workloads (Kunkel et al., 2000).

1.2.3 Performance evaluation


Alternatives to simulation include building hardware prototypes or implementing logic in field programmable gate arrays (FPGA). Both these alternatives are not realistic due to the cost of their implementation and the cost of validation: validating simulation results is straightforward compared to debugging an actual prototype or FPGA implementation. Simulation itself is not cheap. Modern computer systems can execute billions of instructions per second; therefore cycle accurate simulators need to efficiently simulate these billions of events. Cycle accurate simulators may take days to simulate a billion instructions. The best simulators introduce only several orders of magnitude slowdown when simulating detailed execution (Eeckhout et al., 2002; Alameldeen et al., 2003). Hierarchical simulation improves the performance of the higher level simulation by relying on the results of more detailed and slower simulation in the lower levels of the simulation hierarchy.
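To make this cost concrete, the following back-of-the-envelope sketch estimates cycle-accurate simulation time for an assumed native execution rate and a few assumed slowdown factors; the numbers are illustrative only, not measurements from any particular simulator.

    # Back-of-the-envelope cycle-accurate simulation cost (all numbers assumed).
    native_rate = 2e9              # instructions/s on a hypothetical 2 GHz, 1 IPC system
    workload_instructions = 1e9    # one billion instructions of the workload

    for slowdown in (1e3, 1e4, 1e5):             # "several orders of magnitude"
        simulator_rate = native_rate / slowdown  # instructions simulated per second
        hours = workload_instructions / simulator_rate / 3600
        print(f"slowdown {slowdown:10,.0f}x -> {hours:7.1f} hours for 1e9 instructions")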

While simulation is the most cost-effective way of verifying computer system designs on certain workloads, it is still expensive in time and resources. This high cost greatly limits the applicability of simulation in the design cycle. It is impossible to do detailed simulations of all workloads of interest due to the cost and time associated with the simulation process. Characterizing workloads at the required level of detail and then running the simulations for all those workloads cannot be completed within a reasonable time or cost (Darringer et al., 2000; Magnusson et al., 2002; Kim et al., 2007). As a result, the designers of computer systems have to choose which relevant workloads to use during the design process. Naturally designers will want to use convenient workloads - workloads that are less expensive to use in time and resources but still provide the required level of detail.

1.2.4 Benchmarks


The convenience of benchmarks over real workloads is such that using real workloads in the design cycle is no longer considered.

As noted before, with sufficiently accurate simulators, computer system architects reduce risk by evaluating their designs using benchmarks that represent the most important characteristics of real and emerging workloads. Choosing the representative benchmarks therefore is of the utmost importance to the designers. Over time, real workloads evolve and new workloads emerge, so relevant workload characteristics can change. This evolution creates the need to detect important emerging workload characteristics and represent them in benchmarks (John et al., 1998; Skadron et al., 2003). Without new benchmarks that represent changes in computer usage, the designers of computer systems can only rely on their limited set of standardized benchmarks. Certain common benchmarks, i.e., from SPEC (www.spec.org, 2007) and TPC (www.tpc.org, 2007), are used by most computer system architects during the design of commercial servers and processors. These benchmarks provide valuable metrics for comparing performance of existing computer systems. Achieving good performance on these benchmarks is important based on the value attributed to these benchmarks by consumers. However, as we shall investigate later, these standardized benchmarks may not represent actual usage of computer systems. Most real workloads and computer system configurations differ from the standardized benchmarks, thus computer system designers might require information from real workloads to determine the configuration attributes of computer systems. These configuration attributes are for example the number of supported processors, the memory size, I/O capacity, etc. Designing and configuring computer systems solely based on information from a limited set of benchmarks could deny the diversity of workloads in the world.

1.3 Overview of processor and computer system design

Computer system and processor designers generally aim for the best achievable performance on relevant workloads (Kunkel et al., 2000; Hennessy and Patterson, 2003). While all processors do more or less the same thing, not all are created equal. Much depends on a processor's internal workings, called the micro-architecture, abbreviated to µ-architecture. The µ-architecture performs five basic functions: data access, arithmetic operation, instruction access, instruction decode and write back of results.


[Figure content: processor with L1 and L2 caches, main memory (RAM) and disk connected over the system bus; typical access latencies: cache 2 ns and 14 ns, main memory 200 ns, hard disk 2,000,000 ns.]

Fig. 1.2: Processor-memory-disk hierarchy

A processor executes instructions in a pipeline: a series of steps, or stages, each stage performing a specific function. Example stages are fetch, decode, execute, memory access and writeback (Hennessy and Patterson, 2003). The processor's pipeline decodes instructions and fetches required data from the caches in time for execution. The number of stages in a pipeline is important because longer pipelines have more disruptive pipeline hazards. A hazard is a conflict in the pipeline that may lead to stalls and thus lower performance. Current processors attempt to predict application execution paths or branches, and are very capable in doing so. However, mispredicting the next branch still happens roughly 10 percent of the time. Branch misprediction is expensive enough that further reducing the misprediction penalty remains an active area of research (Sprangle and Carmean, 2002).

Processors typically realize their mistakes during the last quarter of the pipeline, so the longer the pipeline, the longer it takes to flush the pipeline and fix the problem. As a result, performance suffers. This explains why a lower clock-speed, shorter pipeline processor can outperform a higher clock-speed, longer pipeline one (Allbritton, 2002). It also demonstrates how workload characteristics impact performance. The processor's clock-speed determines the rate at which the pipeline supplies and the processor executes instructions. Processing interruptions occur when required data are not available. Data that are not available are first sought in the processor's caches and then in main memory. In the worst case the data have to be retrieved from disk. The processor-memory-disk hierarchy is illustrated in Figure 1.2, which includes the typical time scale for the processor to access each level in the hierarchy. If data must come from main memory, the data are said to have been missed in the caches. Cache misses can delay execution by causing the pipeline to stall until the missing data arrive.
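A rough way to see how these effects translate into performance is the textbook stall-cycle model sketched below. All parameter values are assumed for illustration; only the roughly 10 percent branch misprediction rate echoes the text above.

    # Effective CPI = base CPI + stall cycles per instruction (illustrative model).
    base_cpi            = 1.0    # ideal pipeline: one instruction per cycle
    branch_fraction     = 0.20   # fraction of instructions that are branches (assumed)
    mispredict_rate     = 0.10   # "roughly 10 percent" of branches mispredicted
    mispredict_penalty  = 15     # pipeline flush cost in cycles (assumed)
    misses_per_instr    = 0.01   # cache misses per instruction (assumed)
    miss_penalty_cycles = 200    # main-memory access cost in cycles (assumed)

    effective_cpi = (base_cpi
                     + branch_fraction * mispredict_rate * mispredict_penalty
                     + misses_per_instr * miss_penalty_cycles)

    print(f"effective CPI = {effective_cpi:.2f} (vs. {base_cpi:.2f} ideal) "
          f"-> {base_cpi / effective_cpi:.0%} of peak throughput")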


The following analogy illustrates the relative cost of accessing each level in the memory hierarchy (Hoogeveen, 2007):

Imagine sitting behind your laptop on a desk, somewhere in Amsterdam, the Netherlands. You are writing a document. While writing, most of the information you need is in your head; this is equivalent to execution on the processor only. In some cases you will need to consult your notes, which are next to you on your desk. The process of moving your attention away from the laptop, to your notes and back to the laptop (circa 10 seconds) is akin to accessing the L1 cache - it is fast and inexpensive. Unfortunately, your notes are limited. Sometimes a piece of information you need is not in the notes and you have to get up, turn around and consult a book on your bookshelf (one minute). The time needed represents an L2 access. If you discover that the required book is not on the shelf, you have to go to the local library. You get up again and drive to the local library, several kilometers away. The time needed to go to the local library and return home (about 15 minutes) represents accessing main memory. If at the library you discover that the required information is not there, you have the equivalent of retrieving data from disk. In this case you get up from your desk, walk to the library in Moscow (circa 2150 km), retrieve the information and walk back (circa 100 days!).

As we can see from the above analogy, efficient cache-miss handling is a significant contributor to processor performance for workloads with a high miss rate. Main memory latency is determined by the speed of DRAM (the actual memory), the system bus and the memory controllers. Similarly, disk latency is determined by the speed of the hard disk, the I/O interface and the system bus. Computer systems are designed to support the best achievable transfer of data between a processor, its caches, main memory, network interfaces, and disk. Large computer systems can support multiple processors, large memories, many network interfaces and disks. Component interactions in large computer systems increase design complexity (Hennessy and Patterson, 2003). Efficient branch prediction on a processor is a significant contributor to good workload performance. Correctly predicting the execution path of a program allows efficient pre-fetching of required data, thus reducing the miss rate on the correctly predicted branch in the program (Hennessy and Patterson, 2003). Another significant factor influencing miss rates is the organization of the cache. There are several organizations, ranging from fully associative, via n-way associative, to direct mapped. Associativity can be explained along the lines of the above analogy. We explain associativity for the L2 cache using the bookshelf:


Imagine that on the bookshelf all books whose titles start with the letter A compete for a single slot a. This is equivalent to a direct-mapped cache. Specific items must go to a specific location. Therefore, if we need another book starting with A, we remove the current book, return it to the library and fetch the next book. Obviously this is unattractive, and we would like to have the option of storing multiple A's. This is called an n-way associative cache, the n representing how many possible locations we can use for different A's. Naturally the size of the cache is still limited; being able to store multiple A's will still lead to the removal of other books. Fully associative means that any book can go in any location of the bookshelf.

The analogy above explains cache structure contributions to the miss rate. In direct-mapped caches the miss rate can be negatively influenced by evictions, i.e., two books vying for the same place on the shelf. The above example highlights the diversity of choices facing designers - not only must they decide on the size of the cache, they also have to design its organization. Direct mapped caches are easy and relatively inexpensive to implement, while associative caches are more involved to implement and can be more costly.
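In hardware terms, the bookshelf corresponds to the way a cache maps an address onto a set of candidate locations. The sketch below uses hypothetical cache parameters to show that two addresses exactly one cache size apart land in the same set; a direct-mapped (1-way) cache can keep only one of the two lines, while a 2-way set-associative cache can keep both.

    # Simplified cache index calculation (hypothetical 32 KB cache, 64-byte lines).
    CACHE_SIZE = 32 * 1024
    LINE_SIZE = 64

    def cache_set(address, ways):
        """Set index an address maps to, for a cache with the given associativity."""
        num_sets = CACHE_SIZE // (LINE_SIZE * ways)
        return (address // LINE_SIZE) % num_sets

    a = 0x10000           # two addresses exactly one cache size apart always share a set:
    b = a + CACHE_SIZE    # direct mapped keeps only one line, 2-way can keep both
    for ways in (1, 2):   # ways == 1 is a direct-mapped cache
        print(f"{ways}-way: set(a) = {cache_set(a, ways)}, set(b) = {cache_set(b, ways)}")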

Processor and system design oversights can result from the design complexity of processors and computer systems (Hennessy and Patterson, 2003). Design oversights can also result from the number of trade-off decisions, the impact of external constraints (like cost and time) and the inability to fully evaluate a design for all relevant workloads (Borkenhagen et al., 2000; Hartstein and Puzak, 2002). Design oversights usually have a negative performance impact and can be difficult to rectify in subsequent revisions of the processor or computer system designs.

1.4 Processor design cases

To establish the relevance of this work for processor design, we present three cases where mainstream processors did not arrive at the optimal design point for their target workloads. Both the Pentium™ 4 and UltraSPARC™ III processors suffered from design oversights when they were first released, while the Itanium™ processor was less suitable for common commercial workloads than intended.

1.4.1 Intel Pentium 4

Cataldo (2000) describes how the design of the Pentium 4 was intended to extend the Pentium III design into a higher clock-speed domain. This decision was made to leverage the marketing value of high clock-speeds. The main rationale behind pushing the clock-speed was the success of increasing performance through clock-speed increases on previous generations of the Pentium processor family.


The Pentium 4 is a classic example of marketing concerns driving technological development. Intel used a deep instruction pipeline to implement this goal, which reduced the amount of real work that the Pentium 4 could do per clock cycle, compared to other CPUs like the Pentium III and Athlon, but allowed it to scale to higher clock speeds (Allbritton, 2002). This soon prompted AMD's "Megahertz myth campaign".

The first version of the Pentium 4 was generally considered a poor performer relative to other processors in the marketplace. On a per-cycle basis the Pentium 4 performed less work than other processors, requiring higher clock-speeds to make up for the difference. The higher clock-speeds translated into higher power requirements. Comparative performance on different workloads showed that the performance of the Pentium 4 was highly uneven. Multi-media workloads generally performed very well, while graphics and floating point intensive workloads showed poor performance (Mihocka, 2000).

In Colwell (2005) the design of the Pentium 4 is evaluated and considered to be too complex. The issues with complexity are that it leads to errors in the design that can be expensive to fix (bugs); it leads to suboptimal trade-offs between multiple goals, since full evaluation is too expensive given the complexity; complex designs make follow-on designs very difficult; finally, complexity is cumulative in the sense that new designs inherit the complexity of older designs. This complexity inheritance is partly caused by the requirement from the marketplace that a new generation of processor and computer system should support most, if not all, of the features of the previous generation of processors. Breaking compatibility with previous generations forces users through potentially expensive and disruptive upgrade cycles. The end-users of computer systems prefer to have a faster version of the same chip.

One of the root causes of the poor performance of the Pentium 4 was the very deep pipeline. The initial design had a 20-stage pipeline, primarily to allow for higher clock-speeds. However, the price of the longer pipeline is the increased complexity and the increased cost of flushing the pipeline when execution errors (like mispredicted branches) occur. In retrospect, the Pentium 4 pushed the limits of Moore's Law to a point where the power consumption, performance and cost were no longer attractive. In the end, Intel reverted to the Pentium III design for their later products (Colwell, 2005).

We argue that the example of the Pentium 4 demonstrates how insufficient understanding of workload behavior can lead to unrealistic goals. We ask ourselves the question: if the designers had been given good workload characterization data, would they have been able to curb marketing's drive towards higher clock-speeds? We argue that good workload characterization data could have illustrated many of the processor's problems early in the design process.

1.4.2 Sun Microsystems UltraSPARC III


The compatibility goal was to provide a 90% increase in application program performance without requiring a recompile of the application. This performance and compatibility goal demanded a sizable micro-architecture performance increase while maintaining the programmer-visible characteristics from previous generations (Horel and Lauterbach, 1999). To reach the performance objectives for the processor, the designers evaluated increasing performance by aggressive instruction level parallelism (ILP). However, the performance increase obtainable with ILP varies greatly across a set of programs. Instead the processor designers opted to scale up the bandwidths of the processor while reducing the latencies. This decision was in part based on results from the SPEC CPU95 integer suite. In order to support the high clock-rate and performance goal, the UltraSPARC III (USIII) was given a 14-stage pipeline. The pipeline depth was identified early in the design process by analyzing several basic paths (Horel and Lauterbach, 1999). However, long pipelines have additional burdens when the pipeline stalls due to an unexpected event, i.e., a data cache miss. The USIII handles such an event by draining the pipeline and re-fetching the instructions - too many stalls obviously lower processor performance. The designers chose a direct mapped cache, i.e., a cache where each memory location maps to a single cache location. The advantage of direct mapped caches is that they are simple and fast. Direct mapped caches are considered inefficient because they are susceptible to mapping conflicts, i.e., multiple memory addresses are mapped to the same cache-line (Hennessy and Patterson, 2003). Furthermore, the original design for the L2 cache was optimized for a fast hit time at the expense of a higher miss rate. In addition, the memory management unit (MMU) was also optimized for fast lookup at the expense of a higher miss rate. Competing processors of that time all used associative caches, a cache structure that maps memory addresses to multiple cache locations - if one location is in use, another may be used. In commercial workloads, with many cache misses, these conflict misses are common. Thus, the USIII combined expensive data cache-miss handling with a cache architecture susceptible to misses. The USIII attempted to improve cycle time and execution speed for low miss-ratio code like SPEC CPU95. The lack of efficient cache-miss handling resulted in a performance deficit on important real workloads. Real workloads exhibit significant cache misses. These cache misses lead to pipeline stalls. The frequent misses common to commercial workloads lowered the performance of the UltraSPARC III processor on these workloads. In fact, the performance increase of the USIII relative to its predecessor was only slight, despite an almost 2× clock-speed increase (Koster, 2004). The lower performance on relevant real workloads placed the USIII at a disadvantage compared to other processors. When the USIII finally taped out, i.e., the first processor prototype was made, the extent of these oversights was discovered. The designers performed some hasty patch work in order to at least approach the performance targets. It seems likely that if the design team had taken real workload characteristics into account throughout the design process, it would have been clear that the SPEC CPU95 benchmark was inadequate in that role, while competing processors, like the IBM Power 4, correctly targeted the relevant workload characteristics.

1.4.3 Intel Itanium I

The Itanium processor, with its EPIC instruction set, was designed to take maximum advantage of instruction level parallelism in the execution path of applications. EPIC is an implementation of a Very Long Instruction Word (VLIW) architecture. The Itanium was designed as a general purpose processor. As is the case for all VLIW architectures, the Itanium designers relied on advances in compiler technology for compile-time optimization (Schlansker, 1999; Sharangpani and Arora, 2000; Gray et al., 2005). This is necessary since the width of the instruction stream makes it impossible for the hardware to optimize. This is in contrast to the RISC instruction set, which does attempt to optimize execution scheduling in the hardware. The quality of the compiler optimizations depends on how well the compiler can predict the most probable execution patterns at compile time (Gray et al., 2005). However, tests on real commercial applications have demonstrated that the Itanium is not ideally suited for the ad-hoc nature of commercial applications. Commercial applications can have numerous data dependent execution paths for which compile-time optimization is difficult. The performance impact of numerous data dependent execution paths is further exacerbated by current compiler limitations (Hennessy and Patterson, 2003). The Itanium design does excel on some applications that can be iteratively tuned for maximum performance using repeated execute-profile-optimize cycles (Shankland, 2005). While not a design oversight per se, this tuning requirement highlights the importance of using representative benchmarks in the design process. Representative benchmarks could have demonstrated the extreme degree of compiler support required to optimize data-dependent execution paths.


All three cases show how processors can drift away from the optimal design when they are not evaluated against a representative set of workloads.

1.5 Computer system design considerations

The first computer systems that supported multiple Itanium processors were designed with several processors on the same memory bus. Unfortunately, the memory bus did not provide the bandwidth required by the Itanium processors. The lack of memory bus bandwidth severely limited memory throughput and increased memory latency for each processor, thus lowering computer system performance (Zeichick, 2004; Shankland, 2005). Processor performance depends on the ability of the system enclosure to provide the data and instructions needed to sustain execution. If computer system bandwidth is a bottleneck, the processors have to wait until the data arrive. As noted in Kunkel et al. (2000), for commercial computer systems a proper balance between the performance requirements of the processors and the capabilities of the computer system is essential. Evaluating benchmarks representative of the bandwidth requirements of commercial applications could have provided quantitative data on bandwidth requirements. This quantitative data could have prevented the lack of memory bus bandwidth. Mistakes like these are expensive to rectify and are best detected during the design phase. The hierarchical model paradigm was specifically designed to prevent these issues from happening during the design cycle. We can only speculate at the underlying cause for this design oversight.

The interaction of processors and the computer system is of fundamental importance to application performance. While having sufficient processors is necessary to reach higher levels of performance, having more processors does not necessarily increase performance. Many workloads suffer from internal inefficiencies that are exacerbated by bottlenecks in the computer system. In these cases, solving computer system bottlenecks is more important than increasing processor performance. Gunther (1998) presents an example where application performance is negatively impacted by increasing the number of processors. The example explains that a database server can have a bottleneck in its I/O channel when reading or writing data to the disks. By adding more processors, the number of outstanding transactions increases, further stressing the I/O channel and thus exacerbating the performance problem. In this example the correct action to improve performance would have been to increase the capacity of the I/O channel. A commercial server should therefore have sufficient I/O capacity for the most demanding workloads at a reasonable cost. Insufficient bandwidth capacity will lead to execution bottlenecks and hence to poor performance.
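Gunther's example can be made concrete with a simple throughput bound, sketched below with assumed, illustrative numbers: once the I/O channel saturates, additional processors add queueing pressure rather than throughput.

    # Illustrative throughput bound for a database server with an I/O bottleneck.
    io_per_txn_mb   = 0.5      # I/O issued per transaction, in MB (assumed)
    io_channel_mb_s = 200.0    # I/O channel capacity in MB/s (assumed)
    cpu_txn_s_each  = 120.0    # transactions/s one processor can drive (assumed)

    io_limit = io_channel_mb_s / io_per_txn_mb   # max transactions/s the channel allows

    for cpus in (1, 2, 4, 8):
        cpu_limit = cpus * cpu_txn_s_each
        throughput = min(cpu_limit, io_limit)
        bound = "I/O bound" if cpu_limit > io_limit else "CPU bound"
        print(f"{cpus} CPUs: {throughput:6.0f} txn/s ({bound})")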


Computer system designers need to weigh the requirements of the workloads carefully to make optimal choices regarding the desired capacities of the computer system. This is the case for nearly all resources in the computer system: number of processors, memory capacity, internal bandwidth and I/O capacity.

This leads to the question - do real workloads provide a richer picture of computer system requirements than currently presented by standard benchmarks?

1.6 Workload and benchmark considerations

For commercial applications from, for example, SAP, PeopleSoft and Oracle, extensive sizing and capacity planning is required to predict the optimal machine configuration for a given customer workload (Cravotta, 2003). These sizing and capacity requirements are based on the computer system performance on standardized, application-specific benchmarks. Yet, even after sizing and capacity planning, customers regularly require a machine that has more processors, more memory and more I/O capacity than initially predicted by these benchmarks (Golfarelli and Saltarelli, 2003). This increase in requirements is often caused by differences in workload characteristics between the application in benchmarks and in real use. Real workload requirements frequently exceed the workload requirements of standardized application benchmarks. The differences between the requirements for application specific benchmarks and real customer workloads illustrate the value of information on real workload characteristics throughout the processor and computer system design process. Not taking into account the increased requirements of real applications can lead to computer systems that are designed with insufficient capacity (Kunkel et al., 2000). As noted in the previous section, insufficient capacity leads to poor performance.

In addition to real workload requirements exceeding the benchmark predictions, there is a difference in perspective. The application vendor is interested in achieving a high level of performance on computer systems that are relatively inexpensive. The application vendor therefore has an incentive to make the common case fast and efficient. The end-user's interest is to support their business by using the application, not necessarily to run the common case fast. In the end-user (customer) environment, decisions on application characteristics are not based on what will perform well but rather on what supports the business. The business requirements guide the use and modification of an application. As a result, the same application deployed at different businesses may have very different workload characteristics. This property of real workloads is hard to include in a design process that relies exclusively on standardized benchmarks for guiding design decisions. Supplying additional information on real workloads helps designers understand the breadth of application specific workload characteristics.

A side-effect of real workloads is described as "software rot" (Brooks, 1995). Application software undergoes a steady stream of changes early in the design cycle. As the software ages, the development pace slows down until it stops prior to the next release. Software "rot" comes from the compound effects of these changes on the application. Users adapt to most application bugs by developing a workaround. The combined effect of software changes and user workarounds is to push the workload characteristics away from what was benchmarked for the initial sizing. While some application changes improve performance, others will not. This combination of application adaptation and evolution creates a much greater diversity than can be covered by standardized benchmarks. Designers require quantitative feedback regarding these extended usage parameters when evaluating trade-off decisions. In many cases of software "rot", the application is too old to warrant any additional development. It is in the customer's best interest to keep the application running as fast as possible. The guidance to computer system and processor designers from these applications is to make sure that bad/old software executes well on new computer systems. Improving performance for poor or old software is more beneficial to users than requiring application recompilation.

1.7 Increasing complexity and increasing diversity

There are no indications that processors and computer systems are getting less complex (Colwell, 2005; McNairy and Bhatia, 2005; Ranganathan and Jouppi, 2005; Kongetira et al., 2005; Spracklen and Abraham, 2005). Single thread performance on a processor has reached the point of diminishing returns. Further increasing single thread performance requires a much greater investment of time and resources than is warranted by the expected gains (Colwell, 2005). The transition from single thread, single core processors to multi-threaded, multi-core processors reflects the difficulty of further increasing single thread performance. Including more execution cores on a processor, each with several concurrent hardware threads, allows processor designers to achieve higher instruction rates by performing more work in parallel (Nayfeh et al., 1996; Olukotun et al., 1996; Kongetira et al., 2005). This change, from single core/single thread to multi core/multi thread processor designs, further increases the simulation burden in time and effort. The simulation time is strongly related to the total number of instructions that have to be simulated.


Computer system users increasingly dedicate cheap computer resources to specific applications like web-servers. At the same time they attempt to improve their efficiency by using virtualization and software partitioning to host multiple applications on a single computer system. The combined load of the shared applications will push computer system utilization higher than would be the case if each application was on dedicated hardware. Virtualization software hides each application from the others and distributes the available computer resources between them. For computer system and processor designers virtualization introduces additional complexities since traditional workloads no longer exist. Designers require feedback on the workload characteristics of these combined, virtualized workloads even though each combined workload may be unique.

In an attempt to reduce simulation cost, work has been done to better understand the composition of standardized benchmark suites like SPEC and TPC. The goal is to remove redundant characteristics from the simulation set, thus optimizing the information gain per simulation (Vandierendonck and De Bosschere, 2004b; Phansalkar et al., 2005b; Eeckhout et al., 2005b). This increases the efficiency of simulating these standardized benchmark suites, but does not assist in identifying changes in real workload behavior and related trends. The ability to reduce standardized benchmarks into a subset that not only represents all important workload characteristics but is also efficient to simulate is valuable in the design process. Of equal value is quantitative feedback on the diversity of real workloads and the coverage of relevant real workload characteristics in that reduced set of benchmarks. Over-simplification is the risk introduced by the standardized benchmark sets, further compounded by the effort to reduce them. One single benchmark is unable to capture the richness of real workloads. Reducing the diversity of benchmarks in the design process can increase the risk of unwanted surprises.

1.8 Representing real workload characteristics in the design process

Computer system performance is the most important distinguishing attribute designers attempt to optimize. Computer system and processor designers therefore require accurate workload characterization data to help them make trade-off decisions that improve computer system performance on relevant real workloads. Benchmarks commonly provide the required workload characterization data, since benchmarks are generally efficient to use and easy to control.

Currently, only a limited set of computer system independent benchmarks is used to compare computer system performance and to design new computer systems. However, many important characteristics of real workloads required for future computer system design might not be adequately represented in this small set of commonly used benchmarks. Identifying important workload characteristics as well as quantifying the representativeness of benchmarks therefore remains an issue. Quoting Skadron et al. (2003): "... the R&D community does not have a system for identifying important [workloads ... to determine] what portion of the total behavior space each benchmark really represents".

Without information on the representativeness of benchmarks and the relevance of the workload characteristics they represent, designers have no insight into the needs and requirements of real workloads.

Benchmarks used for processor and computer system design should thus be chosen based on how well they represent relevant workloads. If those benchmarks do not exist, candidate benchmarks should be found among these relevant workloads. Finding and defining these benchmarks requires that we can quantitatively determine representativeness. How are benchmarks or workloads representative of other workloads? They can be representative in two ways: (i) they accurately predict application performance, (ii) they accurately mimic application characteristics (Ferrari, 1978; KleinOsowski et al., 2001; Eeckhout et al., 2002, 2005b).
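A minimal sketch of the second notion, mimicking application characteristics, is to treat each workload as a point in a normalized metric space and compare distances. The metrics and values below are hypothetical; the methodology developed in later chapters adds dimensionality reduction and clustering on top of this basic idea.

    import math

    # Hypothetical characterization vectors: (IPC, L2 miss rate, memory bandwidth GB/s).
    workloads = {
        "benchmark_A": (1.2, 0.004, 2.0),
        "database_X":  (0.6, 0.020, 6.5),
        "webserver_Y": (0.9, 0.011, 3.8),
    }

    def normalize(vectors):
        """Scale every metric to [0, 1] so that no single metric dominates the distance."""
        columns = list(zip(*vectors.values()))
        lows = [min(col) for col in columns]
        highs = [max(col) for col in columns]
        return {name: tuple((v - lo) / (hi - lo) for v, lo, hi in zip(vec, lows, highs))
                for name, vec in vectors.items()}

    def distance(u, v):
        """Euclidean distance between two normalized characterization vectors."""
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

    normalized = normalize(workloads)
    reference = normalized["benchmark_A"]
    for name, vec in normalized.items():
        print(f"distance(benchmark_A, {name}) = {distance(reference, vec):.2f}")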

Representativeness of benchmarks relative to each other and to workloads is therefore the key issue. Designers need to keep track of real world characteristics and evaluate their impact on computer system design, yet they are limited by the impossibility of performing detailed, cycle-accurate simulations for more than a handful of benchmarks. The main problem area of this work can be summarized as:

How can we identify important workload characteristics and find or define the benchmarks that represent them?

In Section 1.4 the requirement of backwards compatibility for processors was mentioned. Consequently, the design of a new processor carries the burden of compatibility with the previous generation. This means that the selection of representative workloads for the design process can be made specific to a processor micro-architecture. Workload characterization data collected on current processors provide the information required to improve the next processor generation.

Representing real workload data in the design process is essential since it provides insight into workload characteristics not provided by benchmarks. However, real workloads are not practical for the detailed analysis required in the processor and computer system design process. Real workloads lack the compactness and controllability associated with benchmarks (Kunkel et al., 2000). Therefore designers are faced with a dilemma: there is clear value in understanding the workload characteristics of a broad set of real workloads, yet the cost in time and resources required to characterize all these workloads sufficiently makes their inclusion economically unattractive. We propose that the best way of including these relevant workload characteristics is to find or define a minimal set of workloads which represent all characteristics relevant for the specific processor micro-architecture and computer system design.


[Fig. 1.3: Overview of benchmark set creation — the real workload space, representative workload space, standard benchmark space and optimal reduced benchmark space, connected by candidate workload selection, standard benchmark selection and benchmark similarity analysis; the spaces are characterized by computer system metrics and simulation metrics.]

Figure 1.3 relates the real workload space to the benchmarks in the standard benchmark space. The representative workload space captures all important workload characteristics of the real workload space and provides the candidates for the standard benchmark space. Currently the steps from the standard benchmark space to the processor and system design space are reasonably well understood. Most of the simulation-based results available in the literature target these steps, e.g., Eeckhout et al. (2003b) and Phansalkar et al. (2005b). Much less studied and understood, and the focus of this work, is the selection of representative workloads from the real workload space to span the representative workload space. The representative workload space must reflect all significant workload characteristics identifiable in the real workload space.

1.9 Research questions

Computer system and processor designers need to quantify two issues related to relevant workloads and the benchmarks used to represent them. The first issue is to determine what the important workload characteristics and requirements are. Workload characteristics are, for example, the required memory size, the recommended cache size, etc. Characteristics can be viewed as common between classes of workloads, i.e., they can be defined in advance. Workload requirements are particular to a workload; they depend on the specific, processor-dependent implementation of the workload as well as on the nature of its work. These workload requirements must be addressed during computer system design, which requires workload characterization to evaluate their impact on computer system and processor design. For example, if studies find that a majority of the workloads would benefit from larger caches, then the designers could increase the caches at the expense of other optimizations. These design trade-offs can only be evaluated in simulation, using workloads that are representative of these requirements. Information from real workloads is needed to determine the relative importance of these requirements and to select the representative workloads used for simulation.


We believe that real workloads are the key to representative workload selection. We must find an approach that allows us to characterize real workloads and use that characterization to guide the selection of representative workloads. Since real workloads are tied to specific processor micro-architectures, we limit our approach to a specific micro-architecture. Choosing a single micro-architecture is reasonable since it reflects the reality faced by IBM (PowerPC), Intel (X86, X64) and Sun Microsystems (SPARC), each with their own specific micro-architecture implementation.

The approach thus requires the ability to select representative workloads from a set of measured workloads. This requires that the approach supports a measure of representativeness for workloads on computer systems of the same micro-architecture. The overarching goal is to select a set of workloads that represent the workload space for that specific processor architecture. Yet given the constraints in time and resources, we would like this set to be as small as possible. This leads us to the primary research question:

Research Question 1 How do we find the smallest set of workloads representative of a specific micro-architecture workload space?

While the ability to select such a workload set is valuable in itself, the selection process must also be practical and therefore efficient in use. We introduce a second research question to address this practicality requirement.

Research Question 2 How can we efficiently find a smallest workload set representative of hundreds or thousands of collected workloads?

Efficiency is necessary because a broad characterization of the workload space, combined with the market value of portions thereof, should establish the approach as an important tool in the design phase of computer systems and processors.
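As an illustration of what an answer to these two questions could look like, the sketch below applies a simple greedy farthest-point selection to synthetic workload-characteristic vectors: it keeps adding the workload that is least well covered by the representatives selected so far, until every workload lies within a chosen distance of some representative. Each step is linear in the number of workloads, so the procedure scales to thousands of measured workloads. The data, metric count and coverage radius are hypothetical, and this sketch is not the approach constructed in the later chapters; it merely makes the questions concrete.

# Illustrative greedy selection of representative workloads from a large set of
# measured workload-characteristic vectors (synthetic data, hypothetical metrics).
import numpy as np

rng = np.random.default_rng(0)
n_workloads, n_metrics = 2000, 12
# Pretend these are normalized characterization metrics for 2000 workloads.
workloads = rng.normal(size=(n_workloads, n_metrics))

def select_representatives(points: np.ndarray, coverage_radius: float) -> list[int]:
    """Greedy farthest-point selection: add the worst-covered workload until
    every workload is within coverage_radius of a selected representative."""
    # Start from the workload closest to the overall mean.
    start = int(np.argmin(np.linalg.norm(points - points.mean(axis=0), axis=1)))
    reps = [start]
    # Distance from each workload to its nearest representative so far.
    nearest = np.linalg.norm(points - points[start], axis=1)
    while nearest.max() > coverage_radius:
        new_rep = int(np.argmax(nearest))  # the least-covered workload
        reps.append(new_rep)
        nearest = np.minimum(nearest, np.linalg.norm(points - points[new_rep], axis=1))
    return reps

representatives = select_representatives(workloads, coverage_radius=4.0)
print(f"{len(representatives)} representatives cover all {n_workloads} workloads")

In practice the coverage radius, or equivalently the size of the representative set, would be tied to the simulation effort the design team can afford.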

1.10 Research approach

Scientific inquiry can be thought of as a particular process or strategy in which a set of research instruments is employed, guided by researchers using an underlying research philosophy. The following sections discuss the philosophy, strategy and instruments applied in the pursuit of the research objectives.

1.10.1 Philosophy


There are two major philosophies, or “schools of thought”, in the Western tradition of science. These philosophies are the “hard” positivist research tradition and the “soft” interpretivist research tradition (Hirschheim, 1992). Positivists believe that reality is stable and can be observed and described from an objective viewpoint, i.e., without interfering with the phenomena being studied. Positivists contend that phenomena should be isolated and that observations should be repeatable. Thus predictions can be made on the basis of previous observations and their explanations. Positivism has a particularly strong and successful association with the physical and natural sciences and concentrates on laboratory experiments, field experiments and surveys as its primary research instruments (Hirschheim, 1992; Galliers, 1994). In contrast, interpretivists contend that only through the subjective interpretation of and intervention in reality can that reality be fully understood. The study of phenomena in their natural environment is key to the interpretivist philosophy, together with the acknowledgement that scientists cannot avoid affecting those phenomena under study. Interpretivists understand that there may be many interpretations of reality, but consider these interpretations a part of the scientific knowledge they are pursuing. Interpretivist researchers use “soft” research instruments such as reviews, action research and forecasting (Hirschheim, 1992; Galliers, 1994).

It has often been observed that no single research methodology is intrinsically better than any other, though some institutions seem to favor certain methodologies above others (Galliers and Land, 1987). Such favoritism conflicts with the principle that a research philosophy should be based on the research objective rather than on the research topic (March and Smith, 1995).

This thesis reflects a pragmatic approach to an engineering question: how to include empirical evidence in the design process to improve computer systems. We believe that we can always make a better computer even though the best computer might never exist. Similarly, we believe that the inclusion of measurements from different workloads will bring us closer to what is happening in reality. These beliefs are post-positivistic. Post-positivism is a branch of positivism; its most common form is critical realism, i.e., the belief in a reality independent of our thinking that science can study. Post-positivism uses experimental methods and quantitative measures to test hypothetical generalizations.

1.10.2 Strategy

Consistent with our post-positivistic research philosophy is the deductive-hypothetic research strategy or scientific method, illustrated in Figure 1.4.


[Fig. 1.4: Deductive-hypothetic research strategy — define the question, gather information and resources, form hypothesis, prescriptive model, perform experiment and collect data, analyze data, interpret data and draw conclusions, publish results.]

1.10.3 Instruments

According to Galliers (1994), research instruments in quantitative research are laboratory experiments, field experiments, surveys, case studies, theorem proof, forecasting and simulation. To establish the problem statement and derive the solution requirements, we survey the literature and draw on personal experience. After formulation of the research hypotheses, we postulate a solution for each hypothesis in the form of a prescriptive model. This prescriptive model makes falsifiable predictions. To verify these predictions we use field data, laboratory experiments or simulation. Data analysis allows conclusions that validate or reject the hypotheses.

1.11 Research outline



[Figure: Research outline — the deductive-hypothetic research strategy mapped onto the thesis chapters: define the question (Chapter 1, The importance of workloads in computer system design); gather information and resources (Chapter 2, Current approaches for workload selection in processor and computer system design); form hypothesis (Chapter 3, Towards an unbiased approach for selecting representative workloads); prescriptive model (Chapter 4, Constructing the approach); early test (Chapter 5, Testing the methodology on a benchmark set); collect data (Chapter 6, Collecting and analyzing workload characterization data); analyze data (Chapter 7, Grouping together similar workloads); interpret data and formulate conclusions (Chapter 8, Finding representative workloads in the measured dataset); present conclusions (Chapter 9, Evaluating workload similarity).]



Part I


2. CURRENT APPROACHES FOR WORKLOAD SELECTION IN PROCESSOR AND COMPUTER SYSTEM DESIGN

ABSTRACT

Workload selection for computer system design requires understanding technological developments, marketplace requirements and customer workloads. Ideally, computer system designers have a benchmark set that is representative of their customers’ behavior. The value of standardized computer system performance evaluation, provided by e.g., SPEC CPU2000 and TPC-C, is limited by their purpose as platform independent performance [...]


[Figure: Position of Chapter 2, Current approaches for workload selection in processor and computer system design, within the deductive-hypothetic research strategy (gather information and resources).]
