• Nie Znaleziono Wyników

Computational Performance

N/A
N/A
Protected

Academic year: 2021

Share "Computational Performance"

Copied!
18
0
0

Pełen tekst

(1)

Analysis and modeling of

Computational Performance

(2)

Computational Performance

Performance (efficiency) is (besides correctness, reliability, security, maintainability, user friendliness, etc.) one of the most important software qualities

Performance, as is understood in the current lectures, has its main related parameter: time-to-solution (execution time)

guideline: performance = 1/time-to-solution

Analysis of computational performance is concerned with elements that influence the time of program execution

Performance modeling tries to express the execution time in terms of mathematical formulas, using a set of theoretically or experimentally obtained parameters

Performance optimization finds ways to improve the

computational performance of programs and minimize its

execution time

(3)

Computational Performance

In different application areas execution time depends on many different factors:

time for performing operations by CPUs

time for accessing data in DRAM memory

time for sending data over network

time for accessing disk drives, SSDs, etc.

time for performing transactions with databases

time for displaying images and graphics primitives

time for creating and displaying video frames

In the current lectures we are concerned with programs for which execution time depends on the three first

factors above

(4)

Computational Performance

Current lectures:

simple programs in C

• micro-benchmarks for individual system components and simple operations

• operations on vectors and matrices – numerical linear algebra

hardware-software interaction

• assembly code

benchmarks

optimization

• classical – manual and automatic by compilers

• parallel

multithreading (CPU, GPU)

message passing

(5)

Performance tools

Execution time:

wall clock time, elapsed time, real time – external time measure, the most important for software users

CPU time – time when CPU was executing program instructions

• user time – time in user mode

• system time – time in kernel mode

Tools for measuring wall clock and CPU time

• wrist watches, stopwatches

• top utility, system monitors

• time utility in Linux

• profilers: gprof, valgrind

• hardware counters

• special performance analysis applications

Intel VTune, Advisor, NVIDIA Visual Profiler, AMD uProf

(6)

Performance tools

Intel Vtune – a complex performance analysis tool

(7)

Performance tools

Profiling

collecting performance related data concerning a given program

• the main usage of profiling is to give the time spent in different parts of the code

subroutines (functions)

blocks of code

individual lines of cod

e

• profiling can also report other events during program execution, that can be e.g. used to create:

call graph

instruction and subroutine (function) number of executions

profiling information can be stored and communicated in different ways

• summary information

• traces

• on-line monitoring

(8)

Performance tools - tracers

A typical output of a popular Vampir tool for MPI tracing

(9)

Performance tools

Profiling

profilers can collect data using different mechanisms:

• instrumentation (gprof)

inserting additional code to report the events related to execution and state of the program (e.g. call stack)

instrumentation requires special compilation

• execution simulation (valgrind)

execution of the program using a special virtual machine

simulation incurs significant overhead

• statistical sampling (gprof)

program execution is interrupted at specified time intervals and the state of the execution environment is stored (e.g. call stack)

• event notification

for environments (virtual machines) equipped with suitable capabilities

(10)

Performance tools - gprof

Steps for

gprof

profiling (using

gcc

compiler):

compilation with instrumentation

$ gcc -p source_file.c

standard execution (

gmon.out

file created)

$ a.out

displaying results (binary file as argument, not

gmon.out

)

$ gprof a.out

part of typical output:

Flat profile:

Each sample counts as 0.01 seconds.

% cumulative self self total

time seconds seconds calls ms/call ms/call name 45.72 0.48 0.48 31241 0.02 0.02 fun_1 20.96 0.70 0.22 10 22.00 22.00 fun_2 15.24 0.86 0.16 31241 0.01 0.01 fun_3 ...

the output can be redirected to a file (

$ gprof a.out > file.txt

)

(11)

Hardware counters

Hardware counters (performance monitoring counters) are special registers for storing the numbers of hardware events related to performance

hardware counters are specific for each processor architecture

hardware counters are mainly used to support the design and testing of new architectures, as well as fine tuning of compilers and system software

• hardware events can be very detailed, reflecting the complex nature of contemporary processors

example: "IDQ_UOPS_NOT_DELIVERED.CORE - Counts the number of uops not delivered to Resource Allocation Table (RAT) per thread adding “4 – x” when Resource Allocation Table (RAT) is not stalled and Instruction Decode Queue (IDQ) delivers x uops to Resource Allocation Table (RAT) (where x belongs to {0,1,2,3})

there are hundreds of hardware events that can be reported by

hardware counters

(12)

Hardware counters

The most important events are related to:

time measurements – clock cycles counters

instructions executed – especially branches and flops

cache and memory access related events – especially cache hits and misses

There are several applications that provide the interface to hardware counters for different processors and

programming environments

the basic one for Linux, for recent kernels, is perf utility (evolved from Performance Counters for Linux), based on

perf Linux subsystem and kernel support

other popular for Linux:

• o'profile

• Performance Application Programming Interface (PAPI) – used during our course

(13)

Performance tools – perf

Standard usage of

perf stat

:

$ perf stat a.out

Typical output:

Performance counter stats for 'a.out':

0,649995 task-clock (msec) # 0,697 CPUs utilized 21 context-switches # 0,032 M/sec 0 cpu-migrations # 0,000 K/sec 294 page-faults # 0,452 M/sec 2443055 cycles # 3,759 GHz 2486027 instructions # 1,02 insn per cycle 490849 branches # 755,158 M/sec 14307 branch-misses # 2,91% of all branches ...

more details can be obtained with options

$ perf stat -d -d a.out

(14)

Optimizing compilers

Usually in order to get maximal performance for the code on a given hardware sophisticated optimizing compilers have to be used

Optimization is performed by compilers usually after syntax analysis and before object code generation

some options, e.g. parallelization, can be realized in a preprocessing stage by suitable compiler modules

Optimization operates on some intermediate form of the code that usually utilizes:

registers

basic blocks

Basic block is a fundamental portion of the code for which optimization is performed

basic block is a sequence of instructions having the property, that if one of them is executed than all of them are executed

• it is impossible to jump out of a basic block (jumps end a block)

• it is impossible to jump into the block (targets of jumps are beginnings of basic blocks)

(15)

Optimizing compilers

Source code:

while( j < n ) { k = k + 2j;

m = 2j;

j++;

}

Compiler produced assembler

.L2 movl -4(%ebp), %eax // j -> eax cmpl -12(%ebp), %eax // n <> eax ? jl .L4

jmp .L3

.L4 movl -4(%ebp), %eax // j-> eax movl %eax, %edx // j->edx

leal 0(,%edx,2), %eax // eax=2*edx

addl %eax, -8(%ebp) // k+=eax (k+=2*j) movl -4(%ebp), %eax // j->eax

movl %eax, %edx // j->edx

leal 0(,%edx,2), %eax // eax=2*edx

movl %eax, -16(%ebp) // m=eax (m=2*j)

incl -4(%ebp) // j++

jmp .L2 .L3

Intermediate code:

A: t1 := j;

t2 := n

t3 := t1 < t2 jmp (B) t3 jmp (C) B: t4 := k

t5 := j

t6 := t5 * 2 t7 := t4 + t6 k := t7

t8 := j

t9 := t8 * 2 m := t9 ....

jmp (A) C: ...

(16)

Optimizing compilers

Before optimization:

.L2 movl -4(%ebp), %eax // j -> eax cmpl -12(%ebp), %eax // n <> eax ? jl .L4

jmp .L3

.L4 movl -4(%ebp), %eax// j-> eax movl %eax, %edx // j->edx

leal 0(,%edx,2), %eax // eax=2*edx addl %eax, -8(%ebp) // k += eax movl -4(%ebp), %eax// j->eax

movl %eax, %edx // j->edx

leal 0(,%edx,2), %eax // eax=2*edx movl %eax, -16(%ebp) // m=eax

incl -4(%ebp) // j++

jmp .L2 .L3

Compiler optimized version 1:

.L4

leal (%edx, %eax, 2), %edx // edx+=2*eax leal 0(,%eax,2), %ecx // ecx=2*eax

incl %eax // eax+=1

cmpl %ebx, %eax // n<>eax ? jl .L4

Compiler optimized version 2 (with IVS):

.L4:

addl $1, %ecx // j++

addl %eax, %edx // k+=m

addl $2, %eax // m+=2

cmpl %r8d, %ecx // n<>j ? jne .L4

(17)

Optimizing compilers

Contemporary compilers can have dozens of optimization options

examples (for gcc):

• -fstrength-reduce, -fcse-follow-jumps, -ffast-math, -funroll-loops, -fschedule-insns, -finline-functions, -fomit-frame-pointer

important optimizations concern parallelization and vectorization

• often in order to use particular optimizations for a given hardware

(concerning e.g. vectorization) special options have to be passed explicitly to the compiler – e.g. -march=core-avx2 – for cores with AVX2 instructions

• often directives in source code help compilers to optimize

In practice, most often compiler optimization is applied using options for optimization levels

typical levels and performed optimizations are:

• -O0 – no optimization

• -O1 – optimize for execution time and code size

• -O2 – more optimization options applied, without sacrificing too much time and going into options that can alter the results of code execution

• -O3 – the most aggressive optimization

• (some compilers can have more levels, e.g. for vectorization, parallelization)

(18)

„Numbers every programmer should know”

Examples:

L1 cache reference 1 ns

Branch mispredict 5 ns

L2 cache reference 5 ns

Mutex lock/unlock 25 ns

Main memory reference 100 ns

Send 4K bytes over 10 Gbps network 10,000 ns

Transfer 1MB to/from PCI-E GPU 80,000ns

Round trip within same datacenter 500,000 ns

Read 1 MB sequentially from SATA SSD 2,000,000 ns

Read 1 MB sequentially from disk 5,000,000 ns

Read 1 MB sequentially from disk 30,000,000 ns

Send packet CA->Netherlands->CA 150,000,000 ns

Current list:

https://gist.github.com/eshelman/343a1c46cb3fba142c1afdcdeec17646

Cytaty

Powiązane dokumenty

➔ The optimization should concern parts of the code most important from the performance point of view.  &#34;hot spots&#34; can be identified through algorithm and source code

If you want to change the language of the user interface and Help as you start the application, enable the Ask me the next time the software starts check box.. Restart

If the WMA price hits or crosses the barrier, then the option contract is automatically exercised and the owner of option contract gets immediately K − B in case of a put and B − K

When we resign from the assumption on the lack of transaction costs at time 0, we may define the seller’s price as a minimal amount of money either in bonds or in shares of stock

abstract: The talk will outline several results concerning set-theoretic problems related to the existence of localizations in various categories. It will be addressed

With the proliferation of the various forms of higher education, especially the setting up of numerous private schools, and a very big increase in the number of students, scien-

Section 3 contains the core of the paper: the transformation of the equation by integration by parts, the collocation method on the resulting weakly singular Volterra equation, and

Przeprowadzona w tym rozdziale analiza dotyczy kszta átowania siĊ ceny dra- binowej opcji sprzeda Īy oraz wpáywu na cenĊ rozpatrywanej opcji nastĊpujących czynników: ceny