Counterexample-Driven Genetic Programming

(1)

Counterexample-Driven Genetic Programming

Iwo Błądek, Krzysztof Krawiec

Institute of Computing Science, Poznań University of Technology

Poznań, 12.12.2017

(2)

Plan of the Presentation

1 Introduction

2 Counterexample-Driven GP (part 1)

3 Satisfiability Modulo Theories (SMT)

4 SMT-based verification

5 _{Counterexample-Driven GP (part 2)}

(3)

Program Synthesis

Aim of program synthesis is to find a program given its desired behavior.

x2+ x − 2? a ∨ b ∧ c?

bool isGreater(int x, int y) {

if (x > y ) return true; else return false; }

?

false x = 1, y = 5 true x = 5, y = 1 1 a = 1, b = 0, c = 0 0 a = 0, b = 0, c = 0 0 x = 1 −2 x = 0

(4)

Program Specification

The desired behavior of a program is described by its specification. The

most commonly used types of specifications are:

Input-output examples (test cases)

x y max(x,y)

0 0 0

1 0 1

3 4 4

· · · ·

Formal specification (logical relations between input and output)

∀_{x ,y} max (x , y ) ≥ x ∧

(5)

Program Specification

Input-output examples are a special case of formal specification:

∀x ,y ([(x = 0 ∧ y = 0) =⇒ max (x , y ) = 0] ∧ [(x = 1 ∧ y = 0) =⇒ max (x , y ) = 1] ∧ [(x = 4 ∧ y = 3) =⇒ max (x , y ) = 4] ∧ · · · ) Or simpler: max (0, 0) = 0 ∧ max (1, 0) = 1 ∧ max (3, 4) = 4 ∧ · · ·

(6)

Provably-correct programs

To prove that program is correct for all inputs, a formal specification needs to be used (or all possible input-output examples enumerated) In certain applications hoping for algorithm’s generalization is not enough:

Safety-critical systems (!) Software development Hardware design

(7)

Plan of the Presentation

1 Introduction

6 Experiments

(8)

Counterexample-Driven GP

First described in: K.Krawiec, I.Błądek, J.Swan,

“Counterexample-Driven Genetic Programming”, Proceedings of the

Genetic and Evolutionary Computation Conference. GECCO ’17. Berlin, Germany. [1]

(Best Paper Award in the Genetic Programming track)

(9)

Counterexample-Driven GP

Goal:

Automatic synthesis of programs which are proved to correctly realize a formally defined task

Contributions:

GP algorithm capable of synthesizing such programs from formal specifications (spec for short)

Systematization of spec properties and their consequences for program evaluation

https://github.com/kkrawiec/CDGP

(10)

Counterexample-Driven GP

Main ideas:

Verify programs generated by GP to guarantee their correctness Incrementally supply GP with test cases created from

counterexamples returned by failed verifications

Testing Verification All passed? T_c GP search Fitness Program Counterexample

(11)

Plan of the Presentation

1 Introduction

6 Experiments

(12)

Satisfiability Problem (SAT)

Question: Is a formula in the propositional calculus satisfiable?

Examples:

¬a ∨ b

SAT: a = false, b = true

a ∧ ¬a ∧ b

(13)

Satisfiability Modulo Theories (SMT)

Question: Is a formula in the first-order logic satisfiable under the

background theory T , which defines semantics of a certain set of symbols?

Examples:

Logic: QF_LIA (Quantifier-Free Linear Integer Arithmetic)

x , y , z ∈ Z a ∈ {false, true} (10 · x = 20) ∧ a SAT: x = 2, a = true (x < y ) ∧ (y < z) ∧ (z < x ) UNSAT (x ≤ y ) ∧ (y ≤ z) ∧ (z ≤ x ) SAT: x = 0, y = 0, z = 0

(14)

Satisfiability Modulo Theories (SMT)

Examples:

Logic: NIA (Non-Linear Integer Arithmetic)

x , y ∈ Z x2+ 1 ≤ 2 · x SAT: x = 1

∀x ,y (x + y )2> x2+ y2

(15)

Satisfiability Modulo Theories (SMT)

Examples:

Logic: QF S (Quantifier-Free Strings)

x , y ∈ String

str.len(x ++ y ) 6= str.len(x ) + str.len(y ) UNSAT

str.substr(x , 0, 2) = str.substr(x , 2, 2) ∧ str.at(x , 0) 6= str.at(x , 1)

SAT: x = "abab"

(16)

SMT Solvers

SMT Solver – any software that can check satisfiability of formulas modulo some given theory.

There are several freely accessible SMT solvers: CVC4 (open source)

MathSAT (free for non-commercial use) Z3, by Microsoft Research (open source)

Most SMT solvers accept queries in SMT-LIB language

(http://smtlib.cs.uiowa.edu/), which was created to standardize interaction with different solvers.

(17)

Short SMT-LIB course

(set-logic LIA)– sets theory to Ints and limits considerations to

only linear integer arithmetic

(declare-fun x () Int) – declares a free variable x of type Int

(define-fun fun ((a Int)(b Int)) Int (+ a b))– defines a

function for later use

(assert (= (fun x 2) 0)) – adds an assertion (constraint)

(check-sat)– a command instructing solver to check satisfiability

(18)

Short SMT-LIB course

(19)

Short SMT-LIB course

(20)

Short SMT-LIB course

(21)

Short SMT-LIB course

(22)

Plan of the Presentation

1 Introduction

(23)

SMT-based program verification

To check program correctness we need to prove: ∀in Pre(in) =⇒ Post(in, p(in))

which can be posed as disproving:

∃_in Pre(in) 6=⇒ Post(in, p(in)).

p – a program

in – inputs to the program

Pre(in) –precondition. Behavior of the program is defined only for inputs that satisfy Pre(in).

Post(in, out) –postcondition. Describes the expected behavior of the program.

(24)

An example of SMT verification

Task:

Synthesize a program max (x , y ) returning a maximum of two numbers.

max (x , y ) = ( x , if x ≥ y . y , otherwise. Formal specification: max (x , y ) ≥ x ∧ max (x , y ) ≥ y ∧ (max (x , y ) = x ∨ max (x , y ) = y )

(25)

An example of SMT verification

Incorrect program: if x < y: res = x else: res = y

Program encoded as SMT formulas:

; PROGRAM (=> (< x y) (and (= res x) (= |res’’| res))) (=> (not (< x y)) (and (= |res’| y) (= |res’’| |res’|)))

(26)

An example of SMT verification

Verification constraints in SMT-LIB for the incorrect program:

; PRECONDITION (assert true) ; PROGRAM

(assert (=> (< x y)

(and (= res x) (= |res’’| res)))) (assert (=> (not (< x y))

(and (= |res’| y) (= |res’’| |res’|)))) ; POSTCONDITION

(assert (not (and (>= |res’’| x) (>= |res’’| y) (or (= |res’’| x)

(27)

An example of SMT verification

Incorrect program: if x < y: res = x else: res = y Solver result: SAT x = -1 y = 0

This answer means that the program is not correct and solver provides us a counterexample.

(28)

An example of SMT verification

Correct program: if x > y: res = x else: res = y Solver result: UNSAT

This answer means that the program is correct with respect to the specification. No counterexample was found.

(29)

Plan of the Presentation

1 Introduction

6 Experiments

(30)

Counterexample-Driven GP

Main ideas:

Verify programs generated by GP to guarantee their correctness Incrementally supply GP with test cases created from

counterexamples returned by failed verifications

Testing Verification All passed? T_c GP search Fitness Program Counterexample

(31)

Program representation

Programs in SMT-LIB Language Purely functional

Represented in GP as traditional program trees

Examples:

x

(mod x 2)

(* 2 (ite (>= x y) 2 (+ x y)))

(32)

Search operators

Mutation (50%) / Crossover (50%) Only well-typed programs are generated

Examples: Mutation: (ite (>= x y) 2 (+ x y)) ↓ (ite (< y 0) 2 (+ x y)) Crossover: (mod x 2) (ite (>= x y) 2 (+ x y)) ↓ (mod (+ x y) 2) (ite (>= x y) 2 x)

(33)

Verification

Generated solutions are verified if they pass a certain ratio of test cases (TestsRatio)

unsat – program correct sat – program incorrect

Verification query to SMT solver: (set-logic LIA)

(define-fun max ((x Int)(y Int)) Int (ite (>= y x) x y)) (declare-fun x () Int)

(declare-fun y () Int) (assert (not (and

(>= (max x y) x) (>= (max x y) y)

(or (= (max x y) x) (= (max x y) y))))) (check-sat)

(get-value (x y))

(34)

Example run

Log for Median3 problem:

Tests: found: 0 total: 0 known outputs: 0

Added test: (Map(a -> -70, b -> 0, c -> -70),Some(-70)) Added test: (Map(a -> -1, b -> -8, c -> 0),Some(-1)) Added test: (Map(a -> 0, b -> 5, c -> 0),Some(0)) Added test: (Map(a -> 0, b -> 0, c -> 1),Some(0)) Added test: (Map(a -> -2, b -> -1, c -> -2),Some(-2)) Added test: (Map(a -> -1, b -> 0, c -> 0),Some(0)) Added test: (Map(a -> 0, b -> 0, c -> -3),Some(0)) Added test: (Map(a -> 0, b -> 1, c -> -1),Some(0)) Tests: found: 8 total: 8 known outputs: 8

Added test: (Map(a -> -1, b -> 0, c -> -1),Some(-1)) Tests: found: 1 total: 9 known outputs: 9

Added test: (Map(a -> -7, b -> -6, c -> -7),Some(-7)) Tests: found: 1 total: 10 known outputs: 10

Tests: found: 0 total: 10 known outputs: 10 Tests: found: 0 total: 10 known outputs: 10

(35)

Different types of formal specifications

Spec 1: f (x , y ) ≥ x ∧ f (x , y ) ≥ y ∧ (f (x , y ) = x ∨ f (x , y ) = y ) Spec 2: (in2 = 0 =⇒ f (in2) = 0) ∧ (in2 > 0 =⇒ f (in2) = in2+ 1) Spec 3: f (x , y + 1) = f (x , y ) + 1 ∧ f (x + 1, y ) = f (x , y ) + 1 ∧ (x = 0 ∧ y = 0 =⇒ f (x , y ) = 0)

(36)

Single-invocation property

In the constraints every call to synthesis-function has the same arguments.

Correctness of a function on a single input does not depend on function’s values on other inputs.

Can be checked by syntactic analysis.

Spec 1: single-invocation

f (x , y )≥ x ∧

f (x , y )≥ y ∧ (f (x , y )= x ∨ f (x , y )= y ) Invocations: (x , y )

(37)

Single-invocation property

In the constraints every call to synthesis-function has the same arguments.

Correctness of a function on a single input does not depend on function’s values on other inputs.

Can be checked by syntactic analysis.

Spec 3: multiple-invocation

f (x , y + 1)=f (x , y )+ 1 ∧

f (x + 1, y )=f (x , y )+ 1 ∧

(x = 0 ∧ y = 0 =⇒ f (x , y )= 0)

Invocations: (x , y + 1), (x + 1, y ), (x , y )

(38)

Single-output property

For a single input there is only one correct output.

Can be considered locally (single input) or globally (all inputs). Can be checked by a query to SMT solver.

Spec 1: single-output

f (x , y ) ≥ x ∧

f (x , y ) ≥ y ∧ (f (x , y ) = x ∨ f (x , y ) = y )

The maximum of two numbers is strictly defined as a concrete value. Compare this with f (x ) ≥ 0 property, for which there is infinitely many

(39)

Single-output property

For a single input there is only one correct output.

Can be considered locally (single input) or globally (all inputs). Can be checked by a query to SMT solver.

Spec 2: multiple-output

(in2 = 0 =⇒ f (in2) = 0) ∧

(in2 > 0 =⇒ f (in2) = in2+ 1)

For in2 < 0 outputs of f are undefined, which means that any integer is a

valid output.

(40)

Single-output property

Query to SMT solver (checking globally): (set-logic LIA)

(declare-fun out1 () Int) (declare-fun out2 () Int)

(define-fun max 1 ((x Int)(y Int)) Int out1) (define-fun max 2 ((x Int)(y Int)) Int out2) (declare-fun x () Int)

(declare-fun y () Int) (assert (>= (max 1 x y) x)) (assert (>= (max 1 x y) y))

(assert (or (= x (max 1 x y)) (= y (max 1 x y)))) (assert (>= (max 2 x y) x))

(assert (>= (max 2 x y) y))

(41)

Single-output property

Queries to SMT solver (checking locally): (set-logic LIA)

(declare-fun out () Int)

(define-fun max ((x Int)(y Int)) Int out) (define-fun x () Int (- 1))

(define-fun y () Int 0) (assert (>= (max x y) x)) (assert (>= (max x y) y))

(assert (or (= (max x y) x) (= (max x y) y))) ;

---; Added in the 2nd query, after 0 is found to be a correct output. ; If other correct output is found, then single-output does not hold. (assert (distinct out 0))

; ---(check-sat)

(get-value (out))

(42)

Different types of formal specifications

Spec 1: single-invocation,single-output −→ standard evaluation

f (x , y ) ≥ x ∧

f (x , y ) ≥ y ∧ (f (x , y ) = x ∨ f (x , y ) = y )

Spec 2: single-invocation,multiple-output −→ evaluation by solver

(in2 = 0 =⇒ f (in2) = 0) ∧

(in2 > 0 =⇒ f (in2) = in2+ 1)

Spec 3: multiple-invocation,single-output −→ evaluation by solver

f (x , y + 1) = f (x , y ) + 1 ∧

(43)

Standard evaluation

Program tree is recursively reduced to a constant by function applications on leaves

Obtained constant is then compared with the expected output Requires single-invocation for problem, and single-output for the considered input Example: (ite (>= x y) y (+ x y)) x → 3, y → 4 1. (ite (>= 3 4) 4 (+ 3 4)) 2. (ite (>= 3 4) 4 7) 3. 7 7 6= 4 (expected answer)

(44)

Evaluation by solver

Query to SMT solver: (set-logic LIA)

(define-fun max ((x Int)(y Int)) Int (ite (>= x y) y x)) (define-fun x () Int (- 1))

(define-fun y () Int 0) (assert (>= (max x y) x)) (assert (>= (max x y) y))

(assert (or (= (max x y) x) (= (max x y) y)))) (check-sat)

sat – correct behavior of the program on the test unsat – incorrect behavior

(45)

Plan of the Presentation

1 Introduction

6 Experiments

(46)

Benchmarks – LIA

Benchmarks marked with * were created by us.

Other were taken from SyGuS competition (www.sygus.org).

Name Arity Semantics

*CountPos 2, 3, 4 The number of positive arguments *IsSeries 3, 4 Do arguments form an arithmetic series? *IsSorted 4, 5 Are arguments in ascending order?

*Median 3 The median of arguments

*Range 3 The range of arguments

Max 4 The maximum of arguments

Search 2, 3, 4 The index of an argument among the other arguments Sum 2, 3, 4 The sum of arguments

(47)

LIA Example: CountPos2

(set-logic LIA)

(synth-fun countpositive2 ((a Int) (b Int)) Int) (declare-var a Int)

(declare-var b Int)

(constraint (=> (and (<= a 0) (<= b 0)) (= (countpositive2 a b) 0))) (constraint (=> (and (> a 0) (<= b 0)) (= (countpositive2 a b) 1))) (constraint (=> (and (<= a 0) (> b 0)) (= (countpositive2 a b) 1))) (constraint (=> (and (> a 0) (> b 0)) (= (countpositive2 a b) 2))) (check-synth)

(48)

Benchmarks – SLIA

Benchmarks are based on those from SyGuS competition. The test-based specification was converted to a formal one covering all inputs.

Name Arity Semantics

dr-name 1 Extract first name from full name and prepend it with ”Dr.” firstname 1 Extract first name from full name

initials 1 Extract initials name from full name lastname 1 Extract last name from full name

combine 2 Combine first and last name into full name

combine-2 2 Combine first and last name into first name followed by initial combine-3 2 Combine first and last name into initial followed by last name combine-4 2 Combine first and last name into last name followed by initial phone 1 Extract the first triplet of digits from a phone number phone-1 1 Extract the second triplet of digits from a phone number phone-2 1 Extract the third triplet of digits from a phone number phone-3 1 Put first three digits of a phone number in parentheses

(49)

SLIA Example: combine-4

(set-logic SLIA)

(synth-fun f ((name String)) String SLIA_GRAMMAR) (declare-var firstname String)

(declare-var lastname String)

; (constraint (= (f "Launa" "Withers") "Withers, L.")) ; (constraint (= (f "Lakenya" "Edison") "Edison, L.")) ; (constraint (= (f "Brendan" "Hage") "Hage, B."))

(constraint (= (f firstname lastname) (str.++ lastname ", " (str.at firstname 0) "." ) ))

(check-synth)

(50)

SLIA Example: dr-name

(set-logic SLIA)

(synth-fun f ((name String)) String SLIA_GRAMMAR)

(define-fun ithSplit((s String)(delim String)(i Int)) ...) (define-fun precond ((s String)) Bool (and

(distinct (str.indexof s " " 0) (- 1)) ...

(distinct (str.at s (- (str.len s) 1)) " "))) (declare-var s String)

; (constraint (= (f "Nancy FreeHafer") "Dr. Nancy")) ; (constraint (= (f "Mariya Sergienko") "Dr. Mariya")) ; (constraint (= (f "Jan Kotas") "Dr. Jan"))

(constraint (=> (precond s) (= (str.len (f s)) (+ (str.indexof s " " 0) 4))))

(constraint (=> (precond s) (= (ithSplit (f s) " " 0) "Dr.")))

(constraint (=> (precond s) (= (ithSplit (f s) " " 1) (ithSplit s " " 0) )))

(51)

Experiments – approaches

CDGP

GPR – our baseline, which adds random tests instead of those found during verification

EUSolver – enumerates solutions and tries to unify already collected short programs into longer ones

CVC4 – refutation-based SMT approach to synthesis

(52)

Experiments – configurations

EvolutionMode x Selection x TestsRatio

EvolutionMode: generational, steady state Selection: Tournament7, Lexicase

TestsRatio (CDGP): 0.0, 0.25, 0.5, 0.75, 1.0 TestsRatio (GPR): 0.75, 1.0

(53)

Experiments – parameters

Parameter Value

Number of runs 25

Population size 500

Maximum height of initial programs 5

Maximum height of trees inserted by mutation 5

Maximum height of programs in population 12

Maximum number of generations 100000

Maximum runtime in seconds 3600

Probability of mutation 0.5

Probability of crossover 0.5

Tournament size 7

(54)

Experiments LIA – Success rate

CDGP GPR EUSolver CVC4

Gener. SteadySt. Gener. SteadySt. Tour Lex Tour Lex Tour Lex Tour Lex

CountPos2 0.96 1.00 0.86 0.99 0.80 0.80 0.76 0.66 1.00 1.00 CountPos3 0.41 0.70 0.33 0.62 0.08 0.34 0.08 0.20 1.00 1.00 CountPos4 0.04 0.18 0.02 0.08 0.00 0.24 0.00 0.00 1.00 1.00 IsSeries3 0.50 0.46 0.50 0.46 0.06 0.02 0.04 0.02 1.00 1.00 IsSeries4 0.18 0.15 0.21 0.13 0.00 0.00 0.00 0.00 1.00 1.00 IsSorted4 0.87 0.98 0.82 0.91 0.00 0.28 0.04 0.16 1.00 1.00 IsSorted5 0.73 0.89 0.72 0.88 0.00 0.02 0.02 0.02 1.00 1.00 Max4 0.99 1.00 0.95 0.99 0.92 0.96 0.98 0.84 1.00 1.00 Median3 0.94 1.00 0.85 0.98 0.94 0.76 0.80 0.76 1.00 1.00 Range3 0.86 0.96 0.70 0.88 0.86 0.92 0.70 0.82 1.00 1.00 Search2 0.98 0.98 0.86 0.94 0.96 0.98 0.84 0.88 1.00 1.00 Search3 0.92 0.94 0.70 0.94 0.86 0.88 0.40 0.78 1.00 1.00 Search4 0.78 0.87 0.52 0.77 0.82 0.84 0.24 0.72 1.00 1.00 Sum2 0.99 1.00 0.98 0.99 1.00 0.88 1.00 0.92 1.00 1.00 Sum3 0.60 0.91 0.53 0.88 0.48 0.66 0.34 0.42 1.00 1.00

(55)

Experiments LIA – Success rate (2)

Gener. SteadySt. Gener. SteadySt.

0.0 0.25 0.5 0.75 1.0 0.0 0.25 0.5 0.75 1.0 0.75 1.0 0.75 1.0 CountPos2 0.96 0.96 0.98 1.00 1.00 0.86 0.92 1.00 1.00 0.84 0.82 0.78 0.80 0.62 1.00 1.00 CountPos3 0.30 0.36 0.72 0.74 0.66 0.36 0.34 0.64 0.74 0.30 0.22 0.20 0.16 0.12 1.00 1.00 CountPos4 0.06 0.08 0.10 0.18 0.12 0.02 0.02 0.08 0.10 0.02 0.06 0.18 0.00 0.00 1.00 1.00 IsSeries3 0.20 0.22 0.34 0.72 0.90 0.36 0.24 0.32 0.84 0.66 0.00 0.08 0.00 0.06 1.00 1.00 IsSeries4 0.06 0.04 0.10 0.34 0.28 0.04 0.04 0.04 0.40 0.32 0.00 0.00 0.00 0.00 1.00 1.00 IsSorted4 0.72 0.98 0.94 1.00 1.00 0.80 0.90 0.94 0.98 0.72 0.06 0.22 0.04 0.16 1.00 1.00 IsSorted5 0.54 0.78 0.88 1.00 0.84 0.64 0.78 0.86 0.96 0.76 0.00 0.02 0.00 0.04 1.00 1.00 Max4 1.00 0.98 1.00 1.00 1.00 1.00 0.98 0.98 1.00 0.90 0.90 0.98 0.92 0.90 1.00 1.00 Median3 0.96 0.92 0.98 1.00 0.98 0.80 0.92 0.90 1.00 0.94 0.88 0.82 0.76 0.80 1.00 1.00 Range3 0.80 0.90 0.90 1.00 0.96 0.78 0.68 0.88 0.98 0.64 0.80 0.98 0.78 0.74 1.00 1.00 Search2 0.96 0.98 1.00 1.00 0.98 0.92 0.84 0.94 1.00 0.80 0.94 1.00 0.90 0.82 1.00 1.00 Search3 0.86 0.90 0.92 1.00 0.96 0.68 0.68 0.92 0.98 0.84 0.90 0.84 0.68 0.50 1.00 1.00 Search4 0.70 0.74 0.90 0.94 0.84 0.52 0.56 0.78 0.88 0.48 0.86 0.80 0.52 0.44 1.00 1.00 Sum2 1.00 1.00 1.00 1.00 0.98 0.96 1.00 1.00 1.00 0.98 0.90 0.98 0.94 0.98 1.00 1.00 Sum3 0.48 0.58 0.90 1.00 0.82 0.46 0.54 0.80 1.00 0.72 0.36 0.78 0.18 0.58 1.00 1.00 Sum4 0.08 0.10 0.04 0.70 0.18 0.04 0.02 0.28 0.58 0.22 0.02 0.20 0.06 0.08 1.00 1.00 All 0.60 0.66 0.73 0.85 0.78 0.58 0.59 0.71 0.84 0.63 0.48 0.55 0.42 0.43 1.00 1.00

(56)

Experiments LIA – Average runtime

Tour Lex Tour Lex Tour Lex Tour Lex

CountPos2 383 128 725 158 1092 1133 1314 1426 0.3 0.0 CountPos3 2727 1866 2815 2043 3490 2687 3373 3155 0.5 0.0 CountPos4 3528 3273 3577 3428 3600 3092 3600 3600 1.5 0.0 IsSeries3 2218 2489 2104 2346 3498 3546 3553 3550 0.2 0.0 IsSeries4 3131 3306 3079 3253 3600 3600 3600 3600 0.2 0.0 IsSorted4 1087 440 1071 696 3600 2861 3480 3171 0.2 0.0 IsSorted5 1710 1129 1506 990 3600 3565 3555 3555 0.3 0.0 Max4 420 94 297 208 809 287 423 1046 0.4 0.0 Median3 732 302 1018 475 539 1065 1204 1237 0.4 0.0 Range3 1186 643 1489 1031 851 622 1537 962 1.2 0.0 Search2 270 245 841 360 281 141 741 450 0.2 0.0 Search3 839 577 1534 499 886 584 2330 1195 0.3 0.0 Search4 1647 1098 2126 1340 1461 1144 2882 1315 0.4 0.0 Sum2 380 78 301 111 182 598 46 425 0.2 0.0 Sum3 2028 1076 2169 1222 2510 1837 2558 2600 0.3 0.0 Sum4 3230 3097 3182 3097 3509 3277 3539 3491 0.3 0.0

(57)

Experiments LIA – Average runtime (2)

Gener. SteadySt. Gener. SteadySt. 0.0 0.25 0.5 0.75 1.0 0.0 0.25 0.5 0.75 1.0 0.75 1.0 0.75 1.0 CountPos2 446 388 289 91 62 655 454 140 106 853 1089 1136 1072 1669 0.3 0.0 CountPos3 2961 3114 2036 1458 1913 2758 2823 2200 1512 2851 2991 3185 3244 3284 0.5 0.0 CountPos4 3521 3457 3366 3232 3427 3547 3547 3424 3415 3580 3479 3213 3600 3600 1.5 0.0 IsSeries3 3156 3176 2863 1611 962 2735 2974 2996 904 1516 3600 3444 3600 3503 0.2 0.0 IsSeries4 3468 3524 3339 2825 2935 3510 3549 3497 2570 2703 3600 3600 3600 3600 0.2 0.0 IsSorted4 1547 880 892 295 202 1302 1052 654 261 1149 3552 2909 3527 3125 0.2 0.0 IsSorted5 2362 1848 1275 532 1079 1919 1716 1013 347 1245 3600 3565 3600 3511 0.3 0.0 Max4 385 387 221 115 175 275 197 272 50 470 692 404 844 624 0.4 0.0 Median3 899 701 485 145 355 1332 890 833 77 601 699 905 1299 1143 0.4 0.0 Range3 1568 1178 793 285 748 1612 1718 1189 296 1484 1116 356 1273 1226 1.2 0.0 Search2 477 275 205 149 181 656 981 351 94 922 365 58 493 698 0.2 0.0 Search3 1229 1026 800 98 386 1534 1643 593 269 1046 746 724 1499 2027 0.3 0.0 Search4 1871 1836 1254 796 1108 2216 2272 1330 777 2071 1524 1081 2019 2177 0.4 0.0 Sum2 342 458 153 58 132 375 265 133 23 236 565 216 278 193 0.2 0.0 Sum3 2371 2401 1382 358 1248 2626 2222 1478 425 1724 2897 1450 3203 1955 0.3 0.0 Sum4 3556 3458 3522 2025 3258 3544 3551 3104 2337 3162 3539 3247 3542 3488 0.3 0.0 All 1885 1757 1430 880 1136 1912 1866 1450 842 1601 2128 1843 2293 2239 0.4 0.0

(58)

Experiments LIA – Average generated tests (2)

CDGP GPR

0.0 0.25 0.5 0.75 1.0 0.0 0.25 0.5 0.75 1.0 0.75 1.0 0.75 1.0 CountPos2 1338 1225 929 321 100 1291 1080 609 185 29 4997 1010 3792 799 CountPos3 6103 5941 3608 1473 117 4611 4490 3541 1301 54 8597 1048 7001 348 CountPos4 8774 8112 5449 2677 129 7566 6504 5650 2028 55 4878 737 5149 53 IsSeries3 6528 5292 4578 2388 145 5137 4610 3839 1311 57 27150 825 21840 508 IsSeries4 7268 6683 5728 4787 190 6728 5694 4857 3758 83 25856 21642 22119 18852 IsSorted4 3976 3508 3389 1551 131 3545 3355 2254 1096 64 30279 1054 23544 445 IsSorted5 5698 4701 4092 2180 164 4454 3964 2964 1527 85 25984 888 19716 664 Max4 2274 1900 1601 879 160 1844 1455 1440 568 45 3941 668 2980 338 Median3 3252 3022 2390 779 140 3929 2986 2671 426 46 3001 1233 4702 424 Range3 3896 3531 2067 767 110 3838 3611 2777 506 30 4672 700 3927 271 Search2 2136 1464 1105 374 88 2483 2688 860 204 35 1982 522 1443 334 Search3 4652 3680 2477 357 102 4377 4042 1754 278 38 3378 807 2995 206 Search4 5893 5961 3506 826 96 5451 5508 3080 315 32 5389 728 2377 83 Sum2 1277 892 599 354 108 1242 777 554 161 21 3215 563 1483 239 Sum3 4932 4510 3479 1072 145 5074 4162 3101 809 56 14962 811 11732 571 Sum4 7783 6941 6505 2548 151 7023 5612 5457 2244 66 10773 792 8904 264 All 4736 4210 3219 1458 130 4287 3784 2838 1045 50 11191 2127 8981 1525

(59)

Experiments SLIA – Success rate

CDGP CVC4 1.5 CVC4 head

Gener. SteadySt.

Tour Lex Tour Lex

dr-name 0.36 0.40 0.58 0.53 0.00 0.00 firstname 0.94 0.89 0.98 0.86 0.00 0.00 initials 0.07 0.12 0.31 0.37 0.00 0.00 lastname 0.62 0.70 0.69 0.60 0.00 0.00 name-combine 1.00 1.00 1.00 1.00 0.00 1.00 name-combine-2 0.99 0.99 0.96 1.00 0.00 0.00 name-combine-3 0.98 0.94 0.95 0.98 0.00 1.00 name-combine-4 0.96 0.86 0.93 0.96 0.00 0.00 phone 0.99 1.00 0.99 1.00 0.00 0.00 phone-1 0.98 0.99 1.00 0.99 0.00 0.00 phone-2 0.85 0.88 0.96 0.94 0.00 0.00 phone-3 0.00 0.01 0.14 0.13 0.00 0.00 phone-4 0.00 0.06 0.17 0.15 0.00 0.00 All 0.67 0.68 0.74 0.73 0.00 0.15

(60)

Experiments SLIA – Success rate (2)

CDGP CVC4 1.5 CVC4 head Gener. SteadySt. 0.0 0.25 0.5 0.75 1.0 0.0 0.25 0.5 0.75 1.0 dr-name 0.02 0.22 0.46 0.62 0.58 0.06 0.50 0.66 0.70 0.84 0.00 0.00 firstname 0.82 0.88 0.94 0.96 0.98 0.80 0.92 0.90 1.00 0.96 0.00 0.00 initials 0.00 0.04 0.06 0.12 0.26 0.04 0.20 0.38 0.58 0.50 0.00 0.00 lastname 0.28 0.48 0.76 0.92 0.86 0.36 0.46 0.72 0.84 0.84 0.00 0.00 name-combine 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.00 1.00 name-combine-2 1.00 0.98 1.00 1.00 0.98 0.94 0.96 1.00 1.00 1.00 0.00 0.00 name-combine-3 0.94 0.98 0.94 0.98 0.96 0.90 0.96 0.98 1.00 1.00 0.00 1.00 name-combine-4 0.64 0.96 0.94 1.00 1.00 0.84 0.94 0.98 0.96 1.00 0.00 0.00 phone 1.00 0.98 1.00 1.00 1.00 1.00 1.00 0.98 1.00 1.00 0.00 0.00 phone-1 0.96 0.96 1.00 1.00 1.00 1.00 0.98 1.00 1.00 1.00 0.00 0.00 phone-2 0.68 0.76 0.92 0.98 0.98 0.92 0.86 0.98 1.00 1.00 0.00 0.00 phone-3 0.00 0.00 0.02 0.00 0.00 0.00 0.18 0.14 0.24 0.12 0.00 0.00 phone-4 0.00 0.00 0.06 0.04 0.06 0.00 0.14 0.18 0.30 0.18 0.00 0.00 All 0.56 0.63 0.70 0.74 0.74 0.60 0.70 0.76 0.82 0.80 0.00 0.15

(61)

Experiments SLIA – Average runtime

CDGP CVC4 1.5 CVC4 head

Gener. SteadySt.

Tour Lex Tour Lex

dr-name 2553 2503 2024 2210 3600 125 firstname 634 912 460 945 3600 0.2 initials 3455 3423 2726 2749 3600 0.5 lastname 1752 1661 1418 1843 3600 1.3 name-combine 81 83 35 51 3600 0.6 name-combine-2 133 100 224 107 3600 3600 name-combine-3 415 565 358 307 3600 2618 name-combine-4 511 929 551 571 3600 3600 phone 542 317 115 75 3600 3600 phone-1 676 399 85 125 3600 3600 phone-2 1281 1206 434 562 3600 3600 phone-3 3600 3596 3179 3271 3600 3600 phone-4 3600 3457 3129 3215 3600 3600 All 1479 1473 1134 1233 3600 2149

(62)

Experiments SLIA – Average runtime (2)

CDGP CVC4 1.5 CVC4 head Gener. SteadySt. 0.0 0.25 0.5 0.75 1.0 0.0 0.25 0.5 0.75 1.0 dr-name 3585 3005 2328 1846 1875 3532 2412 1753 1632 1254 3600 125 firstname 1402 1009 680 442 333 1499 854 699 212 249 3600 0.2 initials 3600 3528 3494 3413 3161 3557 3290 2659 2009 2174 3600 0.5 lastname 3052 2366 1427 808 880 2589 2481 1375 790 920 3600 1.3 name-combine 175 87 53 47 48 100 57 32 9.5 17 3600 0.6 name-combine-2 167 200 36 73 106 369 321 99 21 18 3600 3600 name-combine-3 951 520 483 198 299 871 456 207 61 66 3600 2618 name-combine-4 2223 545 396 212 223 1718 523 259 220 83 3600 3600 phone 564 660 331 302 291 183 93 184 7.8 5.4 3600 3600 phone-1 837 882 396 317 257 267 150 39 61 7.3 3600 3600 phone-2 2015 2002 892 635 674 1222 972 222 46 29 3600 3600 phone-3 3600 3600 3590 3600 3600 3600 3131 3184 2951 3260 3600 3600 phone-4 3600 3600 3452 3533 3457 3600 3154 3149 2863 3093 3600 3600 All 1982 1693 1351 1187 1170 1777 1376 1066 837 860 3600 2149

(63)

Experiments SLIA – Average generated tests (2)

CDGP Gener. SteadySt. 0.0 0.25 0.5 0.75 1.0 0.0 0.25 0.5 0.75 1.0 dr-name 362 162 37 32 24 336 125 32 7 5 firstname 150 83 39 29 28 136 58 18 4 4 initials 269 103 58 39 30 285 134 37 14 7 lastname 238 112 43 28 25 177 75 22 11 6 name-combine 69 45 40 39 38 40 18 10 4 4 name-combine-2 122 87 40 32 33 158 86 42 13 12 name-combine-3 189 80 56 36 38 171 58 23 11 6 name-combine-4 349 87 47 33 31 336 58 31 11 6 phone 82 90 41 39 37 36 16 17 3 2 phone-1 93 102 45 39 35 46 26 5 3 2 phone-2 248 247 87 51 46 185 138 26 8 6 phone-3 150 145 17 12 13 129 74 5 2 2 phone-4 144 117 11 13 12 119 44 4 2 2 All 190 112 43 32 30 166 70 21 7 5

(64)

Experiments

Average ranks for Friedman test (LIA):

GL75 SL75 GT75 ST75 GL5 GL1 GT1 GL25 SL5 GL0 SL1 GT5 SL0 ST5

4.6 5.1 5.2 6.9 7.5 8.3 9.0 9.5 10.4 11.2 11.7 12.6 12.7 14.2

SL25 GPRGL1 GPRGT1 GT25 GT0 GPRGL75 ST25 ST1 GPRGT75 ST0 GPRST1 GPRSL75 GPRSL1 GPRST75 14.5 16.6 17.1 17.8 18.8 19.2 20.4 20.7 21.0 21.3 21.8 22.2 22.4 23.0

Post-hoc analysis using symmetry test (described by Hollander, et al.) All CDGP configurations with q = 0.75 are better than all GPR configurations (p < 0.05) (except for ST75)

(65)

Experiments

Average ranks for Friedman test (SLIA):

ST1 5.4 GL25 12.7 SL75 5.8 SL0 13.6 ST75 6.2 GT25 15.3 SL1 7.0 ST0 15.3 ST5 7.5 GT0 15.3 GL75 8.6 GL0 15.8 GT1 9.0 SL5 9.3 GL1 9.4 GT75 9.5 ST25 10.3 GT5 10.7 GL5 11.0 SL25 12.1

Post-hoc analysis using symmetry test (described by Hollander, et al.) Most of pairwise differences are statistically insignificant

(66)

(67)

Bibliography I

[1] Krzysztof Krawiec, Iwo Błądek, and Jerry Swan.

“Counterexample-driven Genetic Programming”.In: Proceedings of

the Genetic and Evolutionary Computation Conference. GECCO ’17.

Berlin, Germany: ACM, 2017, pp. 953–960.