Coevolutionary Reinforcement Learning for Othello and Small-Board Go
Marcin Szubert
Wojciech Jaśkowski
Krzysztof Krawiec
Institute of Computing Science
Poznan University of Technology
Outline
1. Introduction
   - Inspiration
   - Motivation and Objectives
2. Methods
   - Coevolution
   - Reinforcement Learning
   - Coevolutionary Reinforcement Learning
3. Experimental Results
   - CTDL for Othello
   - CTDL(λ) for Small-Board Go
4. Summary and Conclusions
Inspiration — Samuel’s Checkers Player
“Some Studies in Machine Learning Using the Game of Checkers”
– A. L. Samuel, IBM Journal of Research and Development, 1959
Games provide a convenient vehicle for the development of learning procedures as contrasted with a problem taken from life, since many of the complications of detail are removed.
Arthur Lee Samuel
[Figure: minimax game tree with alternating MAX/MIN plies, marking a previously visited state and the ply at which the polynomial value is computed]
Learning methods based on Shannon's minimax procedure:
- Rote Learning – handcrafted scoring polynomial
- Learning by Generalization – polynomial modification
Inspiration — Different Views of Samuel’s Work
Samuel was one of the first to make effective use of heuristic search methods and of what we would now call temporal difference learning.
Richard Sutton & Andrew Barto

To elaborate the analogy with evolutionary computation, Samuel's procedure can be called a coevolutionary algorithm with two populations of size 1, asynchronous population updates, and domain-specific, deterministic variation operators.
Anthony Bucci
Inspiration — Lucas & Runarsson
“Temporal Difference Learning Versus Co-Evolution for Acquiring Othello Position Evaluation” – S. Lucas and T. Runarsson, IEEE Symposium on Computational Intelligence and Games, 2006

Why is this work interesting?
- Learning with little a priori knowledge
- Comparison of a coevolutionary algorithm and a non-evolutionary approach
- Complementary advantages of these methods revealed by experimental results
Motivation
Past observations:
- Temporal Difference Learning (TDL) is much faster
- Coevolutionary Learning (CEL) can eventually produce better strategies if its parameters are tuned properly

Is it possible to combine the advantages of TDL and CEL in a single hybrid algorithm?

We propose the Coevolutionary Temporal Difference Learning (CTDL) method and evaluate it on the board games of Othello and Small-Board Go. We also incorporate a simple Hall of Fame (HoF) archive.
Objectives
Objective: learn a game-playing strategy represented by the weights of a weighted piece counter (WPC).
Example WPC (one weight per board square):
 1.00 -0.25  0.10  0.05  0.05  0.10 -0.25  1.00
-0.25 -0.25  0.01  0.01  0.01  0.01 -0.25 -0.25
 0.10  0.01  0.05  0.02  0.02  0.05  0.01  0.10
 0.05  0.01  0.02  0.01  0.01  0.02  0.01  0.05
 0.05  0.01  0.02  0.01  0.01  0.02  0.01  0.05
 0.10  0.01  0.05  0.02  0.02  0.05  0.01  0.10
-0.25 -0.25  0.01  0.01  0.01  0.01 -0.25 -0.25
 1.00 -0.25  0.10  0.05  0.05  0.10 -0.25  1.00
$f(b) = \sum_{i=1}^{8 \times 8} w_i b_i$
The emphasis throughout all of these studies has been on learning techniques. The temptation to improve the machine's game by giving it standard openings or other man-generated knowledge of playing techniques has been consistently resisted.
Arthur Lee Samuel
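To make the WPC representation concrete, the following is a minimal Python sketch of the evaluation function and 1-ply greedy move selection. The board encoding (+1 for the player's pieces, -1 for the opponent's, 0 for empty squares) and all identifiers are illustrative assumptions, not taken from the authors' code.

from typing import Sequence

def wpc_value(board: Sequence[float], weights: Sequence[float]) -> float:
    """Evaluate a position as f(b) = sum_i w_i * b_i over all 64 squares."""
    return sum(w * b for w, b in zip(weights, board))

def greedy_move(moves, apply_move, board, weights):
    """1-ply greedy play: choose the legal move whose resulting position
    has the highest WPC value (helper names are hypothetical)."""
    return max(moves, key=lambda m: wpc_value(apply_move(board, m), weights))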
Coevolutionary Algorithm
Coevolution in nature
Reciprocally induced evolutionary change between two or more
interacting species or populations.
The simplest variant of a one-population generational competitive coevolutionary algorithm:
Algorithm 1: Basic scheme of a generational evolutionary algorithm
1: P ← createRandomPopulation()
2: evaluatePopulation(P)
3: while ¬terminationCondition() do
4:     S ← selectParents(P)
5:     P ← recombineAndMutate(S)
6:     evaluatePopulation(P)
7: end while
8: return getFittestIndividual(P)
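For readers who prefer code, here is a direct Python rendering of Algorithm 1. The genetic operators are callables supplied by the user, since the scheme is representation-agnostic; the binary-tournament selection is an illustrative choice, not prescribed by the pseudocode.

import random

def generational_ea(pop_size, n_generations, fitness,
                    random_individual, recombine_and_mutate):
    # P <- createRandomPopulation(); fitness is evaluated on demand below
    population = [random_individual() for _ in range(pop_size)]
    for _ in range(n_generations):          # while not terminationCondition()
        # S <- selectParents(P): binary tournament (an illustrative choice)
        parents = [max(random.sample(population, 2), key=fitness)
                   for _ in range(pop_size)]
        # P <- recombineAndMutate(S)
        population = recombine_and_mutate(parents)
    # return getFittestIndividual(P)
    return max(population, key=fitness)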
What mainly distinguishes coevolution from a standard EA?
- context-sensitive evaluation phase
- no objective fitness =⇒ no guarantee of progress
Coevolutionary Fitness Assignment
Common interaction patterns:
[Figure 2.1: round-robin tournament interaction schemes – (a) within a single population, where evaluating n members requires n(n − 1)/2 games; (b) between two populations of sizes n and m, requiring nm games]
How to aggregate interaction results into a single fitness value?
- calculate the sum of all interaction outcomes
- use competitive fitness sharing
- the problem of measurement
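The two aggregation options above can be sketched as follows. The payoff convention (1 for a win, 0.5 for a draw, 0 for a loss, with results[i][i] = 0 since there is no self-play) and all identifiers are assumptions for illustration; the sharing rule follows Rosin and Belew's competitive fitness sharing.

def sum_of_outcomes(results):
    """results[i][j] is i's payoff against j; fitness is the plain sum."""
    return [sum(row) for row in results]

def shared_fitness(results):
    """Competitive fitness sharing: a win against opponent j is worth 1/N_j,
    where N_j is the number of players that beat j. Draws are ignored in
    this simple variant."""
    n = len(results)
    beaten_by = [sum(1 for i in range(n) if i != j and results[i][j] == 1.0)
                 for j in range(n)]
    return [sum(1.0 / beaten_by[j]
                for j in range(n) if j != i and results[i][j] == 1.0)
            for i in range(n)]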
Coevolutionary Archive
Maintaining historical players in the Hall of Fame (HoF) archive for breeding and evaluation purposes.
Evaluation phase:
1. Play round robin tournament between population members
2. Randomly select archival individuals to act as opponents
3. Select the best-of-generation individual and add it to the archive
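A minimal sketch of this three-step evaluation phase follows. The play_game convention (returning the first player's payoff of 1, 0.5, or 0), the number of archival opponents, and all identifiers are illustrative assumptions.

import random

def evaluate_with_hof(population, archive, play, n_archival=10):
    """One evaluation phase: round-robin plus games against random archive
    members; appends the best-of-generation individual to the archive."""
    fitness = {id(p): 0.0 for p in population}
    # 1. Round-robin tournament between population members
    for i, a in enumerate(population):
        for b in population[i + 1:]:
            pa = play(a, b)
            fitness[id(a)] += pa
            fitness[id(b)] += 1.0 - pa
    # 2. Randomly selected archival individuals act as opponents
    if archive:
        opponents = random.choices(archive, k=n_archival)
        for a in population:
            for opp in opponents:
                fitness[id(a)] += play(a, opp)
    # 3. Add the best-of-generation individual to the archive
    best = max(population, key=lambda p: fitness[id(p)])
    archive.append(best)
    return fitness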
Successes of Reinforcement Learning
Reinforcement Learning ideas have been independently
validated in many application areas [Sutton, ICML 2009]
RL application areas (survey by Csaba Szepesvari of 77 recent application papers, based on an IEEE.org search for the keywords "RL" and "application"):
- Process Control: 23%
- Networking: 21%
- Resource Management: 18%
- Robotics: 13%
- Other: 8%
- Autonomic Computing: 6%
- Traffic: 6%
- Finance: 4%
Examples: signal processing, natural language processing, web services, brain-computer interfaces, aircraft control, engine control, bio/chemical reactors, sensor networks, routing, call admission control, network resource management, power systems, inventory control, supply chains, customer service, mobile robots, motion control, Robocup, vision, stoplight control, trains, unmanned vehicles, load balancing, memory management, algorithm tuning, option pricing, asset management
The Reinforcement Learning Paradigm
Reinforcement Learning (RL)
Machine learning paradigm focused on solving problems in which an agent interacts with an environment by taking actions and receiving rewards at discrete time steps. The objective is to find a decision policy that maximizes the cumulative reward.
[Diagram: agent-environment interaction loop – (1) the agent observes state $s_t$, (2) takes action $a_t$, (3) receives reward $r_t$, (4) learns on the basis of the tuple $\langle s_t, a_t, r_t, s_{t+1} \rangle$]
In Othello:
agent =⇒ player
environment =⇒ game
state =⇒ board state
action =⇒ legal move
reward =⇒ game result
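The interaction loop can be sketched generically as below; the reset/step environment interface mirrors common RL conventions and is an assumption, not an API from the paper.

def run_episode(env, policy, learn):
    """Generic agent-environment loop over <s_t, a_t, r_t, s_{t+1}> tuples."""
    state = env.reset()                               # 1. observe state s_t
    done = False
    while not done:
        action = policy(state)                        # 2. take action a_t
        next_state, reward, done = env.step(action)   # 3. receive reward r_t
        learn(state, action, reward, next_state)      # 4. learn from the tuple
        state = next_state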
Key Ideas of Reinforcement Learning
The agent's goal is to learn a policy $\pi : S \mapsto A$ that maximizes the expected return $R_t$ (a function of the future rewards $r_{t+1}, r_{t+2}, \ldots$):
- rewards define the goal of learning
- cumulative discounted return: $R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$
- delayed rewards – the temporal credit assignment problem

RL methods specify how the agent changes its policy as a result of experience:
- trial-and-error search
- exploration-exploitation trade-off

All efficient methods for solving sequential decision problems estimate a value function as an intermediate step: $V^{\pi}(s) = E_{\pi}[R_t \mid s_t = s]$

[Diagram: generalized policy iteration – alternating policy evaluation ($V \to V^{\pi}$) and policy improvement ($\pi \to \mathrm{greedy}(V)$), converging to the optimal $\pi^*$ and $V^*$]
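As a small worked illustration of the discounted return (with hypothetical numbers):

def discounted_return(rewards, gamma):
    """R_t = sum_k gamma^k * r_{t+k}, computed from a list of future rewards."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# e.g. rewards [0, 0, 1] with gamma = 0.9 give R_t = 0.9**2 = 0.81
assert abs(discounted_return([0, 0, 1], 0.9) - 0.81) < 1e-12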
Prediction Learning Problem
Experience-outcome sequence: $s_1, s_2, s_3, \ldots, s_T; z$
Sequence of predictions of $z$: $V(s_1), V(s_2), V(s_3), \ldots, V(s_T)$

Supervised learning:
$V(s_t) := V(s_t) + \alpha[z - V(s_t)]$

Temporal difference learning:
$V(s_t) := V(s_t) + \alpha[V(s_{t+1}) - V(s_t)]$

[Figures: the driving-home example – predicted total travel time re-estimated at successive situations (leaving office, reaching car, exiting highway, secondary road, home street, arriving home); supervised learning shifts every prediction toward the actual outcome, while TD shifts each prediction toward the one that follows it]
Figures in slides 19-22 come from “Reinforcement Learning: An Introduction” by R. Sutton and A. Barto
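A minimal sketch contrasting the two update rules on a tabular predictor; all names are illustrative.

def supervised_update(V, states, z, alpha):
    """Move every prediction toward the final outcome z
    (requires waiting for the whole sequence)."""
    for s in states:
        V[s] += alpha * (z - V[s])

def td_update(V, s, s_next, alpha):
    """Move V(s_t) toward V(s_{t+1}); can be applied online, step by step."""
    V[s] += alpha * (V[s_next] - V[s])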
Prediction Learning Problem =⇒ Policy Evaluation
Sample experience following a policy $\pi$: $s_1, r_1, s_2, r_2, \ldots, s_T, r_T$
Sequence of estimates of $V^{\pi}(s_t)$: $V(s_1), V(s_2), \ldots, V(s_T)$

Monte-Carlo method:
$V(s_t) := V(s_t) + \alpha[R_t - V(s_t)]$, where $R_t = \sum_{k=0}^{T-t} \gamma^k r_{t+k}$

TD(0) method:
$V(s_t) := V(s_t) + \alpha[R_t^{(1)} - V(s_t)]$, where $R_t^{(1)} = r_t + \gamma V(s_{t+1})$

The $n$-step return spans the spectrum from TD (1-step) through 2-step, 3-step, ..., to Monte Carlo:
$R_t^{(n)} = \sum_{k=0}^{n-1} \gamma^k r_{t+k} + \gamma^n V(s_{t+n})$
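The n-step return above can be computed directly from recorded experience; the list-based indexing (rewards[t] holds $r_t$, values[t] holds $V(s_t)$) is an illustrative assumption.

def n_step_return(rewards, values, t, n, gamma):
    """R_t^(n) = sum_{k<n} gamma^k r_{t+k} + gamma^n V(s_{t+n});
    assumes t + n stays within the recorded episode."""
    g = sum(gamma ** k * rewards[t + k] for k in range(n))
    return g + gamma ** n * values[t + n]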
Eligibility Traces – TD(λ) Method
TD(λ) method:
$R_t^{(n)} = \sum_{k=0}^{n-1} \gamma^k r_{t+k} + \gamma^n V(s_{t+n})$
$R_t^{(\lambda)} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} R_t^{(n)}$
$V(s_t) := V(s_t) + \alpha[R_t^{(\lambda)} - V(s_t)]$

[Figure: weighting of the n-step returns in the λ-return – the weight of $R_t^{(n)}$ is $(1-\lambda)\lambda^{n-1}$, decaying by λ per step, with the remaining weight given to the actual final return; total area = 1]

Special cases, for comparison:
Monte-Carlo method (λ = 1): $V(s_t) := V(s_t) + \alpha[R_t - V(s_t)]$, where $R_t = \sum_{k=0}^{T-t} \gamma^k r_{t+k}$
TD(0) method (λ = 0): $V(s_t) := V(s_t) + \alpha[R_t^{(1)} - V(s_t)]$, where $R_t^{(1)} = r_t + \gamma V(s_{t+1})$
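A minimal forward-view sketch of the λ-return, truncated at the end of an episode of length T; the list-based indexing is the same illustrative assumption as in the n-step sketch above.

def lambda_return(rewards, values, t, lam, gamma):
    """R_t^(lambda) = (1-lambda) * sum_{n=1}^{T-t-1} lambda^(n-1) R_t^(n)
    + lambda^(T-t-1) * R_t, where the actual final return takes the
    remaining weight (the weights sum to 1)."""
    T = len(rewards)
    def n_step(n):  # R_t^(n), as defined on this slide
        return (sum(gamma ** k * rewards[t + k] for k in range(n))
                + gamma ** n * values[t + n])
    g = sum((1 - lam) * lam ** (n - 1) * n_step(n) for n in range(1, T - t))
    r_mc = sum(gamma ** k * rewards[t + k] for k in range(T - t))  # actual return
    return g + lam ** (T - t - 1) * r_mc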
Gradient-Descent Temporal Difference Learning
- Tabular TD(λ) does not address the issue of generalization
- Value function $V_t$ represented as a parameterized functional form with a modifiable parameter vector $w_t$
- Weight update rule for gradient-descent TD(λ):
  $w_{t+1} := w_t + \alpha[R_t^{(\lambda)} - V_t(s_t)] \nabla_{w_t} V_t(s_t)$

TDL applied to Othello/Go:
- Position evaluation function $f(b_t)$ used to compute $V_t$
- Learning through modifying the WPC weight vector
- Training data obtained by self-play
- Impressive application – TD-Gammon by Gerald Tesauro
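The following sketches one gradient-descent TD step for a WPC-based evaluator, shown for λ = 0 for brevity (TD(λ) additionally maintains an eligibility-trace vector). Squashing the WPC output with tanh is a common choice in this line of work, but that choice and all identifiers here are assumptions, not the authors' confirmed setup.

import math

def td_update_wpc(w, board, board_next, alpha, gamma=1.0, reward=0.0):
    """One gradient-descent TD(0) step, assuming V(b) = tanh(f(b)) with
    f(b) = sum_i w_i * b_i, so that dV/dw_i = (1 - V^2) * b_i.
    In game play the reward is typically nonzero only at the end of a game."""
    v = math.tanh(sum(wi * bi for wi, bi in zip(w, board)))
    v_next = math.tanh(sum(wi * bi for wi, bi in zip(w, board_next)))
    delta = reward + gamma * v_next - v          # TD error
    scale = alpha * delta * (1.0 - v * v)        # alpha * delta * dV/df
    for i, bi in enumerate(board):
        w[i] += scale * bi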
Coevolutionary Temporal Difference Learning
A hybrid of coevolutionary search and reinforcement learning that works by interlacing one-population competitive coevolution with temporal difference learning.

The population of players is subject to alternating learning phases:
- TDL phase – each population member plays k games with itself
- CEL phase – a single round-robin tournament between population members (and, optionally, also archival individuals)

Each TDL-CEL cycle is followed by the standard stages of fitness assignment, selection, and recombination (sketched below).
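A high-level sketch of the CTDL loop described above; all callables are stand-ins, and evaluate can be the round-robin-plus-HoF evaluation phase sketched earlier (assumed to return a fitness mapping and to append the best-of-generation individual to the archive).

def ctdl(population, archive, k, td_self_play, evaluate, select_and_breed,
         n_cycles):
    """CTDL main loop: interlace TDL self-play with competitive coevolution."""
    for _ in range(n_cycles):
        # TDL phase: each member plays k games with itself, updating its
        # WPC weights by temporal difference learning
        for player in population:
            td_self_play(player, games=k)
        # CEL phase: tournament-based (context-sensitive) fitness assignment
        fitness = evaluate(population, archive)
        # Standard stages: selection and recombination on tournament fitness
        population = select_and_breed(population, fitness)
    return archive[-1]   # best-of-generation from the final cycle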
Lamarckian Coevolution Perspective
Memetic Algorithms combine population-based global search with an individual local improvement procedure.

Lamarckian theory of evolution:
- the concept of the inheritance of acquired traits
- discredited in the field of biology
- successfully implemented as Lamarckian EA

CTDL as a Lamarckian Coevolutionary Algorithm:
- CEL performs exploration of the solution space
- TDL is responsible for its exploitation, by means of local search performed on a dynamic fitness landscape
Performance vs. Random Othello Player
[Plot: probability of winning vs. games played (×100,000); curves: CTDL + HoF, CTDL, TDL, CEL + HoF, CEL]
Performance vs. Heuristic Player
[Plot: probability of winning vs. games played (×100,000); curves: CTDL + HoF, CTDL, TDL, CEL + HoF, CEL]
Relative Performance Progress Over Time
[Plot: points in tournaments vs. games played (×100,000); curves: CTDL + HoF, CTDL, TDL, CEL + HoF, CEL]
Best Evolved Othello WPC
 1.04 -0.20  0.50 -0.01  0.02  0.46 -0.39  0.87
-0.51 -0.74 -0.19 -0.15 -0.18 -0.26 -0.56 -0.24
 0.35 -0.13  0.08 -0.01  0.02  0.05 -0.28  0.50
 0.02 -0.10  0.01  0.00 -0.01  0.01 -0.11 -0.10
-0.34 -0.11  0.01 -0.07 -0.03  0.00 -0.15 -0.05
 0.61 -0.20  0.07 -0.01 -0.02  0.05 -0.23  0.42
-0.44 -0.61 -0.25 -0.12 -0.12 -0.13 -0.73 -0.21
 0.99 -0.32  0.56 -0.02 -0.31  0.78 -0.54  1.10
Initialization and Evolution Progress
TD(λ) Performance vs. Go Heuristic Player
[Plot: probability of winning vs. games played (×100,000); curves: λ = 0.0, 0.4, 0.8, 0.9, 0.95, 1.0]
Performance vs. Go Heuristic Player
[Plot: probability of winning vs. games played (×100,000); curves: CTDL + HoF, CTDL, TDL, CEL + HoF, CEL]
Performance vs. Average Liberty Player
[Plot: probability of winning vs. games played (×100,000); curves: CTDL + HoF, CTDL, TDL, CEL + HoF, CEL]
Relative Performance Progress Over Time
[Plot: points in tournaments vs. games played (×100,000); curves: CTDL + HoF, CTDL, TDL, CEL + HoF, CEL]
Summary
- CTDL benefits from the mutually complementary characteristics of both constituent methods.
- It retains an unsupervised character – useful when knowledge of the problem domain is unavailable or expensive to obtain.
- CTDL admits an interesting biological interpretation as an analogy to the Lamarckian theory of evolution.
- Further investigation of CTDL in the context of other challenging problems is needed.
Future Work
- Employing a more complex learner architecture than the WPC
- Using CTDL with two-population coevolution, with solutions and tests bred separately (the learner-teacher paradigm)
- Including more advanced archive methods such as LAPCA or IPCA
- Changing the character of the TDL phase so that it influences only the evaluation process (which would model the Baldwin Effect)