Coevolutionary Reinforcement Learning for Othello and Small-Board Go
Marcin Szubert
Wojciech Jaśkowski
Krzysztof Krawiec
Institute of Computing Science
Poznan University of Technology
Outline
1. Introduction
   - Inspiration
   - Motivation and Objectives
2. Methods
   - Coevolution
   - Reinforcement Learning
   - Coevolutionary Reinforcement Learning
3. Experimental Results
   - CTDL for Othello
   - CTDL(λ) for Small-Board Go
4. Summary and Conclusions
Inspiration — Samuel’s Checkers Player
“Some Studies in Machine Learning Using the Game of Checkers”
– A. L. Samuel, IBM Journal of Research and Development, 1959
Games provide a convenient vehicle for the development of learning procedures as contrasted with a problem taken from life, since many of the complications of detail are removed.
Arthur Lee Samuel
[Figure: minimax game tree with alternating MAX/MIN plies, marking a previously visited state and the ply at which the polynomial value is computed]
Learning methods based on Shannon's minimax procedure:
- Rote Learning – handcrafted scoring polynomial
- Learning by Generalization – polynomial modification
Inspiration — Different Views of Samuel’s Work
Samuel was one of the first to make effective use of heuristic search methods and of what we would now call temporal difference learning.
Richard Sutton & Andrew Barto

To elaborate the analogy with evolutionary computation, Samuel's procedure can be called a coevolutionary algorithm with two populations of size 1, asynchronous population updates, and domain-specific, deterministic variation operators.
Anthony Bucci
Inspiration — Lucas & Runarsson
“Temporal Difference Learning Versus Co-Evolution for Acquiring Othello Position Evaluation” – S. Lucas and T. Runarsson, IEEE Symposium on Computational Intelligence and Games, 2006

Why is this work interesting?
- Learning with little a priori knowledge
- Comparison of a coevolutionary algorithm and a non-evolutionary approach
- Complementary advantages of these methods revealed by experimental results
Motivation
Past observations:
- Temporal Difference Learning (TDL) is much faster
- Coevolutionary Learning (CEL) can eventually produce better strategies if its parameters are tuned properly

Is it possible to combine the advantages of TDL and CEL in a single hybrid algorithm?

We propose the Coevolutionary Temporal Difference Learning (CTDL) method and evaluate it on the board games of Othello and Small-Board Go. We also incorporate a simple Hall of Fame (HoF) archive.
Objectives
Objective: learn a game-playing strategy represented by the weights of a weighted piece counter (WPC).
Example WPC (one weight per board square):
 1.00 -0.25  0.10  0.05  0.05  0.10 -0.25  1.00
-0.25 -0.25  0.01  0.01  0.01  0.01 -0.25 -0.25
 0.10  0.01  0.05  0.02  0.02  0.05  0.01  0.10
 0.05  0.01  0.02  0.01  0.01  0.02  0.01  0.05
 0.05  0.01  0.02  0.01  0.01  0.02  0.01  0.05
 0.10  0.01  0.05  0.02  0.02  0.05  0.01  0.10
-0.25 -0.25  0.01  0.01  0.01  0.01 -0.25 -0.25
 1.00 -0.25  0.10  0.05  0.05  0.10 -0.25  1.00
$f(b) = \sum_{i=1}^{8 \times 8} w_i b_i$
The emphasis throughout all of these studies has been on learning techniques. The temptation to improve the machine's game by giving it standard openings or other man-generated knowledge of playing techniques has been consistently resisted.
Arthur Lee Samuel
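To make the WPC representation concrete, the following is a minimal Python sketch of the evaluation function and 1-ply greedy move selection. The board encoding (+1 for the player's pieces, -1 for the opponent's, 0 for empty squares) and all identifiers are illustrative assumptions, not taken from the authors' code.

from typing import Sequence

def wpc_value(board: Sequence[float], weights: Sequence[float]) -> float:
    """Evaluate a position as f(b) = sum_i w_i * b_i over all 64 squares."""
    return sum(w * b for w, b in zip(weights, board))

def greedy_move(moves, apply_move, board, weights):
    """1-ply greedy play: choose the legal move whose resulting position
    has the highest WPC value (helper names are hypothetical)."""
    return max(moves, key=lambda m: wpc_value(apply_move(board, m), weights))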
Coevolutionary Algorithm
Coevolution in nature
Reciprocally induced evolutionary change between two or more
interacting species or populations.
The simplest variant of a one-population generational competitive coevolutionary algorithm:
Algorithm 1: Basic scheme of a generational evolutionary algorithm
1: P ← createRandomPopulation()
2: evaluatePopulation(P)
3: while ¬terminationCondition() do
4:     S ← selectParents(P)
5:     P ← recombineAndMutate(S)
6:     evaluatePopulation(P)
7: end while
8: return getFittestIndividual(P)
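For readers who prefer code, here is a direct Python rendering of Algorithm 1. The genetic operators are callables supplied by the user, since the scheme is representation-agnostic; the binary-tournament selection is an illustrative choice, not prescribed by the pseudocode.

import random

def generational_ea(pop_size, n_generations, fitness,
                    random_individual, recombine_and_mutate):
    # P <- createRandomPopulation(); fitness is evaluated on demand below
    population = [random_individual() for _ in range(pop_size)]
    for _ in range(n_generations):          # while not terminationCondition()
        # S <- selectParents(P): binary tournament (an illustrative choice)
        parents = [max(random.sample(population, 2), key=fitness)
                   for _ in range(pop_size)]
        # P <- recombineAndMutate(S)
        population = recombine_and_mutate(parents)
    # return getFittestIndividual(P)
    return max(population, key=fitness)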
What mainly distinguishes coevolution from a standard EA?
- context-sensitive evaluation phase
- no objective fitness =⇒ no guarantee of progress
Coevolutionary Fitness Assignment
Common interaction patterns:
[Figure 2.1: round-robin tournament interaction schemes – (a) within a single population, where evaluating n members requires n(n − 1)/2 games; (b) between two populations of sizes n and m, requiring nm games]
How to aggregate interaction results into a single fitness value?
- calculate the sum of all interaction outcomes
- use competitive fitness sharing
- the problem of measurement
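The two aggregation options above can be sketched as follows. The payoff convention (1 for a win, 0.5 for a draw, 0 for a loss, with results[i][i] = 0 since there is no self-play) and all identifiers are assumptions for illustration; the sharing rule follows Rosin and Belew's competitive fitness sharing.

def sum_of_outcomes(results):
    """results[i][j] is i's payoff against j; fitness is the plain sum."""
    return [sum(row) for row in results]

def shared_fitness(results):
    """Competitive fitness sharing: a win against opponent j is worth 1/N_j,
    where N_j is the number of players that beat j. Draws are ignored in
    this simple variant."""
    n = len(results)
    beaten_by = [sum(1 for i in range(n) if i != j and results[i][j] == 1.0)
                 for j in range(n)]
    return [sum(1.0 / beaten_by[j]
                for j in range(n) if j != i and results[i][j] == 1.0)
            for i in range(n)]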
Coevolutionary Archive
Maintaining historical players in the Hall of Fame (HoF) archive for breeding and evaluation purposes.
Evaluation phase:
1. Play round robin tournament between population members
2. Randomly select archival individuals to act as opponents
3. Select the best-of-generation individual and add it to the archive
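A minimal sketch of this three-step evaluation phase follows. The play_game convention (returning the first player's payoff of 1, 0.5, or 0), the number of archival opponents, and all identifiers are illustrative assumptions.

import random

def evaluate_with_hof(population, archive, play, n_archival=10):
    """One evaluation phase: round-robin plus games against random archive
    members; appends the best-of-generation individual to the archive."""
    fitness = {id(p): 0.0 for p in population}
    # 1. Round-robin tournament between population members
    for i, a in enumerate(population):
        for b in population[i + 1:]:
            pa = play(a, b)
            fitness[id(a)] += pa
            fitness[id(b)] += 1.0 - pa
    # 2. Randomly selected archival individuals act as opponents
    if archive:
        opponents = random.choices(archive, k=n_archival)
        for a in population:
            for opp in opponents:
                fitness[id(a)] += play(a, opp)
    # 3. Add the best-of-generation individual to the archive
    best = max(population, key=lambda p: fitness[id(p)])
    archive.append(best)
    return fitness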
Successes of Reinforcement Learning
Reinforcement Learning ideas have been independently
validated in many application areas [Sutton, ICML 2009]
RL application areas (survey by Csaba Szepesvari of 77 recent application papers, based on an IEEE.org search for the keywords "RL" and "application"):
- Process Control: 23%
- Networking: 21%
- Resource Management: 18%
- Robotics: 13%
- Other: 8%
- Autonomic Computing: 6%
- Traffic: 6%
- Finance: 4%
Examples: signal processing, natural language processing, web services, brain-computer interfaces, aircraft control, engine control, bio/chemical reactors, sensor networks, routing, call admission control, network resource management, power systems, inventory control, supply chains, customer service, mobile robots, motion control, Robocup, vision, stoplight control, trains, unmanned vehicles, load balancing, memory management, algorithm tuning, option pricing, asset management
The Reinforcement Learning Paradigm
Reinforcement Learning (RL)
Machine learning paradigm focused on solving problems in which an agent interacts with an environment by taking actions and receiving rewards at discrete time steps. The objective is to find a decision policy that maximizes the cumulative reward.
[Diagram: agent-environment interaction loop – (1) the agent observes state $s_t$, (2) takes action $a_t$, (3) receives reward $r_t$, (4) learns on the basis of the tuple $\langle s_t, a_t, r_t, s_{t+1} \rangle$]
In Othello:
agent =⇒ player
environment =⇒ game
state =⇒ board state
action =⇒ legal move
reward =⇒ game result
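The interaction loop can be sketched generically as below; the reset/step environment interface mirrors common RL conventions and is an assumption, not an API from the paper.

def run_episode(env, policy, learn):
    """Generic agent-environment loop over <s_t, a_t, r_t, s_{t+1}> tuples."""
    state = env.reset()                               # 1. observe state s_t
    done = False
    while not done:
        action = policy(state)                        # 2. take action a_t
        next_state, reward, done = env.step(action)   # 3. receive reward r_t
        learn(state, action, reward, next_state)      # 4. learn from the tuple
        state = next_state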
Key Ideas of Reinforcement Learning
The agent's goal is to learn a policy $\pi : S \mapsto A$ that maximizes the expected return $R_t$ (a function of the future rewards $r_{t+1}, r_{t+2}, \ldots$):
- rewards define the goal of learning
- cumulative discounted return: $R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$
- delayed rewards – the temporal credit assignment problem

RL methods specify how the agent changes its policy as a result of experience:
- trial-and-error search
- exploration-exploitation trade-off

All efficient methods for solving sequential decision problems estimate a value function as an intermediate step: $V^{\pi}(s) = E_{\pi}[R_t \mid s_t = s]$

[Diagram: generalized policy iteration – alternating policy evaluation ($V \to V^{\pi}$) and policy improvement ($\pi \to \mathrm{greedy}(V)$), converging to the optimal $\pi^*$ and $V^*$]
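As a small worked illustration of the discounted return (with hypothetical numbers):

def discounted_return(rewards, gamma):
    """R_t = sum_k gamma^k * r_{t+k}, computed from a list of future rewards."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# e.g. rewards [0, 0, 1] with gamma = 0.9 give R_t = 0.9**2 = 0.81
assert abs(discounted_return([0, 0, 1], 0.9) - 0.81) < 1e-12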
Prediction Learning Problem
Experience-outcome sequence: $s_1, s_2, s_3, \ldots, s_T; z$
Sequence of predictions of $z$: $V(s_1), V(s_2), V(s_3), \ldots, V(s_T)$

Supervised learning:
$V(s_t) := V(s_t) + \alpha[z - V(s_t)]$

Temporal difference learning:
$V(s_t) := V(s_t) + \alpha[V(s_{t+1}) - V(s_t)]$

[Figures: the driving-home example – predicted total travel time re-estimated at successive situations (leaving office, reaching car, exiting highway, secondary road, home street, arriving home); supervised learning shifts every prediction toward the actual outcome, while TD shifts each prediction toward the one that follows it]
Figures in slides 19-22 come from “Reinforcement Learning: An Introduction” by R. Sutton and A. Barto
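A minimal sketch contrasting the two update rules on a tabular predictor; all names are illustrative.

def supervised_update(V, states, z, alpha):
    """Move every prediction toward the final outcome z
    (requires waiting for the whole sequence)."""
    for s in states:
        V[s] += alpha * (z - V[s])

def td_update(V, s, s_next, alpha):
    """Move V(s_t) toward V(s_{t+1}); can be applied online, step by step."""
    V[s] += alpha * (V[s_next] - V[s])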
Prediction Learning Problem =⇒ Policy Evaluation
Sample experience following a policy $\pi$: $s_1, r_1, s_2, r_2, \ldots, s_T, r_T$
Sequence of estimates of $V^{\pi}(s_t)$: $V(s_1), V(s_2), \ldots, V(s_T)$

Monte-Carlo method:
$V(s_t) := V(s_t) + \alpha[R_t - V(s_t)]$, where $R_t = \sum_{k=0}^{T-t} \gamma^k r_{t+k}$

TD(0) method:
$V(s_t) := V(s_t) + \alpha[R_t^{(1)} - V(s_t)]$, where $R_t^{(1)} = r_t + \gamma V(s_{t+1})$

The $n$-step return spans the spectrum from TD (1-step) through 2-step, 3-step, ..., to Monte Carlo:
$R_t^{(n)} = \sum_{k=0}^{n-1} \gamma^k r_{t+k} + \gamma^n V(s_{t+n})$
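The n-step return above can be computed directly from recorded experience; the list-based indexing (rewards[t] holds $r_t$, values[t] holds $V(s_t)$) is an illustrative assumption.

def n_step_return(rewards, values, t, n, gamma):
    """R_t^(n) = sum_{k<n} gamma^k r_{t+k} + gamma^n V(s_{t+n});
    assumes t + n stays within the recorded episode."""
    g = sum(gamma ** k * rewards[t + k] for k in range(n))
    return g + gamma ** n * values[t + n]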
Eligibility Traces – TD(λ) Method
TD(λ) method:
$R_t^{(n)} = \sum_{k=0}^{n-1} \gamma^k r_{t+k} + \gamma^n V(s_{t+n})$
$R_t^{(\lambda)} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} R_t^{(n)}$
$V(s_t) := V(s_t) + \alpha[R_t^{(\lambda)} - V(s_t)]$

[Figure: weighting of the n-step returns in the λ-return – the weight of $R_t^{(n)}$ is $(1-\lambda)\lambda^{n-1}$, decaying by λ per step, with the remaining weight given to the actual final return; total area = 1]

Special cases, for comparison:
Monte-Carlo method (λ = 1): $V(s_t) := V(s_t) + \alpha[R_t - V(s_t)]$, where $R_t = \sum_{k=0}^{T-t} \gamma^k r_{t+k}$
TD(0) method (λ = 0): $V(s_t) := V(s_t) + \alpha[R_t^{(1)} - V(s_t)]$, where $R_t^{(1)} = r_t + \gamma V(s_{t+1})$
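A minimal forward-view sketch of the λ-return, truncated at the end of an episode of length T; the list-based indexing is the same illustrative assumption as in the n-step sketch above.

def lambda_return(rewards, values, t, lam, gamma):
    """R_t^(lambda) = (1-lambda) * sum_{n=1}^{T-t-1} lambda^(n-1) R_t^(n)
    + lambda^(T-t-1) * R_t, where the actual final return takes the
    remaining weight (the weights sum to 1)."""
    T = len(rewards)
    def n_step(n):  # R_t^(n), as defined on this slide
        return (sum(gamma ** k * rewards[t + k] for k in range(n))
                + gamma ** n * values[t + n])
    g = sum((1 - lam) * lam ** (n - 1) * n_step(n) for n in range(1, T - t))
    r_mc = sum(gamma ** k * rewards[t + k] for k in range(T - t))  # actual return
    return g + lam ** (T - t - 1) * r_mc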
Gradient-Descent Temporal Difference Learning
- Tabular TD(λ) does not address the issue of generalization
- Value function $V_t$ represented as a parameterized functional form with a modifiable parameter vector $w_t$
- Weight update rule for gradient-descent TD(λ):
  $w_{t+1} := w_t + \alpha[R_t^{(\lambda)} - V_t(s_t)] \nabla_{w_t} V_t(s_t)$

TDL applied to Othello/Go:
- Position evaluation function $f(b_t)$ used to compute $V_t$
- Learning through modifying the WPC weight vector
- Training data obtained by self-play
- Impressive application – TD-Gammon by Gerald Tesauro
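The following sketches one gradient-descent TD step for a WPC-based evaluator, shown for λ = 0 for brevity (TD(λ) additionally maintains an eligibility-trace vector). Squashing the WPC output with tanh is a common choice in this line of work, but that choice and all identifiers here are assumptions, not the authors' confirmed setup.

import math

def td_update_wpc(w, board, board_next, alpha, gamma=1.0, reward=0.0):
    """One gradient-descent TD(0) step, assuming V(b) = tanh(f(b)) with
    f(b) = sum_i w_i * b_i, so that dV/dw_i = (1 - V^2) * b_i.
    In game play the reward is typically nonzero only at the end of a game."""
    v = math.tanh(sum(wi * bi for wi, bi in zip(w, board)))
    v_next = math.tanh(sum(wi * bi for wi, bi in zip(w, board_next)))
    delta = reward + gamma * v_next - v          # TD error
    scale = alpha * delta * (1.0 - v * v)        # alpha * delta * dV/df
    for i, bi in enumerate(board):
        w[i] += scale * bi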
Coevolutionary Temporal Difference Learning
A hybrid of coevolutionary search and reinforcement learning that works by interlacing one-population competitive coevolution with temporal difference learning.

The population of players is subject to alternating learning phases:
- TDL phase – each population member plays k games with itself
- CEL phase – a single round-robin tournament between population members (and, optionally, also archival individuals)

Each TDL-CEL cycle is followed by the standard stages of fitness assignment, selection, and recombination (sketched below).
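A high-level sketch of the CTDL loop described above; all callables are stand-ins, and evaluate can be the round-robin-plus-HoF evaluation phase sketched earlier (assumed to return a fitness mapping and to append the best-of-generation individual to the archive).

def ctdl(population, archive, k, td_self_play, evaluate, select_and_breed,
         n_cycles):
    """CTDL main loop: interlace TDL self-play with competitive coevolution."""
    for _ in range(n_cycles):
        # TDL phase: each member plays k games with itself, updating its
        # WPC weights by temporal difference learning
        for player in population:
            td_self_play(player, games=k)
        # CEL phase: tournament-based (context-sensitive) fitness assignment
        fitness = evaluate(population, archive)
        # Standard stages: selection and recombination on tournament fitness
        population = select_and_breed(population, fitness)
    return archive[-1]   # best-of-generation from the final cycle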
Lamarckian Coevolution Perspective
Memetic Algorithms combine population-based global search with an individual local improvement procedure.

Lamarckian theory of evolution:
- the concept of the inheritance of acquired traits
- discredited in the field of biology
- successfully implemented as Lamarckian EA

CTDL as a Lamarckian Coevolutionary Algorithm:
- CEL performs exploration of the solution space
- TDL is responsible for its exploitation, by means of local search performed on a dynamic fitness landscape
Performance vs. Random Othello Player
[Plot: probability of winning vs. games played (×100,000); curves: CTDL + HoF, CTDL, TDL, CEL + HoF, CEL]
Performance vs. Heuristic Player
[Plot: probability of winning vs. games played (×100,000); curves: CTDL + HoF, CTDL, TDL, CEL + HoF, CEL]
Relative Performance Progress Over Time
[Plot: points in tournaments vs. games played (×100,000); curves: CTDL + HoF, CTDL, TDL, CEL + HoF, CEL]
Best Evolved Othello WPC
 1.04 -0.20  0.50 -0.01  0.02  0.46 -0.39  0.87
-0.51 -0.74 -0.19 -0.15 -0.18 -0.26 -0.56 -0.24
 0.35 -0.13  0.08 -0.01  0.02  0.05 -0.28  0.50
 0.02 -0.10  0.01  0.00 -0.01  0.01 -0.11 -0.10
-0.34 -0.11  0.01 -0.07 -0.03  0.00 -0.15 -0.05
 0.61 -0.20  0.07 -0.01 -0.02  0.05 -0.23  0.42
-0.44 -0.61 -0.25 -0.12 -0.12 -0.13 -0.73 -0.21
 0.99 -0.32  0.56 -0.02 -0.31  0.78 -0.54  1.10
Initialization and Evolution Progress
TD(λ) Performance vs. Go Heuristic Player
[Plot: probability of winning vs. games played (×100,000); curves: λ = 0.0, 0.4, 0.8, 0.9, 0.95, 1.0]
Performance vs. Go Heuristic Player
[Plot: probability of winning vs. games played (×100,000); curves: CTDL + HoF, CTDL, TDL, CEL + HoF, CEL]
Performance vs. Average Liberty Player
[Plot: probability of winning vs. games played (×100,000); curves: CTDL + HoF, CTDL, TDL, CEL + HoF, CEL]
Relative Performance Progress Over Time
[Plot: points in tournaments vs. games played (×100,000); curves: CTDL + HoF, CTDL, TDL, CEL + HoF, CEL]
Summary
- CTDL benefits from the mutually complementary characteristics of both constituent methods.
- It retains an unsupervised character – useful when knowledge of the problem domain is unavailable or expensive to obtain.
- CTDL admits an interesting biological interpretation as an analogy to the Lamarckian theory of evolution.
- Further investigation of CTDL in the context of other challenging problems is needed.
Future Work
- Employing a more complex learner architecture than the WPC
- Using CTDL with two-population coevolution, with solutions and tests bred separately (the learner-teacher paradigm)
- Including more advanced archive methods such as LAPCA or IPCA
- Changing the character of the TDL phase so that it influences only the evaluation process (which would model the Baldwin Effect)