Knowledge Gradient Exploration in Online Kernel-Based LSPI

Saba Yahyaa and Bernard Manderick

Vrije Universiteit Brussel, Pleinlaan 2, 1050 Brussels

Abstract

We introduce online kernel-based LSPI (least squares policy iteration), which combines features of online LSPI and of offline kernel-based LSPI. The knowledge gradient is used as exploration policy in both online LSPI and online kernel-based LSPI in order to compare their performance on two discrete Markov decision problems. The automatic feature selection in online kernel-based LSPI, a result of the approximate linear dependency based kernel sparsification, improves performance compared to online LSPI.

1 Introduction

A reinforcement learning (RL) agent has to learn to make optimal sequential decisions while interacting with its environment. At each time step, the agent takes an action and as a result the environment transits from the current state to the next one while the agent receives feedback from the environment in the form of a scalar reward.

The mapping from states to actions that specifies which actions to take in states is called a policy π and the goal of the agent is to find the optimal policy π∗, i.e. the one that maximises the total expected discounted reward, as soon as possible. The state-action value function Qπ(s, a) is defined as the total expected discounted reward obtained when the agent starts in state s, takes action a, and follows policy π thereafter. The optimal policy maximises these Qπ(s, a) values.

When the agent’s environment can be modelled as a Markov decision process (MDP) then the Bellman equations for the state-action value functions, one per state-action pair, can be written down and can be solved by algorithms like policy iteration or value iteration [10]. We refer to Section 2 for more details.

When no such model is available, the Bellman equations cannot be written down. Instead, the agent has to rely only on information collected while interacting with its environment. At each time step, the information collected consists of the current state, the action taken in that state, the reward obtained and the next state of the environment. The agent can either learn offline when first a batch of past experience is collected and subsequently used and reused to learn, or the agent can learn online when it tries to improve its behaviour at each time step based on the current information.

Fortunately, the optimal Q-values can still be determined using Q-learning [10], which represents the action-value function Qπ(s, a) as a lookup table and uses the agent's experience to build it.

Unfortunately, when the state and/or action spaces are large or continuous, the agent faces a challenge called the curse of dimensionality, since the memory needed to store all the Q-values grows quickly with the number of states and actions, and computing all Q-values becomes infeasible. To handle this challenge, function approximation methods have been introduced to approximate the Q-values. For example, Lagoudakis and Parr proposed least squares policy iteration (LSPI) to find the optimal policy when no model of the environment is available [5]. LSPI is an example of both approximate policy iteration and offline learning; it approximates the Q-values using a linear combination of predefined basis functions. These predefined basis functions have a large impact on the performance of LSPI in terms of the number of iterations LSPI needs to converge to a policy, the probability that the converged policy is optimal, and the accuracy of the approximated Q-values.

To improve the accuracy of the approximated Q-values and to find a (near) optimal policy, Xu et al. proposed kernel-based LSPI (KBLSPI), an example of offline approximate policy iteration that uses Mercer kernels to approximate the Q-values [11]. Moreover, KBLSPI provides automatic feature selection for the kernel basis functions, since it uses the approximate linear dependency sparsification method described in [13]. Buşoniu et al. adapted LSPI, which does offline learning, to online reinforcement learning; the result is called online LSPI [4]. A good online algorithm must produce acceptable performance quickly, rather than only at the end of the learning process as in offline learning. To obtain good performance, an online algorithm has to find a proper balance between exploitation, i.e. using the collected information in the best possible way, and exploration, i.e. testing out the available alternatives [10]. Several exploration policies are available for that purpose. One of the most popular is ε-greedy exploration, which selects the action with the highest estimated Q-value with probability 1 − ε and selects one of the actions available in the current state uniformly at random with probability ε. To get good performance, the parameter ε has to be tuned for each problem. To get rid of parameter tuning and to increase the performance of online LSPI, Yahyaa and Manderick proposed using the knowledge gradient policy in online LSPI [16].

To improve the performance of online-LSPI and to obtain automatic feature selection, we propose online kernel-based LSPI and use the knowledge gradient (KG) as exploration policy. The rest of the paper is organised as follows. In Section 2 we present Markov decision processes, LSPI, the knowledge gradient policy for online learning, kernel-based LSPI and the approximate linear dependency test. In Section 3 we present the knowledge gradient policy in online kernel-based LSPI. In Section 4 we describe the domains used in our experiments and our results. We conclude in Section 5.

2 Preliminaries

In this section, we discuss Markov decision processes (MDPs), online LSPI, the knowledge gradient exploration policy (KG), offline kernel-based LSPI (KBLSPI) and approximate linear dependency (ALD).

Markov Decision Process: A finite MDP is a 5-tuple (S, A, P, R, γ), where the state space S contains a finite number of states s and the action space A contains a finite number of actions a; the transition probabilities P(s, a, s′) give the conditional probabilities p(s′|s, a) that the environment transits to state s′ when the agent takes action a in state s; the reward distributions R(s, a, s′) give the expected immediate reward when the environment transits to state s′ after taking action a in state s; and γ ∈ [0, 1) is the discount factor [8, 10].

A deterministic policy π : S → A determines which action a the agent takes in each state s. For the MDPs considered, there is always a deterministic optimal policy, so we can restrict the search process to such policies [8, 10]. By definition, the state-action value function Qπ(s, a) for a policy π gives the expected total discounted reward Eπ(Σ_{i=t}^{∞} γ^{i−t} ri) when the agent starts in state s at time t, takes action a and follows policy π thereafter. The goal of the agent is to find the optimal policy π∗, i.e. the one that maximises Qπ for every state s and action a: π∗(s) = argmax_{a∈A} Q∗(s, a), where Q∗(s, a) = max_π Qπ(s, a) is the optimal state-action value function. For the MDPs considered, the Bellman equations for the state-action value function Qπ are given by

Qπ(s, a) = R(s, a, s′) + γ Σ_{s′} P(s, a, s′) Qπ(s′, a′).   (1)

In Equation 1, the sum is taken over all states s′ that can be reached from state s when action a is taken, and the action a′ taken in the next state s′ is determined by the policy π, i.e. a′ = π(s′). If the MDP is completely known, then algorithms such as value or policy iteration find the optimal policy π∗. Policy iteration starts with an initial policy π0, e.g. randomly selected, and repeats the next two steps until no further improvement is found: 1) policy evaluation, where the current policy πi is evaluated using the Bellman equations (1) to calculate the corresponding value function Qπi, and 2) policy improvement, where this value function is used to find an improved, greedy policy πi+1(s) = argmax_{a∈A} Qπi(s, a).
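To make the two steps concrete, the following is a minimal NumPy sketch of tabular policy iteration for a fully known finite MDP; the array layout (P[s, a, s'] for transition probabilities, R[s, a, s'] for expected rewards) and all function names are illustrative choices, not taken from the paper.

import numpy as np

def policy_iteration(P, R, gamma=0.9, tol=1e-10, max_eval_iters=1000):
    """Tabular policy iteration for a fully known finite MDP.

    P[s, a, s2] = p(s2 | s, a); R[s, a, s2] = expected reward of that transition.
    Returns the converged deterministic policy and its Q-table.
    """
    n_states, n_actions, _ = P.shape
    exp_R = np.einsum('ask,ask->as', P, R)      # expected one-step reward per (s, a)
    policy = np.zeros(n_states, dtype=int)      # arbitrary initial policy pi_0

    while True:
        # 1) Policy evaluation: iterate the Bellman equation (1) for the fixed policy.
        Q = np.zeros((n_states, n_actions))
        for _ in range(max_eval_iters):
            V = Q[np.arange(n_states), policy]  # V(s) = Q(s, pi(s))
            Q_new = exp_R + gamma * P.dot(V)    # sum over reachable next states s'
            if np.max(np.abs(Q_new - Q)) < tol:
                Q = Q_new
                break
            Q = Q_new
        # 2) Policy improvement: act greedily with respect to the evaluated Q.
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):  # no further improvement
            return policy, Q
        policy = new_policy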

For finite MDPs, the action-value function Qπ for a policy π can be represented by a lookup table of size |S| × |A|, one entry per state-action pair. However, when the state and/or action spaces are large, this approach becomes computationally infeasible due to the curse of dimensionality, and one has to rely on function approximation instead. Moreover, the agent does not know the transition probabilities P(s, a, s′) and the reward distributions R(s, a, s′). Therefore, it must rely on information collected while interacting with the environment to learn the optimal policy. The information collected is a trajectory of samples of the form (st, at, rt, st+1) or (st, at, rt, st+1, at+1), where st, at, rt, st+1 and at+1 are the state, the action taken in that state, the reward, the next state, and the next action in the next state, respectively. To overcome these problems, least squares policy iteration (LSPI) uses such samples to approximate the Qπ-values [5].

More recently, Buşoniu et al. have adapted LSPI so that it can work online [4], and Yahyaa and Manderick have used the knowledge gradient (KG) exploration policy in this online LSPI [16].

Least Squares Policy Iteration: LSPI approximates the action-value function Qπ for a policy π in a linear way [5]:

Q̂π(s, a; wπ) = Σ_{i=1}^{n} φi(s, a) wiπ,   (2)

where n, with n << |S × A|, is the number of basis functions, the weights (wiπ)_{i=1}^{n} are parameters to be learned for each policy π, and {φi(s, a)}_{i=1}^{n} is the set of predetermined basis functions. Let Φ be the basis matrix of size |S × A| × n, where each row contains the values of all basis functions in one of the state-action pairs (s, a) and each column contains the values of one of the basis functions φi in all state-action pairs, and let φ(s, a) = (φ1(s, a), · · · , φn(s, a))^T be a column vector of length n.

Given a trajectory of samples (st, at, rt, st+1), t = 1, · · · , L, offline LSPI is an example of approximate policy iteration and repeats the following two steps until no further improvement in the policy is obtained: 1) approximate policy evaluation, which approximates the state-action value function Qπ of the current policy π, and 2) approximate policy improvement, which derives from the current estimated state-action value function Q̂π a better policy π′, i.e. π′(s) = argmax_{a∈A} Q̂π(s, a).

Using the least squares error of the projected Bellman equation, Equation 1, the weight vector wπ can be approximated as follows [5]:

Â wπ = b̂,   (3)

where Â is a matrix and b̂ is a vector. LSPI updates Â and b̂ from all available samples as follows:

Ât = Ât−1 + φ(st, at) [φ(st, at) − γ φ(st+1, π(st+1))]^T,   b̂t = b̂t−1 + φ(st, at) rt,   (4)

where T denotes the transpose and rt is the immediate reward obtained at time step t. After iterating over all collected samples, wπ can be found. Buşoniu et al. adapted offline LSPI for online learning [4]. The changes with respect to the offline algorithm are twofold. First, online LSPI updates the matrix Â and the vector b̂ after every few samples: after every Kθ samples obtained from the environment, it estimates the weight vector wπ for the current policy π and computes the corresponding approximated Q̂-function. Online LSPI then derives an improved new learned policy π′ from it, i.e. π′(s) = argmax_{a∈A} Q̂π(s, a). When Kθ = 1, online LSPI is called fully optimistic, and when Kθ > 1 is a small value, online LSPI is partially optimistic. The second change is that online LSPI needs an exploration policy, and Buşoniu et al. proposed using decaying ε-greedy exploration for that purpose. Yahyaa and Manderick proposed using the knowledge gradient (KG) policy as exploration policy instead of ε-greedy and showed that the performance of online LSPI increases, e.g. the average frequency with which the learned policy converges to the optimal policy [16]. Therefore, we use the KG policy in our algorithm and experiments.
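As an illustration of Equations (2)-(4), here is a small sketch of the batch least-squares step, assuming a user-supplied feature map phi(s, a) that returns a length-n NumPy vector; the sample format and all function names are illustrative, not the paper's implementation.

import numpy as np

def lspi_weights(samples, phi, policy, n_features, gamma=0.9):
    """Accumulate A_hat and b_hat over a batch of samples (Eq. 4) and solve A w = b (Eq. 3).

    samples: iterable of (s, a, r, s_next); phi(s, a): length-n_features feature vector;
    policy(s): action chosen by the current policy in state s.
    """
    A = np.zeros((n_features, n_features))
    b = np.zeros(n_features)
    for s, a, r, s_next in samples:
        f = phi(s, a)
        f_next = phi(s_next, policy(s_next))
        A += np.outer(f, f - gamma * f_next)    # A_t = A_{t-1} + phi (phi - gamma phi')^T
        b += r * f                              # b_t = b_{t-1} + phi r
    w = np.linalg.lstsq(A, b, rcond=None)[0]    # least-squares / pseudo-inverse solve of Eq. (3)
    return w

def q_hat(s, a, w, phi):
    """Approximate Q(s, a) as the linear combination of Eq. (2)."""
    return float(phi(s, a) @ w)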

Knowledge Gradient Exploration Policy: KG [2] assumes that the rewards of each action a are drawn according to a probability distribution, here a normal distribution N(µa, σa) with mean µa and standard deviation σa. The current estimates, based on the rewards obtained so far, are denoted by µ̂a and σ̂a, and the root-mean-square error (RMSE) of the mean reward µa given n rewards resulting from action a is σ̄a = σ̂a/√n. KG is an index strategy that determines for each action a the index

V^KG(a) = σ̄a f(xa),   xa = −|µ̂a − max_{a′≠a} µ̂a′| / σ̄a,

where f(x) = φ^KG(x) + x Φ^KG(x), φ^KG(x) = (1/√(2π)) exp(−x²/2) is the density of the standard normal distribution, Φ^KG(x) = ∫_{−∞}^{x} φ^KG(x′) dx′ is its cumulative distribution, and σ̄a is the RMSE of the estimated mean reward µ̂a. KG then selects the next action according to

a^KG = argmax_{a∈A} [ µ̂a + (γ/(1 − γ)) V^KG(a) ],   (5)

where the second term on the right-hand side is the total discounted index of action a. KG prefers those actions about which comparatively little is known, namely the actions whose RMSE σ̄a around the estimated mean reward µ̂a is large. Thus, KG prefers an action a over its alternatives if the confidence in its estimated mean reward µ̂a is low. KG is easy to implement and has no parameters to be tuned, unlike the ε-greedy policy [10]. For further reading on how KG can be used in MDPs, we refer to [16].
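A sketch of the KG action selection of Equation (5) is given below. It assumes per-action arrays of estimated means, estimated standard deviations and sample counts, and uses the index V^KG(a) = σ̄a f(−|µ̂a − max_{a′≠a} µ̂a′|/σ̄a) as reconstructed above; the bookkeeping of these estimates is left to the caller, and the names are illustrative.

import numpy as np
from scipy.stats import norm

def kg_action(mu_hat, sigma_hat, counts, gamma=0.9):
    """Knowledge gradient action selection, Eq. (5).

    mu_hat, sigma_hat, counts: arrays over the actions available in the current state,
    holding the estimated mean reward, estimated standard deviation and the number of
    times each action has been tried (assumed >= 1).
    """
    mu_hat = np.asarray(mu_hat, dtype=float)
    rmse = np.asarray(sigma_hat, dtype=float) / np.sqrt(counts)   # sigma_bar_a = sigma_hat_a / sqrt(n_a)

    v_kg = np.zeros_like(mu_hat)
    for a in range(len(mu_hat)):
        best_other = np.max(np.delete(mu_hat, a))                 # best competing estimate
        x = -abs(mu_hat[a] - best_other) / max(rmse[a], 1e-12)
        v_kg[a] = rmse[a] * (norm.pdf(x) + x * norm.cdf(x))       # f(x) = phi(x) + x Phi(x)

    return int(np.argmax(mu_hat + (gamma / (1.0 - gamma)) * v_kg))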

Kernel-Based Least Squares Policy Iteration: Kernel-based LSPI [12] is a kernelised version of offline LSPI; it uses Mercer kernels in the approximate policy evaluation and improvement steps [11].¹ Given a trajectory of length L of samples generated by a random initial policy, offline kernel-based LSPI (KBLSPI) uses the approximate linear dependency based sparsification method to select a subset of the data samples, which constitutes a dictionary Dic = {(si, ai)}_{i=1}^{|Dic|} with corresponding kernel matrix KDic of size |Dic| × |Dic| [13]. Kernel-based LSPI then repeats the following two steps. 1) Approximate policy evaluation: kernel-based LSPI approximates the weight vector wπ for policy π, which can be calculated by constructing Â and b̂ from all available samples as follows:

Ât = Ât−1 + k((st, at), j) [k((st, at), j) − γ k((st+1, π(st+1)), j)]^T,   (6)

b̂t = b̂t−1 + k((st, at), j) rt,   j ∈ Dic, j = 1, 2, · · · , |Dic|,   (7)

where k(·, ·) is a kernel function between two points (a state-action pair and a dictionary element j²), and k((st, at), j) denotes the column vector obtained by evaluating the kernel against all dictionary elements. After iterating over all the collected samples, wπ can be found and the approximated Qπ-value for policy π is the following linear combination:

Q̂π(s, a) = Σ_{j=1}^{|Dic|} wjπ k((s, a), j).   (8)

2) Approximate policy improvement: KBLSPI derives a new policy which is the greedy one, i.e. π′(s) = argmax_{a∈A} Q̂π(s, a). The above two steps are repeated until the improved policy no longer changes.

¹ Given a finite set of points {z1, z2, · · · , zt}, where zi is a state-action pair, with the corresponding set of basis functions φ(z) : z → R, Mercer's theorem states that the kernel function K is a positive definite matrix, with K(zi, zj) = <φ(zi), φ(zj)>.
² We use j to indicate the dictionary element, i.e. the state-action pair zj ∈ Dic.
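The kernelised updates of Equations (6)-(8) can be sketched as follows, assuming a dictionary given as a list of state-action pairs and a kernel function kernel(z1, z2); the names and data layout are illustrative only.

import numpy as np

def kernel_vector(kernel, dictionary, z):
    """k(., z): the kernel evaluated between every dictionary element and the pair z."""
    return np.array([kernel(zj, z) for zj in dictionary])

def kblspi_sample_update(A, b, kernel, dictionary, z_t, z_next, r_t, gamma=0.9):
    """One-sample update of A_hat and b_hat in kernel-based LSPI, Eqs. (6)-(7)."""
    k_t = kernel_vector(kernel, dictionary, z_t)
    k_next = kernel_vector(kernel, dictionary, z_next)
    A = A + np.outer(k_t, k_t - gamma * k_next)
    b = b + r_t * k_t
    return A, b

def q_hat_kernel(w, kernel, dictionary, z):
    """Approximate Q-value of Eq. (8): a linear combination of dictionary kernels."""
    return float(w @ kernel_vector(kernel, dictionary, z))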

Approximate Linear Dependency: Given a set of data samples D = {z1, . . . , zL} from an MDP, where each zi is a state-action pair, and the corresponding set of linearly independent basis functions Φ = {φ(z1), · · · , φ(zL)}, the approximate linear dependency (ALD) method [13] finds a subset Dic ⊂ D whose elements {zi}_{i=1}^{|Dic|} and corresponding basis functions are stored in ΦDic ⊂ Φ.

The data dictionary Dic is initially empty, i.e. Dic = {}, and ALD is implemented by testing every basis function φ in Φ, one at a time. If the basis function φ(zt) cannot be approximated, within a predefined accuracy v, by a linear combination of the basis functions of the elements stored in Dict, then φ(zt) is added to ΦDict and zt is added to Dict; otherwise zt is not added to Dict and φ(zt) is not added to ΦDict. As a result, after the ALD test, the basis functions of ΦDic can approximate all the basis functions of Φ.

At time step t, let Dict = {zj}_{j=1}^{|Dict|} with the corresponding basis functions stored in ΦDict = {φ(zj)}_{j=1}^{|Dict|}, and let zt be the state-action pair given at time t. The ALD test on the basis function φ(zt) supposes that the basis functions are linearly dependent and uses the least squares error to approximate φ(zt) by the basis functions of the elements in Dict; for more detail we refer to [1]. The least squares error is:

error = min_c ‖ Σ_{j=1}^{|Dict|} cj φ(zj) − φ(zt) ‖² < v,   (9)

error = k(zt, zt) − kDict(zt)^T ct,  where ct = KDict^{−1} kDict(zt) and kDict(zt) = (k(j, zt))_{j=1}^{|Dict|}.   (10)

If the error is larger than the predefined accuracy v, then zt is added to the dictionary, i.e. Dict+1 = Dict ∪ {zt}; otherwise Dict+1 = Dict. After testing all the elements in the data sample set D, the matrix KDic^{−1} can be computed; this is the offline learning method. For online learning, the matrix KDic^{−1} is updated at each time step [14].

At each time step t, if the error that results from testing the basis function of zt is smaller than v, then Dict+1 = Dict and KDict+1^{−1} = KDict^{−1}; otherwise Dict+1 = Dict ∪ {zt} and the matrix KDict+1^{−1} is updated as

KDict+1^{−1} = (1/errort) [[ errort KDict^{−1} + ct ct^T,  −ct ], [ −ct^T,  1 ]],   (11)

where the partitioned matrix has errort KDict^{−1} + ct ct^T in the upper-left block, −ct in the last column, −ct^T in the last row, and 1 in the lower-right corner.
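A sketch of the online ALD test and dictionary update of Equations (9)-(11) is shown below; it uses the rank-one inverse update with the ct ct^T term as written in Equation (11), and the function and variable names are illustrative.

import numpy as np

def ald_update(K_inv, dictionary, kernel, z_t, v=1e-4):
    """ALD test (Eqs. 9-10) and rank-one update of the inverse kernel matrix (Eq. 11).

    K_inv: inverse kernel matrix of the current dictionary; dictionary: list of state-action pairs.
    Returns (K_inv, dictionary, added), where added tells whether z_t was inserted.
    """
    if not dictionary:                              # empty dictionary: always add z_t
        return np.array([[1.0 / kernel(z_t, z_t)]]), [z_t], True

    k_vec = np.array([kernel(zj, z_t) for zj in dictionary])
    c = K_inv @ k_vec                               # least-squares coefficients c_t
    error = kernel(z_t, z_t) - k_vec @ c            # approximation error, Eq. (10)
    if error <= v:                                  # phi(z_t) is approximately linearly dependent
        return K_inv, dictionary, False

    d = len(dictionary)                             # grow the inverse kernel matrix, Eq. (11)
    new_K_inv = np.zeros((d + 1, d + 1))
    new_K_inv[:d, :d] = error * K_inv + np.outer(c, c)
    new_K_inv[:d, d] = -c
    new_K_inv[d, :d] = -c
    new_K_inv[d, d] = 1.0
    return new_K_inv / error, dictionary + [z_t], True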

3 Online Kernel-Based Least Squares Policy Iteration

Online kernel-based LSPI is a kernelised version of online LSPI; the pseudocode is given in Algorithm 1. At each time step t, online kernel-based LSPI updates the matrix Â and the vector b̂ (steps 15-16 in Algorithm 1). After every Kθ samples obtained from the environment, online kernel-based LSPI estimates the weight vector wπ for the current policy π and computes the corresponding Q̂π, i.e. approximate policy evaluation. It then derives an improved new learned policy π′, which is the greedy one, i.e. approximate policy improvement (steps 17-20 in Algorithm 1).

At each time step t, Algorithm 1 uses the KG exploration policy to select an action at in the state st (step 4) and performs the ALD test on the basis function of this state-action pair (st, at) to provide feature selection (steps 9-13).

4 Experimental Results and Evaluation

In this section, we describe the test domains and the experimental setup, followed by the experiments in which we compare online LSPI and online KBLSPI using the KG policy. All experiments are implemented in MATLAB.

4.1 Test Domain and Experimental Setup

The test domain consists of two MDPs, shown in Figure 1, each with discount factor γ = 0.9. The first domain is the 50-chain [5]. The 50-chain consists of a sequence of 50 states, labelled s1 to s50. In each state, the agent has two actions, GoRight (R) or GoLeft (L). An action succeeds with probability 0.9, changing the state in the intended direction, and fails with probability 0.1, changing the state in the opposite direction. The agent gets reward 1 in states s10 and s41 and 0 elsewhere. The optimal policy is to take R from state s1 through state s10 and from state s26 through state s40, and to take L from state s11 through state s25 and from state s41 through state s50 [5]. The second domain is the grid world used in [10]. The agent has four actions, Go Up, Down, Left and Right; for each of them it transits to the intended state with probability 0.7 and, with probability 0.1 each, moves in one of the other directions instead. The agent gets reward 1 if it reaches the goal state, −1 if it hits a wall, and 0 elsewhere.

The experimental setup is as follows. For each of the two MDPs, we compared online-LSPI and online-KBLSPI, using the knowledge gradient (KG) policy as exploration policy, over 1000 experiments, each of length L. The performance measures are: 1) the average frequency at each time step, i.e. at each time step t we checked for each experiment whether the learned policy (step 19 in Algorithm 1) had reached the optimal policy and averaged over the 1000 experiments; and 2) the average cumulative frequency at each time step, i.e. the cumulative sum of the average frequency up to time step t.

Algorithm 1 (Online-KBLSPI)

1. Input: |S|, |A|, discount factor γ, initial policy π0, length of trajectory L, accuracy v, policy improvement interval Kθ, set of basis functions Φ = {φ1, · · · , φn}, reward r ∼ N(µa, σa²).
2. Initialise: Â ← 0, b̂ ← 0, initial state st, K_{|SA|×|SA|} = <Φ^T, Φ>, Dict = {}, KDict^{−1} = [], Q̂_{|SA|} ← 0, l ← 0.
3. For t = 1, · · · , L
4.   at ← KG (Equation 5)
5.   Take at in st; observe the next state st+1 and reward rt; at+1 ← πt(st+1)
6.   zt ← st · |A| + at,  zt+1 ← st+1 · |A| + at+1
7.   For zi ∈ {zt, zt+1}
8.     k(·, zi) = [k(1, zi), · · · , k(|Dict|, zi)]^T,  c(zi) = KDict^{−1} k(·, zi)
9.     error(zi) = k(zi, zi) − k(·, zi)^T c(zi)
10.    If error(zi) > v
11.      Dict+1 ← Dict ∪ {zi},  KDict+1^{−1} ← (1/error(zi)) [[ error(zi) KDict^{−1} + c(zi) c(zi)^T,  −c(zi) ], [ −c(zi)^T,  1 ]],  At ← [[ At, 0 ], [ 0, 0 ]],  bt ← [ bt ; 0 ]
12.    Else Dict+1 ← Dict,  KDict+1^{−1} ← KDict^{−1}
13.    End if
14.  End for
15.  At+1 ← At + k(·, zt) [k(·, zt) − γ k(·, zt+1)]^T
16.  bt+1 ← bt + k(·, zt) rt,  with k(·, zt) = [k(1, zt), · · · , k(|Dict+1|, zt)]^T
17.  If t = (l + 1) Kθ then
18.    wl ← At+1^{−1} bt+1,  Q̂l(z) = wl^T k(·, z) for z = z1, z2, · · · , z_{|SA|}, with k(·, z) = [k(1, z), · · · , k(|Dict+1|, z)]^T
19.    π′ ← argmax_a Q̂l(s, a) ∀s ∈ S,  πt ← π′,  l ← l + 1
20.  End if
21.  st ← st+1
22. End for
23. Output: at each time step t, record the reward rt and the learned policy πt
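One detail of step 11 that is easy to get wrong is that Â and b̂ must be zero-padded whenever the ALD test adds a new dictionary element, so that their dimensions keep matching the kernel vectors. A minimal sketch of that padding (illustrative names, not the paper's code) is:

import numpy as np

def pad_for_new_dictionary_element(A, b):
    """Zero-pad A_hat and b_hat after the ALD test adds a dictionary element (step 11)."""
    d = b.shape[0]
    A_new = np.zeros((d + 1, d + 1))
    A_new[:d, :d] = A                   # keep the accumulated statistics
    b_new = np.zeros(d + 1)
    b_new[:d] = b
    return A_new, b_new                 # the new row/column corresponds to the new element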

For the 50-chain, [7] used trajectories of length L = 5000, so we use the same horizon. For the grid domain we adapted the trajectory length L to the number of states, i.e. as the number of states increases, L increases; L is set to 18800 for the grid world.
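Under one plausible reading of the two measures, they can be computed from a boolean matrix of "reached the optimal policy" indicators as follows; the array layout is an assumption made for illustration.

import numpy as np

def performance_measures(reached_optimal):
    """Average frequency and average cumulative frequency over a batch of experiments.

    reached_optimal[e, t] is True when, in experiment e, the policy learned at time step t
    equals the optimal policy (1000 experiments here, L time steps).
    """
    avg_freq = reached_optimal.mean(axis=0)   # average frequency at each time step
    cum_avg_freq = np.cumsum(avg_freq)        # cumulative sum of the average frequency
    return avg_freq, cum_avg_freq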

The KG policy needs an estimated mean and an estimated standard deviation of the reward for each state-action pair, so we assume that the reward is normally distributed. For example, in the 50-chain problem the agent is rewarded 1 when it enters state s10, so we set the reward in s10 to N(µ1, σa²) with µ1 = 1; the agent is rewarded 0 when it enters s1, so we set that reward to N(µ2, σa²) with µ2 = 0. Here σa is the standard deviation of the reward, which is kept fixed and equal for each action, i.e. σa ∈ {0.01, 0.1, 1}. Since the KG exploration policy is fully optimistic, we set the policy improvement interval Kθ to 1. For each run, the initial state s0 was selected uniformly at random from the state space S. We used the pseudo-inverse whenever the matrix Â was non-invertible [7].

Figure 1: Left: the 50-chain domain; in the red cells the agent gets rewards. Right: the grid world with 188 accessible states; the arrows show the optimal action in each state.

Figure 2: Average frequency achieved by the KG policy in online-LSPI (blue) and in online-KBLSPI (red), using standard deviation of reward σa = 1. Left: the 50-chain. Right: the grid world.

For online KBLSPI, we define a kernel function K on state-action pairs, K : |SA| × |SA| → R, which we compose from a state kernel Ks : |S| × |S| → R and an action kernel Ka : |A| × |A| → R as in [14]. The kernel function is thus K = Ks ⊗ Ka, where ⊗ is the Kronecker product.³ The state kernel Ks is a Gaussian kernel, k(s, s′) = exp(−||s − s′||²/(2σks²)), where σks is the standard deviation of the state kernel, s is the state at time t and s′ is the state at time t + 1. The action kernel is also a Gaussian kernel, k(a, a′) = exp(−||a − a′||²/(2σka²)), where σka is the standard deviation of the action kernel, a is the action at time t and a′ is the action at time t + 1. The states s and s′ and the actions a and a′ are normalised as in [12]; e.g. for the 50-chain with |S| = 50 states and |A| = 2 actions, s, s′ ∈ {1/|S|, · · · , 50/|S|} and a, a′ ∈ {0.5, 1}. The parameters σks and σka are tuned empirically and set to 0.55 for the chain domain and 2.25 for the grid world. We set the accuracy v of the approximated kernel basis to 0.0001.

³ If Ks and Ka are Mercer kernels, then their Kronecker product K = Ks ⊗ Ka is also a Mercer kernel.

For online-LSPI, we used Gaussian basis functions φs = exp(−||s − ci||²/(2σΦ²)), where φs is the basis function for state s with centre nodes (ci)_{i=1}^{n} placed at equal distances from each other, and σΦ is the standard deviation of the basis functions, set to 0.55. The number of basis functions n is set to 10 for the 50-chain as in [5] and to 40 for the grid world as in [6].
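As an illustration of the kernel construction described above, the following sketch builds K = Ks ⊗ Ka for the 50-chain with Gaussian state and action kernels; the state-major ordering of the Kronecker product is an assumption chosen to match the indexing z = s · |A| + a used in Algorithm 1.

import numpy as np

def gaussian_kernel_matrix(values, sigma):
    """Gaussian kernel k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)) over a 1-D grid of values."""
    diffs = values[:, None] - values[None, :]
    return np.exp(-diffs ** 2 / (2.0 * sigma ** 2))

# Normalised states and actions of the 50-chain, as described above.
n_states, n_actions = 50, 2
states = np.arange(1, n_states + 1) / n_states     # 1/|S|, ..., 50/|S|
actions = np.array([0.5, 1.0])                     # the two normalised actions

K_s = gaussian_kernel_matrix(states, sigma=0.55)   # state kernel K_s
K_a = gaussian_kernel_matrix(actions, sigma=0.55)  # action kernel K_a
K = np.kron(K_s, K_a)                              # K = K_s (x) K_a; row/column index = s * |A| + a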

4.2 Experimental Results

We compared the performance of online LSPI and online KBLSPI, both using the knowledge gradient policy KG as exploration policy, for different values of the standard deviation of the reward, σa = 0.01, 0.1 and 1. The experimental results on the 50-chain and the grid world show that online-KBLSPI outperforms online-LSPI according to both the average frequency and the cumulative average frequency of reaching the optimal policy, for all tested values of σa. Figure 2 shows how the performance of the learned policy improves when using online-KBLSPI on the 50-chain and the grid. The results clearly show that online KBLSPI usually converges faster than online LSPI to (near) optimal policies, i.e. the performance of online KBLSPI is higher. The performance of online LSPI is better at the very beginning, because online LSPI starts with all of its basis functions available, whereas online KBLSPI constructs its basis functions incrementally through the kernel sparsification method.


5 Conclusion and Future Work

The main conclusion is that online-KBLSPI outperforms online-LSPI. Moreover, we do not need to tune the centre nodes empirically for online-KBLSPI, since it provides feature selection through the approximate linear dependency method. Future work should compare the performance of online-LSPI and online-KBLSPI using other types of basis functions, e.g. the hybrid shortest path basis functions [15], compare their performance on continuous MDP domains, e.g. the inverted pendulum, and use other balancing policies as exploration policies, e.g. the interval estimation policy [3].

References

[1] Y. Engel and R. Meir. Algorithms and Representations for Reinforcement Learning. PhD thesis, The Hebrew University of Jerusalem, 2005.

[2] I.O. Ryzhov, W.B. Powell, and P.I. Frazier. The knowledge-gradient policy for a general class of online learning problems. Operations Research, 60(1):180–195, 2012.

[3] L.P. Kaelbling. Learning in Embedded Systems. MIT Press, 1993.

[4] L. Buşoniu, D. Ernst, B. De Schutter, and R. Babuška. Online least-squares policy iteration for reinforcement learning control. In Proceedings of the 2010 American Control Conference (ACC), pages 486–491, Baltimore, Maryland, 2010.

[5] M.G. Lagoudakis and R. Parr. Model-Free Least Squares Policy Iteration. PhD thesis, Computer Science Department, Duke University, Durham, North Carolina, United States, 2003.

[6] M. Sugiyama, H. Hachiya, C. Towell, and S. Vijayakumar. Geodesic Gaussian kernels for value function approximation. Autonomous Robots, 25(3):287–304, 2008.

[7] S. Mahadevan. Representation Discovery Using Harmonic Analysis. Morgan and Claypool Publishers, 2008.

[8] M.L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley and Sons, Inc., New York, USA, 1994.

[9] B. Schölkopf and A.J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond. MIT Press, Cambridge, MA, USA, 2002.

[10] R.S. Sutton and A.G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, USA, 1998.

[11] V. Vapnik. Statistical Learning Theory. Wiley, New York, USA, 1998.

[12] X. Xu, D. Hu, and X. Lu. Kernel-based least squares policy iteration for reinforcement learning. IEEE Transactions on Neural Networks, 18(4):973–992, 2007.

[13] Y. Engel, S. Mannor, and R. Meir. The kernel recursive least-squares algorithm. IEEE Transactions on Signal Processing, 52(8):2275–2285, 2004.

[14] Y. Engel, S. Mannor, and R. Meir. Reinforcement learning with Gaussian processes. In Proceedings of the 22nd International Conference on Machine Learning (ICML), New York, NY, USA, 2005.

[15] S. Yahyaa and B. Manderick. Shortest path Gaussian kernels for state action graphs: An empirical study. In Proceedings of the 24th Benelux Conference on Artificial Intelligence (BNAIC), 2012.

[16] S. Yahyaa and B. Manderick. Knowledge gradient exploration in online least squares policy iteration. In Proceedings of the 5th International Conference on Agents and Artificial Intelligence (ICAART), Barcelona, Spain, 2013. Springer-Verlag.
