E. DRABIK (Białystok), L. STETTNER (Warszawa)
ON ADAPTIVE CONTROL OF MARKOV CHAINS USING NONPARAMETRIC ESTIMATION
Abstract. Two adaptive procedures for controlled Markov chains, based on nonparametric window estimation, are presented.
1. Introduction. Assume that on a probability space $(\Omega,\mathcal F,P)$ we are given a discrete time controlled Markov process $X=(x_i)$ with values in a finite state space $E=\{1,\dots,k\}$ and with an unknown transition matrix $p_v(i,j)$ depending on a control parameter $v\in[0,1]$. Assume furthermore that for $i,j\in E$ the mapping $[0,1]\ni v\mapsto p_v(i,j)$ is continuous.

Our purpose is to minimize the following average cost per unit time functional:

(1) $J_x(V)=\limsup_{n\to\infty}\frac{1}{n}\,E_x^V\Big\{\sum_{i=1}^n c(x_i,v_i)\Big\}$

over all sequences $V=(v_i)$ of $[0,1]$-valued $\sigma\{x_0,\dots,x_i\}$-measurable random variables, where $E_x^V$ stands for the conditional expected value given that the controlled process $(x_i)$ starts from the state $x$ and the control $V$ is used, and $c:E\times[0,1]\to\mathbb R$ is a continuous function which measures the running cost.

An element $u=[u_1,\dots,u_k]$ of the set $U=[0,1]^k$ will later be interpreted as a Markov control in the sense that we use the control parameter $u_j$ whenever the state process $x_i$ is in the state $j$.
1991 Mathematics Subject Classification: 93E20, 93C40, 62M05.
Key words and phrases: adaptive control, controlled Markov chain, estimation.
The work was supported by KBN grant no. 2 P03A 01515.
Given a nondecreasing sequence $\{b_n\}_{n\in\mathbb N}$ of positive integers such that $b_n\to\infty$ as $n\to\infty$, define the set

$\Phi(\{b_n\}_{n\in\mathbb N})=\Big\{(a_{ni}),\ i=1,\dots,n,\ n=1,2,\dots\ :\ a_{ni}\in\{0,1\},\ \sum_{i=1}^n a_{ni}\ge b_n\Big\}.$

The following auxiliary result will be used to justify the control procedures introduced in Sections 2 and 3.
Proposition 1. Let $(Y_i)$ be a sequence of real-valued random variables such that $E[Y_{i+1}\mid Y_1,\dots,Y_i]=0$ and $M=\sup_i E\{Y_i^2\}<\infty$. Then

(2) $\sup_{(a_{ni})\in\Phi(\{b_n\})}\Big|\frac{\sum_{i=1}^n a_{ni}Y_i}{\sum_{i=1}^n a_{ni}}\Big|\to 0$ in probability as $n\to\infty$.
Proof. Assume, contrary to (2), that

$P\Big(\Big|\frac{\sum_{i=1}^n a_{ni}Y_i}{\sum_{i=1}^n a_{ni}}\Big|\ge\varepsilon\Big)$

does not converge to 0 as $n\to\infty$ for some $(a_{ni})\in\Phi(\{b_n\})$. Then by the Chebyshev inequality, together with the orthogonality of the martingale differences $Y_i$ and the identity $a_{ni}^2=a_{ni}$,

$P\Big(\Big|\frac{\sum_{i=1}^n a_{ni}Y_i}{\sum_{i=1}^n a_{ni}}\Big|\ge\varepsilon\Big)\le\frac{E[(\sum_{i=1}^n a_{ni}Y_i)^2]}{\varepsilon^2(\sum_{i=1}^n a_{ni})^2}=\frac{\sum_{i=1}^n (a_{ni})^2 E(Y_i^2)}{\varepsilon^2(\sum_{i=1}^n a_{ni})^2}\le\frac{M}{\varepsilon^2\sum_{i=1}^n a_{ni}}\le\frac{M}{b_n\varepsilon^2}\to 0,$

and we have a contradiction.
Let $\tilde u_i\in U$, $i=0,1,\dots$, be a sequence equidistributed in $U$. Given a sequence $h_n\searrow 0$, for $u=[u_1,\dots,u_k]\in U$ and $\tilde u_{ij}$ denoting the $j$th coordinate of $\tilde u_i$, define

(3) $F_n(u)=\sum_{i=0}^n\prod_{j=1}^k 1_{\{|u_j-\tilde u_{ij}|\le h_n\}}.$
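The window count (3) is straightforward to compute: it counts the trial controls whose every coordinate lies within $h_n$ of the corresponding coordinate of $u$, exactly as in the product of indicators. A minimal sketch (the helper name is hypothetical):

```python
import numpy as np

def F_n(u, u_tilde, h):
    """Window count (3): number of rows of u_tilde within the sup-norm
    ball of radius h around u (every coordinate within h)."""
    u = np.asarray(u)
    hits = np.all(np.abs(u_tilde - u) <= h, axis=1)
    return int(hits.sum())
```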
In what follows we shall assume that $h_n$ is chosen such that

(4) $f_n:=\min_{u\in U}F_n(u)\to\infty$

as $n\to\infty$. In particular, we can choose for $\tilde u_i$ successively the centers of the cubes with edges of length $1/2^j$ which cover the set $U=[0,1]^k$ for $j=0,1,\dots$, and consider the sequences $\tilde u_i$ and $h_i$ of the form

$\tilde u_0=\big(\tfrac12,\dots,\tfrac12\big),\qquad h_0=\tfrac12,$

$\tilde u_1=\big(\tfrac14,\tfrac14,\dots,\tfrac14\big),\qquad h_1=\dots=h_{2^k}=\tfrac13,$

$\tilde u_2=\big(\tfrac34,\tfrac14,\dots,\tfrac14\big),$

$\ \ \vdots$

$\tilde u_{2^k}=\big(\tfrac34,\tfrac34,\dots,\tfrac34\big),$

$\tilde u_{2^k+1}=\big(\tfrac18,\tfrac18,\dots,\tfrac18\big),\qquad h_{2^k+1}=\dots=h_{2^k+4^k}=\tfrac14,$

$\ \ \vdots$

$\tilde u_{2^k+4^k}=\big(\tfrac78,\tfrac78,\dots,\tfrac78\big),$

$\tilde u_{2^k+4^k+1}=\big(\tfrac1{16},\tfrac1{16},\dots,\tfrac1{16}\big),\qquad h_{2^k+4^k+1}=\dots=h_{2^k+4^k+8^k}=\tfrac15,$

$\ \ \vdots$

$\tilde u_{2^k+4^k+8^k}=\big(\tfrac{15}{16},\tfrac{15}{16},\dots,\tfrac{15}{16}\big),$

and so on.
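The dyadic scheme above can be generated programmatically. The sketch below is an illustration under our reading of the construction (level $j$ contributes the $2^{jk}$ centers of the edge-$1/2^j$ cubes, paired with windows $1/2,1/3,1/4,\dots$); `centers_and_windows` is a hypothetical helper, not a name from the paper.

```python
import itertools
import numpy as np

def centers_and_windows(k, levels):
    """Centers of the 2^(jk) cubes of edge 1/2^j covering [0,1]^k,
    for j = 0,...,levels-1, with level-j window h = 1/(j+2)."""
    seq, hs = [], []
    for j in range(levels):
        pts = [(2 * m + 1) / 2 ** (j + 1) for m in range(2 ** j)]
        for center in itertools.product(pts, repeat=k):
            seq.append(center)
            hs.append(1.0 / (j + 2))
    return np.array(seq), np.array(hs)
```

For $k=2$ and three levels this yields $1+4+16=21$ trial controls, each window containing all finer centers of its own level, which is what makes $f_n\to\infty$ plausible.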
In the theory of adaptive control of Markov processes the number of feasible procedures is very limited (see the papers [2], [4] and [5]). A recursive self-tuning algorithm proposed in [2] is based on asymptotic properties of ordinary differential equations and requires differentiability of the transition operator with respect to the unknown parameter. Other methods, which use a discretized MLE ([4]) or the theory of large deviations ([5]), require the construction of a finite class of ε-optimal controls, which is usually a hard problem.

In this paper we propose an alternative approach based on the window nonparametric estimation used for multiarmed bandit problems in [1]. Assuming that our model is uniformly ergodic (assumptions (5) and (13)), we are able, even though we do not know the transition probabilities, to construct adaptive procedures for which we obtain self-optimality.

The paper consists of three sections. In Section 2 we introduce an adaptive procedure based on nonparametric estimation of the cost functional. In Section 3 another procedure, based on nonparametric estimation of the transition kernel, is considered.
2. Adaptive control with cost estimation. Assume there exists a uniformly positive recurrent state $e\in E$ of the Markov process $X$ in the sense that

(5) $\sup_{u\in U}E_e^u\{\tau^2\}<\infty,$

where

(6) $\tau=\inf\{i>0:\ x_i=e\}.$

Note that the above property holds in particular when $\inf_{v\in[0,1]}\inf_{j\in E}p_v(j,e)>0$.
Let

(7) $\tau_1=\tau,\quad \tau_{n+1}=\tau_n+\tau\circ\Theta_{\tau_n},$

with $\tau$ defined in (6) and $\Theta_\tau$ being the Markov shift operator. In other words, $\tau_n$ are the moments of successive returns to the recurrent state $e$.
Assume now that in the time interval $[\tau_i,\tau_{i+1})$ we use the Markov control $\tilde u_i$. For $u\in U$ define

(8) $G_n(u)=\sum_{i=0}^n\prod_{j=1}^k 1_{\{|u_j-\tilde u_{ij}|\le h_n\}}\sum_{r=\tau_i}^{\tau_{i+1}-1}c(x_r,\tilde u_i(x_r))$

and

(9) $H_n(u)=\sum_{i=0}^n\prod_{j=1}^k 1_{\{|u_j-\tilde u_{ij}|\le h_n\}}(\tau_{i+1}-\tau_i).$

Notice that $G_n(u)$ is the total cost incurred over those intervals $[\tau_i,\tau_{i+1})\subset[0,\tau_{n+1})$ in which the control $\tilde u_i$ lies in the closed ball with center $u$ and radius $h_n$. Similarly, $H_n(u)$ is the total time during which the control from the sequence $(\tilde u_i)$ lies in the closed ball with center $u$ and radius $h_n$.
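In an implementation, (8) and (9) are windowed totals over recorded cycles: each return cycle contributes its cost and its length whenever its trial control falls in the $h_n$-ball around $u$. A minimal sketch with hypothetical names (the cycle records would come from the controlled process itself):

```python
import numpy as np

def G_H(u, u_tilde, cycle_costs, cycle_lengths, h):
    """Windowed totals (8)-(9): sum the cost and the length of those
    cycles whose trial control u_tilde[i] is in the h-ball around u."""
    in_window = np.all(np.abs(u_tilde - np.asarray(u)) <= h, axis=1)
    G = float(np.dot(in_window, cycle_costs))
    H = float(np.dot(in_window, cycle_lengths))
    return G, H
```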
Proposition 2. We have

(10) $\sup_{u\in U}\Big|\frac{G_n(u)}{F_n(u)}-E_e^u\Big\{\sum_{i=0}^{\tau-1}c(x_i,u(x_i))\Big\}\Big|\to 0$

and

(11) $\sup_{u\in U}\Big|\frac{H_n(u)}{F_n(u)}-E_e^u\{\tau\}\Big|\to 0$

in probability as $n\to\infty$, and consequently

(12) $\sup_{u\in U}\Big|\frac{G_n(u)}{H_n(u)}-\sum_{\eta\in E}c(\eta,u(\eta))\pi_u(\eta)\Big|\to 0$

in probability as $n\to\infty$, where $\pi_u$ is the unique invariant measure corresponding to the Markov process $X$ with Markov control $u$.
Proof. Let

$Y_i=\sum_{r=\tau_{i-1}}^{\tau_i-1}c(x_r,\tilde u_i(x_r))-E_e^{\tilde u_i}\Big\{\sum_{r=0}^{\tau-1}c(x_r,\tilde u_i(x_r))\Big\},$

with $\tau_0=0$. Clearly $E[Y_{i+1}\mid Y_1,\dots,Y_i]=0$, and from the boundedness of $c(\cdot,\cdot)$ and (5) we have $\sup_i E_e\{Y_i^2\}<\infty$. Consequently, from Proposition 1,

$\frac{\big|\sum_{i=0}^n\prod_{j=1}^k 1_{\{|u_j-\tilde u_{ij}|\le h_n\}}\big(\sum_{r=\tau_{i-1}}^{\tau_i-1}c(x_r,\tilde u_i(x_r))-E_e^{\tilde u_i}\{\sum_{r=0}^{\tau-1}c(x_r,\tilde u_i(x_r))\}\big)\big|}{F_n(u)}$

converges to 0 in probability as $n\to\infty$, uniformly in $u\in U$.
Note that under (5), by continuity of $p_v(e,j)$ with respect to $v$, the mapping

$U\ni u\mapsto E_e^u\Big\{\sum_{r=0}^{\tau-1}c(x_r,u(x_r))\Big\},$

where $U$ is endowed with the Euclidean norm, is continuous. Therefore, since $h_n\to 0$, we obtain

$\sup_{u\in U}\Big|\frac{G_n(u)}{F_n(u)}-E_e^u\Big\{\sum_{r=0}^{\tau-1}c(x_r,u(x_r))\Big\}\Big|\to 0$

in probability, which completes the proof of (10). The proof of (11) is similar: we simply let $c(\cdot,\cdot)\equiv 1$ in the previous considerations. The convergence (12) follows directly from (10) and (11) upon noticing that

$\frac{E_e^u\{\sum_{r=0}^{\tau-1}c(x_r,u(x_r))\}}{E_e^u\{\tau\}}=\sum_{\eta\in E}c(\eta,u(\eta))\pi_u(\eta),$

where the existence of a unique invariant measure $\pi_u$ and its form are guaranteed by assumption (5).
We are now in a position to formulate our first control procedure:

For a given $\varepsilon>0$ find a positive integer $n_\varepsilon$ such that

$P\Big(\sup_{u\in U}\Big|\frac{G_n(u)}{H_n(u)}-\sum_{\eta\in E}c(\eta,u(\eta))\pi_u(\eta)\Big|\ge\varepsilon\Big)\le\frac{\varepsilon}{\|c\|}$

for $n\ge n_\varepsilon$, with $\|\cdot\|$ standing for the supremum norm.

For the first $n_\varepsilon$ cycles, i.e. until time $\tau_{n_\varepsilon+1}$, test controls from the sequence $\tilde u_i$, using the Markov control $\tilde u_i$ in the time interval $[\tau_i,\tau_{i+1})$. At time $\tau_{n_\varepsilon+1}$ find a control $u_\delta$ that is $\delta$-optimal for $G_{n_\varepsilon}(u)/H_{n_\varepsilon}(u)$, i.e. such that

$\frac{G_{n_\varepsilon}(u_\delta)}{H_{n_\varepsilon}(u_\delta)}\le\inf_{u\in U}\frac{G_{n_\varepsilon}(u)}{H_{n_\varepsilon}(u)}+\delta,$

and use this control function for each $i\ge\tau_{n_\varepsilon+1}$. Denote the above control procedure by $V^c$.
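A compact end-to-end sketch of a procedure in the spirit of $V^c$ follows. Everything in it is invented for illustration: the two-state kernel `p_v`, the cost, the number of cycles, the window $h$, and the grid search over $U$; moreover, the equidistributed sequence $\tilde u_i$ is replaced by uniform random draws, which is a simplification, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(2)
k, e = 2, 0                                # states {0, 1}, recurrent state e

def p_v(i, v):
    """Hidden controlled kernel (invented for the demo), continuous in v."""
    stay = 0.2 + 0.6 * (v if i == 0 else 1.0 - v)
    return np.array([stay, 1.0 - stay]) if i == 0 else np.array([1.0 - stay, stay])

def c(i, v):
    """Running cost: state 1 is expensive, larger controls cost extra."""
    return (1.0 if i == 0 else 3.0) + v

n_cycles, h = 4000, 0.15
trials = rng.random((n_cycles, k))         # stand-in for the sequence u~_i
G = np.zeros(n_cycles)                     # cost of each return cycle
H = np.zeros(n_cycles)                     # length of each return cycle
for m, u in enumerate(trials):             # one return cycle per trial control
    x = e
    while True:
        G[m] += c(x, u[x]); H[m] += 1
        x = rng.choice(k, p=p_v(x, u[x]))
        if x == e:
            break

def ratio(u):
    """Windowed cost-rate estimate G_n(u) / H_n(u)."""
    w = np.all(np.abs(trials - u) <= h, axis=1)
    return G[w].sum() / max(H[w].sum(), 1.0)

grid = [np.array([a, b]) for a in np.linspace(0, 1, 11)
        for b in np.linspace(0, 1, 11)]
u_hat = min(grid, key=ratio)               # near delta-optimal Markov control
```

The grid minimization plays the role of finding a $\delta$-optimal $u_\delta$; with a finer grid and more cycles the estimate $G_n(u)/H_n(u)$ approaches the stationary cost rate of Proposition 2.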
Theorem 1. We have

$J_x(V^c)\le\inf_{u\in U}\Big[\sum_{\eta\in E}c(\eta,u(\eta))\pi_u(\eta)\Big]+3\varepsilon+\delta.$
Proof. Because of the form of the cost functional (1) and the boundedness of the cost function $c$, only the controls used after time $\tau_{n_\varepsilon+1}$ affect the value of the cost functional $J$, and therefore

$J_x(V^c)=E_x\Big[\sum_{\eta\in E}c(\eta,u_\delta(\eta))\pi_{u_\delta}(\eta)\Big].$

Let

$B=\Big\{\sup_{u\in U}\Big|\frac{G_{n_\varepsilon}(u)}{H_{n_\varepsilon}(u)}-\sum_{\eta\in E}c(\eta,u(\eta))\pi_u(\eta)\Big|<\varepsilon\Big\}.$

For $\omega\in B$ we have

$\sum_{\eta\in E}c(\eta,u_\delta(\eta))\pi_{u_\delta}(\eta)\le\varepsilon+\frac{G_{n_\varepsilon}(u_\delta)}{H_{n_\varepsilon}(u_\delta)}\le\varepsilon+\delta+\inf_{u\in U}\frac{G_{n_\varepsilon}(u)}{H_{n_\varepsilon}(u)}\le\inf_{u\in U}\Big[\sum_{\eta\in E}c(\eta,u(\eta))\pi_u(\eta)\Big]+2\varepsilon+\delta.$

Consequently,

$J_x(V^c)\le\inf_{u\in U}\Big[\sum_{\eta\in E}c(\eta,u(\eta))\pi_u(\eta)\Big]+2\varepsilon+\delta+\|c\|\,P(B^c)\le\inf_{u\in U}\Big[\sum_{\eta\in E}c(\eta,u(\eta))\pi_u(\eta)\Big]+3\varepsilon+\delta,$

which completes the proof.
3. Adaptive control procedure with transition probability estimation. In this section we estimate the transition probability function $p_u(i,j)$. Assume now that the Markov process $X=(x_i)$ is controlled using the Markov control $\tilde u_i$ at time $i$. For $l,l'\in E$ let (cf. (3) and (8))

$G_n^{l,l'}(u)=\sum_{i=0}^n\prod_{j=1}^k 1_{\{|u_j-\tilde u_{ij}|\le h_n\}}1_l(x_i)1_{l'}(x_{i+1})$

and

$F_n^l(u)=\sum_{i=0}^n\prod_{j=1}^k 1_{\{|u_j-\tilde u_{ij}|\le h_n\}}1_l(x_i).$

Assume now that there is $\kappa>0$ such that for $l,l'\in E$,

(13) $\inf_{v\in[0,1]}p_v(l,l')>\kappa.$
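In an implementation, $G_n^{l,l'}$ and $F_n^l$ are just windowed transition-pair and visit counts along the single observed trajectory. A minimal sketch (the helper name and calling convention are hypothetical):

```python
import numpy as np

def transition_counts(xs, controls, u, h, k):
    """Windowed counts: G[l, l'] counts pairs (x_i = l, x_{i+1} = l') and
    F[l] counts visits to l, restricted to the times i whose trial
    control is in the sup-norm h-ball around u."""
    G = np.zeros((k, k))
    F = np.zeros(k)
    for i in range(len(xs) - 1):
        if np.all(np.abs(controls[i] - u) <= h):
            F[xs[i]] += 1
            G[xs[i], xs[i + 1]] += 1
    return G, F
```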
Proposition 3. We have

$\sup_{u\in U}\Big|\frac{G_n^{l,l'}(u)}{F_n^l(u)}-p(l,l',u_l)\Big|\to 0$

in probability as $n\to\infty$, where $p(l,l',v):=p_v(l,l')$.
Proof. Let

$Y_i=1_l(x_i)1_{l'}(x_{i+1})-1_l(x_i)\,p(l,l',\tilde u_{il}).$

Clearly $E[Y_i\mid Y_1,\dots,Y_{i-1}]=0$. Therefore by Proposition 1,

(14) $\sup_{u\in U}\Big|\frac{\sum_{i=0}^n\prod_{j=1}^k 1_{\{|u_j-\tilde u_{ij}|\le h_n\}}Y_i}{F_n(u)}\Big|\to 0$

in probability as $n\to\infty$. Using Proposition 1 again, with

$Y_i'=1_l(x_{i+1})-p(x_i,l,\tilde u_{i,x_i}),$

we obtain

$\sup_{u\in U}\Big|\frac{\sum_{i=0}^n\prod_{j=1}^k 1_{\{|u_j-\tilde u_{ij}|\le h_n\}}Y_i'}{F_n(u)}\Big|\to 0.$

Therefore by (13), for $\kappa>\varepsilon>0$,

$P\Big(\inf_{u\in U}\frac{F_n^l(u)}{F_n(u)}\le\kappa-\varepsilon\Big)\le P\Big(\sup_{u\in U}\frac{|\sum_{i=0}^n\prod_{j=1}^k 1_{\{|u_j-\tilde u_{ij}|\le h_n\}}Y_i'|}{F_n(u)}\ge\varepsilon\Big)\to 0$

as $n\to\infty$, and from (14) we obtain

$\sup_{u\in U}\Big|\frac{\sum_{i=0}^n\prod_{j=1}^k 1_{\{|u_j-\tilde u_{ij}|\le h_n\}}Y_i}{F_n^l(u)}\Big|\to 0$

in probability as $n\to\infty$. Hence
$\sup_{u\in U}\Big|\frac{G_n^{l,l'}(u)}{F_n^l(u)}-\frac{\sum_{i=0}^n\prod_{j=1}^k 1_{\{|u_j-\tilde u_{ij}|\le h_n\}}1_l(x_i)\,p(l,l',\tilde u_{il})}{F_n^l(u)}\Big|\to 0$

in probability as $n\to\infty$, and by continuity of $p(l,l',v)$ with respect to $v$ we finally obtain

$\frac{\sum_{i=0}^n\prod_{j=1}^k 1_{\{|u_j-\tilde u_{ij}|\le h_n\}}1_l(x_i)\,p(l,l',\tilde u_{il})}{F_n^l(u)}\to p(l,l',u_l)$

uniformly in $u\in U$ as $n\to\infty$, which completes the proof.
Our second adaptive procedure consists of the following steps:

1. For a given $\varepsilon>0$ find $n_\varepsilon$ such that for $n\ge n_\varepsilon$ and $l,l'\in E$,

$P\Big(\sup_{u\in U}\Big|\frac{G_n^{l,l'}(u)}{F_n^l(u)}-p(l,l',u_l)\Big|>\varepsilon\Big)\le\frac{\varepsilon}{\|c\|k^2}.$

2. At time $n_\varepsilon$ normalize $G_{n_\varepsilon}^{l,l'}(u)/F_{n_\varepsilon}^l(u)$, i.e. form a new transition matrix

$\tilde p(l,l',u)=\frac{G_{n_\varepsilon}^{l,l'}(u)}{\sum_{r=1}^k G_{n_\varepsilon}^{l,r}(u)}.$

3. Then find an invariant measure $\pi_u^\varepsilon$ for the transition matrix $\tilde p(l,l',u)$ and determine $u_\delta$ such that

$\sum_\eta c(\eta,u_\delta(\eta))\pi_{u_\delta}^\varepsilon(\eta)\le\inf_{u\in U}\sum_\eta c(\eta,u(\eta))\pi_u^\varepsilon(\eta)+\delta.$

4. Starting from time $n_\varepsilon$ use the control function $u_\delta$. Denote the above control procedure by $V^p$.
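Steps 2 and 3 reduce to row-normalizing the count matrix and extracting a stationary distribution. A sketch under the same illustrative conventions (eigenvector extraction is one standard way to obtain the invariant measure; the paper does not prescribe a particular method):

```python
import numpy as np

def normalize_rows(G):
    """Step 2: turn windowed pair counts G[l, l'] into a stochastic matrix."""
    return G / G.sum(axis=1, keepdims=True)

def invariant_measure(P):
    """Step 3: stationary distribution, i.e. left eigenvector of P at 1."""
    w, v = np.linalg.eig(P.T)
    pi = np.real(v[:, np.argmin(np.abs(w - 1.0))])
    return pi / pi.sum()
```

Step 3 then evaluates $\sum_\eta c(\eta,u(\eta))\pi_u^\varepsilon(\eta)$ over candidate controls $u$ and keeps a near-minimizer $u_\delta$.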
Theorem 2. Under (13) we have

$J(V^p)\le\inf_{u\in U}\Big[\sum_{\eta\in E}c(\eta,u(\eta))\pi_u(\eta)\Big]+\|c\|\,\frac{(1+k)\varepsilon}{(1-k\varepsilon)\kappa}+\delta+\varepsilon,$

where $\pi_u$ is the unique invariant measure corresponding to $p(l,l',u_l)$.
Proof. If for each $l,l'\in E$,

$\sup_{u\in U}\Big|\frac{G_n^{l,l'}(u)}{F_n^l(u)}-p(l,l',u_l)\Big|\le\varepsilon,$

then we have

$|\tilde p(l,l',u)-p(l,l',u_l)|=\frac{F_n^l(u)}{\sum_{r=1}^k G_n^{l,r}(u)}\,\Big|\frac{G_n^{l,l'}(u)}{F_n^l(u)}-p(l,l',u_l)\sum_{r=1}^k\frac{G_n^{l,r}(u)}{F_n^l(u)}\Big|$

$\le\Big(\varepsilon+p(l,l',u_l)\Big|1-\sum_{r=1}^k\frac{G_n^{l,r}(u)}{F_n^l(u)}\Big|\Big)\frac{1}{1-k\varepsilon}\le(1+k)\varepsilon\,\frac{1}{1-k\varepsilon}.$
From the Theorem and Corollary 2 of [7], under (13), we see that for $l\in E$,

$\sup_{u\in U}|\pi_u^\varepsilon(l)-\pi_u(l)|\le\frac12\cdot\frac{(1+k)\varepsilon}{(1-k\varepsilon)\kappa}.$

Therefore for $\omega\in\bigcap_{l=1}^k\bigcap_{l'=1}^k B_{ll'}$, where

$B_{ll'}=\Big\{\sup_{u\in U}\Big|\frac{G_n^{l,l'}(u)}{F_n^l(u)}-p(l,l',u_l)\Big|\le\varepsilon\Big\},$

we have

$\sum_{\eta\in E}c(\eta,u_\delta(\eta))\pi_{u_\delta}^\varepsilon(\eta)\le\inf_{u\in U}\sum_{\eta\in E}c(\eta,u(\eta))\pi_u(\eta)+\|c\|\,\frac{(1+k)\varepsilon}{(1-k\varepsilon)\kappa}+\delta.$

Consequently,

$J(V^p)=E\Big[\sum_{\eta\in E}c(\eta,u_\delta(\eta))\pi_{u_\delta}^\varepsilon(\eta)\Big]\le\inf_{u\in U}\sum_{\eta\in E}c(\eta,u(\eta))\pi_u(\eta)+\|c\|\,\frac{(1+k)\varepsilon}{(1-k\varepsilon)\kappa}+\delta+\|c\|\,P\Big(\Omega\setminus\bigcap_{l=1}^k\bigcap_{l'=1}^k B_{ll'}\Big)\le\inf_{u\in U}\sum_{\eta\in E}c(\eta,u(\eta))\pi_u(\eta)+\|c\|\,\frac{(1+k)\varepsilon}{(1-k\varepsilon)\kappa}+\delta+\varepsilon.$