E. DRABIK (Białystok), L. STETTNER (Warszawa)
ON ADAPTIVE CONTROL OF MARKOV CHAINS USING NONPARAMETRIC ESTIMATION
Abstract. Two adaptive procedures for controlled Markov chains, based on nonparametric window estimation, are presented.
1. Introduction. Assume that on a probability space $(\Omega,\mathcal F,P)$ we are given a discrete time controlled Markov process $X=(x_i)$ with values in a finite state space $E=\{1,\dots,k\}$ and with an unknown transition matrix $p_v(i,j)$ depending on a control parameter $v\in[0,1]$. Assume furthermore that for $i,j\in E$ the mapping $[0,1]\ni v\mapsto p_v(i,j)$ is continuous.

Our purpose is to minimize the following average cost per unit time functional:

(1) $J_x(V)=\limsup_{n\to\infty}\frac{1}{n}\,E_x^V\Big\{\sum_{i=1}^n c(x_i,v_i)\Big\}$

over all sequences $V=(v_i)$ of $[0,1]$-valued $\sigma\{x_0,\dots,x_i\}$-measurable random variables, where $E_x^V$ stands for the conditional expected value given that the controlled process $(x_i)$ starts from the state $x$ and the control $V$ is used, and $c:E\times[0,1]\to\mathbb R$ is a continuous function which measures the running cost.

An element $u=[u_1,\dots,u_k]$ of the set $U=[0,1]^k$ will later be interpreted as a Markov control in the sense that we use the control parameter $u_j$ whenever the state process $x_i$ is in the state $j$.
1991 Mathematics Subject Classification: 93E20, 93C40, 62M05.
Key words and phrases: adaptive control, controlled Markov chain, estimation.
The work was supported by KBN grant no. 2 P03A 01515.
Given a nondecreasing sequence $\{b_n\}_{n\in\mathbb N}$ of positive integers such that $b_n\to\infty$ as $n\to\infty$, define the set

$\Phi(\{b_n\}_{n\in\mathbb N})=\Big\{(a_{ni}),\ i=1,\dots,n,\ n=1,2,\dots\ :\ a_{ni}\in\{0,1\},\ \sum_{i=1}^n a_{ni}\ge b_n\Big\}.$

The following auxiliary result will be used to justify the control procedures introduced in Sections 2 and 3.
Proposition 1. Let $(Y_i)$ be a sequence of real-valued random variables such that $E[Y_{i+1}\mid Y_1,\dots,Y_i]=0$ and $M=\sup_i E\{Y_i^2\}<\infty$. Then

(2) $\sup_{(a_{ni})\in\Phi(\{b_n\})}\Big|\frac{\sum_{i=1}^n a_{ni}Y_i}{\sum_{i=1}^n a_{ni}}\Big|\to 0$ in probability as $n\to\infty$.
Proof. Assume, contrary to (2), that

$P\Big(\Big|\frac{\sum_{i=1}^n a_{ni}Y_i}{\sum_{i=1}^n a_{ni}}\Big|\ge\varepsilon\Big)$

does not converge to 0 as $n\to\infty$ for some $(a_{ni})\in\Phi(\{b_n\})$. Then by the Chebyshev inequality, together with the orthogonality of the martingale differences $Y_i$ and the identity $a_{ni}^2=a_{ni}$,

$P\Big(\Big|\frac{\sum_{i=1}^n a_{ni}Y_i}{\sum_{i=1}^n a_{ni}}\Big|\ge\varepsilon\Big)\le\frac{E[(\sum_{i=1}^n a_{ni}Y_i)^2]}{\varepsilon^2(\sum_{i=1}^n a_{ni})^2}=\frac{\sum_{i=1}^n (a_{ni})^2 E(Y_i^2)}{\varepsilon^2(\sum_{i=1}^n a_{ni})^2}\le\frac{M}{\varepsilon^2\sum_{i=1}^n a_{ni}}\le\frac{M}{b_n\varepsilon^2}\to 0,$

and we have a contradiction.
Let $\tilde u_i\in U$, $i=0,1,\dots$, be a sequence equidistributed in $U$. Given a sequence $h_n\searrow 0$, for $u=[u_1,\dots,u_k]\in U$ and $\tilde u_{ij}$ denoting the $j$th coordinate of $\tilde u_i$, define

(3) $F_n(u)=\sum_{i=0}^n\prod_{j=1}^k 1_{\{|u_j-\tilde u_{ij}|\le h_n\}}.$
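The window count (3) is straightforward to compute: it counts the trial controls whose every coordinate lies within $h_n$ of the corresponding coordinate of $u$, exactly as in the product of indicators. A minimal sketch (the helper name is hypothetical):

```python
import numpy as np

def F_n(u, u_tilde, h):
    """Window count (3): number of rows of u_tilde within the sup-norm
    ball of radius h around u (every coordinate within h)."""
    u = np.asarray(u)
    hits = np.all(np.abs(u_tilde - u) <= h, axis=1)
    return int(hits.sum())
```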
In what follows we shall assume that $h_n$ is chosen such that

(4) $f_n:=\min_{u\in U}F_n(u)\to\infty$

as $n\to\infty$. In particular, we can choose for $\tilde u_i$ successively the centers of the cubes with edges of length $1/2^j$ which cover the set $U=[0,1]^k$ for $j=0,1,\dots$, and consider the sequences $\tilde u_i$ and $h_i$ of the form

$\tilde u_0=\big(\tfrac12,\dots,\tfrac12\big),\qquad h_0=\tfrac12,$

$\tilde u_1=\big(\tfrac14,\tfrac14,\dots,\tfrac14\big),\qquad h_1=\dots=h_{2^k}=\tfrac13,$

$\tilde u_2=\big(\tfrac34,\tfrac14,\dots,\tfrac14\big),$

$\ \ \vdots$

$\tilde u_{2^k}=\big(\tfrac34,\tfrac34,\dots,\tfrac34\big),$

$\tilde u_{2^k+1}=\big(\tfrac18,\tfrac18,\dots,\tfrac18\big),\qquad h_{2^k+1}=\dots=h_{2^k+4^k}=\tfrac14,$

$\ \ \vdots$

$\tilde u_{2^k+4^k}=\big(\tfrac78,\tfrac78,\dots,\tfrac78\big),$

$\tilde u_{2^k+4^k+1}=\big(\tfrac1{16},\tfrac1{16},\dots,\tfrac1{16}\big),\qquad h_{2^k+4^k+1}=\dots=h_{2^k+4^k+8^k}=\tfrac15,$

$\ \ \vdots$

$\tilde u_{2^k+4^k+8^k}=\big(\tfrac{15}{16},\tfrac{15}{16},\dots,\tfrac{15}{16}\big),$

and so on.
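The dyadic scheme above can be generated programmatically. The sketch below is an illustration under our reading of the construction (level $j$ contributes the $2^{jk}$ centers of the edge-$1/2^j$ cubes, paired with windows $1/2,1/3,1/4,\dots$); `centers_and_windows` is a hypothetical helper, not a name from the paper.

```python
import itertools
import numpy as np

def centers_and_windows(k, levels):
    """Centers of the 2^(jk) cubes of edge 1/2^j covering [0,1]^k,
    for j = 0,...,levels-1, with level-j window h = 1/(j+2)."""
    seq, hs = [], []
    for j in range(levels):
        pts = [(2 * m + 1) / 2 ** (j + 1) for m in range(2 ** j)]
        for center in itertools.product(pts, repeat=k):
            seq.append(center)
            hs.append(1.0 / (j + 2))
    return np.array(seq), np.array(hs)
```

For $k=2$ and three levels this yields $1+4+16=21$ trial controls, each window containing all finer centers of its own level, which is what makes $f_n\to\infty$ plausible.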
In the theory of adaptive control of Markov processes the number of feasible procedures is very limited (see the papers [2], [4] and [5]). A recursive self-tuning algorithm proposed in [2] is based on asymptotic properties of ordinary differential equations and requires differentiability of the transition operator with respect to the unknown parameter. Other methods, which use a discretized MLE ([4]) or the theory of large deviations ([5]), require the construction of a finite class of ε-optimal controls, which is usually a hard problem.

In this paper we propose an alternative approach based on the window nonparametric estimation used for multiarmed bandit problems in [1]. Assuming that our model is uniformly ergodic (assumptions (5) and (13)), we are able, even though we do not know the transition probabilities, to construct adaptive procedures for which we obtain self-optimality.

The paper consists of three sections. In Section 2 we introduce an adaptive procedure based on nonparametric estimation of the cost functional. In Section 3 another procedure, based on nonparametric estimation of the transition kernel, is considered.
2. Adaptive control with cost estimation. Assume there exists a uniformly positive recurrent state $e\in E$ of the Markov process $X$ in the sense that

(5) $\sup_{u\in U}E_e^u\{\tau^2\}<\infty,$

where

(6) $\tau=\inf\{i>0:\ x_i=e\}.$

Note that the above property holds in particular when $\inf_{v\in[0,1]}\inf_{j\in E}p_v(j,e)>0$.
Let

(7) $\tau_1=\tau,\quad \tau_{n+1}=\tau_n+\tau\circ\Theta_{\tau_n},$

with $\tau$ defined in (6) and $\Theta_\tau$ being the Markov shift operator. In other words, $\tau_n$ are the moments of successive returns to the recurrent state $e$.
Assume now that in the time interval $[\tau_i,\tau_{i+1})$ we use the Markov control $\tilde u_i$. For $u\in U$ define

(8) $G_n(u)=\sum_{i=0}^n\prod_{j=1}^k 1_{\{|u_j-\tilde u_{ij}|\le h_n\}}\sum_{r=\tau_i}^{\tau_{i+1}-1}c(x_r,\tilde u_i(x_r))$

and

(9) $H_n(u)=\sum_{i=0}^n\prod_{j=1}^k 1_{\{|u_j-\tilde u_{ij}|\le h_n\}}(\tau_{i+1}-\tau_i).$

Notice that $G_n(u)$ is the total cost incurred over those intervals $[\tau_i,\tau_{i+1})\subset[0,\tau_{n+1})$ in which the control $\tilde u_i$ lies in the closed ball with center $u$ and radius $h_n$. Similarly, $H_n(u)$ is the total time during which the control from the sequence $(\tilde u_i)$ lies in the closed ball with center $u$ and radius $h_n$.
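In an implementation, (8) and (9) are windowed totals over recorded cycles: each return cycle contributes its cost and its length whenever its trial control falls in the $h_n$-ball around $u$. A minimal sketch with hypothetical names (the cycle records would come from the controlled process itself):

```python
import numpy as np

def G_H(u, u_tilde, cycle_costs, cycle_lengths, h):
    """Windowed totals (8)-(9): sum the cost and the length of those
    cycles whose trial control u_tilde[i] is in the h-ball around u."""
    in_window = np.all(np.abs(u_tilde - np.asarray(u)) <= h, axis=1)
    G = float(np.dot(in_window, cycle_costs))
    H = float(np.dot(in_window, cycle_lengths))
    return G, H
```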
Proposition 2. We have

(10) $\sup_{u\in U}\Big|\frac{G_n(u)}{F_n(u)}-E_e^u\Big\{\sum_{i=0}^{\tau-1}c(x_i,u(x_i))\Big\}\Big|\to 0$

and

(11) $\sup_{u\in U}\Big|\frac{H_n(u)}{F_n(u)}-E_e^u\{\tau\}\Big|\to 0$

in probability as $n\to\infty$, and consequently

(12) $\sup_{u\in U}\Big|\frac{G_n(u)}{H_n(u)}-\sum_{\eta\in E}c(\eta,u(\eta))\pi_u(\eta)\Big|\to 0$

in probability as $n\to\infty$, where $\pi_u$ is the unique invariant measure corresponding to the Markov process $X$ with Markov control $u$.
Proof. Let

$Y_i=\sum_{r=\tau_{i-1}}^{\tau_i-1}c(x_r,\tilde u_i(x_r))-E_e^{\tilde u_i}\Big\{\sum_{r=0}^{\tau-1}c(x_r,\tilde u_i(x_r))\Big\},$

with $\tau_0=0$. Clearly $E[Y_{i+1}\mid Y_1,\dots,Y_i]=0$, and from the boundedness of $c(\cdot,\cdot)$ and (5) we have $\sup_i E_e\{Y_i^2\}<\infty$. Consequently, from Proposition 1,

$\frac{\big|\sum_{i=0}^n\prod_{j=1}^k 1_{\{|u_j-\tilde u_{ij}|\le h_n\}}\big(\sum_{r=\tau_{i-1}}^{\tau_i-1}c(x_r,\tilde u_i(x_r))-E_e^{\tilde u_i}\{\sum_{r=0}^{\tau-1}c(x_r,\tilde u_i(x_r))\}\big)\big|}{F_n(u)}$

converges to 0 in probability as $n\to\infty$, uniformly in $u\in U$.
Note that under (5), by continuity of $p_v(e,j)$ with respect to $v$, the mapping

$U\ni u\mapsto E_e^u\Big\{\sum_{r=0}^{\tau-1}c(x_r,u(x_r))\Big\},$

where $U$ is endowed with the Euclidean norm, is continuous. Therefore, since $h_n\to 0$, we obtain

$\sup_{u\in U}\Big|\frac{G_n(u)}{F_n(u)}-E_e^u\Big\{\sum_{r=0}^{\tau-1}c(x_r,u(x_r))\Big\}\Big|\to 0$

in probability, which completes the proof of (10). The proof of (11) is similar: we simply let $c(\cdot,\cdot)\equiv 1$ in the previous considerations. The convergence (12) follows directly from (10) and (11) upon noticing that

$\frac{E_e^u\{\sum_{r=0}^{\tau-1}c(x_r,u(x_r))\}}{E_e^u\{\tau\}}=\sum_{\eta\in E}c(\eta,u(\eta))\pi_u(\eta),$

where the existence of a unique invariant measure $\pi_u$ and its form are guaranteed by assumption (5).
We are now in a position to formulate our first control procedure:

For a given $\varepsilon>0$ find a positive integer $n_\varepsilon$ such that

$P\Big(\sup_{u\in U}\Big|\frac{G_n(u)}{H_n(u)}-\sum_{\eta\in E}c(\eta,u(\eta))\pi_u(\eta)\Big|\ge\varepsilon\Big)\le\frac{\varepsilon}{\|c\|}$

for $n\ge n_\varepsilon$, with $\|\cdot\|$ standing for the supremum norm.

For the first $n_\varepsilon$ cycles, i.e. until time $\tau_{n_\varepsilon+1}$, test controls from the sequence $\tilde u_i$, using the Markov control $\tilde u_i$ in the time interval $[\tau_i,\tau_{i+1})$. At time $\tau_{n_\varepsilon+1}$ find a control $u_\delta$ that is $\delta$-optimal for $G_{n_\varepsilon}(u)/H_{n_\varepsilon}(u)$, i.e. such that

$\frac{G_{n_\varepsilon}(u_\delta)}{H_{n_\varepsilon}(u_\delta)}\le\inf_{u\in U}\frac{G_{n_\varepsilon}(u)}{H_{n_\varepsilon}(u)}+\delta,$

and use this control function for each $i\ge\tau_{n_\varepsilon+1}$. Denote the above control procedure by $V^c$.
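A compact end-to-end sketch of a procedure in the spirit of $V^c$ follows. Everything in it is invented for illustration: the two-state kernel `p_v`, the cost, the number of cycles, the window $h$, and the grid search over $U$; moreover, the equidistributed sequence $\tilde u_i$ is replaced by uniform random draws, which is a simplification, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(2)
k, e = 2, 0                                # states {0, 1}, recurrent state e

def p_v(i, v):
    """Hidden controlled kernel (invented for the demo), continuous in v."""
    stay = 0.2 + 0.6 * (v if i == 0 else 1.0 - v)
    return np.array([stay, 1.0 - stay]) if i == 0 else np.array([1.0 - stay, stay])

def c(i, v):
    """Running cost: state 1 is expensive, larger controls cost extra."""
    return (1.0 if i == 0 else 3.0) + v

n_cycles, h = 4000, 0.15
trials = rng.random((n_cycles, k))         # stand-in for the sequence u~_i
G = np.zeros(n_cycles)                     # cost of each return cycle
H = np.zeros(n_cycles)                     # length of each return cycle
for m, u in enumerate(trials):             # one return cycle per trial control
    x = e
    while True:
        G[m] += c(x, u[x]); H[m] += 1
        x = rng.choice(k, p=p_v(x, u[x]))
        if x == e:
            break

def ratio(u):
    """Windowed cost-rate estimate G_n(u) / H_n(u)."""
    w = np.all(np.abs(trials - u) <= h, axis=1)
    return G[w].sum() / max(H[w].sum(), 1.0)

grid = [np.array([a, b]) for a in np.linspace(0, 1, 11)
        for b in np.linspace(0, 1, 11)]
u_hat = min(grid, key=ratio)               # near delta-optimal Markov control
```

The grid minimization plays the role of finding a $\delta$-optimal $u_\delta$; with a finer grid and more cycles the estimate $G_n(u)/H_n(u)$ approaches the stationary cost rate of Proposition 2.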
Theorem 1. We have

$J_x(V^c)\le\inf_{u\in U}\Big[\sum_{\eta\in E}c(\eta,u(\eta))\pi_u(\eta)\Big]+3\varepsilon+\delta.$
Proof. Because of the form of the cost functional (1) and the boundedness of the cost function $c$, only the controls used after time $\tau_{n_\varepsilon+1}$ affect the value of the cost functional $J$, and therefore

$J_x(V^c)=E_x\Big[\sum_{\eta\in E}c(\eta,u_\delta(\eta))\pi_{u_\delta}(\eta)\Big].$

Let

$B=\Big\{\sup_{u\in U}\Big|\frac{G_{n_\varepsilon}(u)}{H_{n_\varepsilon}(u)}-\sum_{\eta\in E}c(\eta,u(\eta))\pi_u(\eta)\Big|<\varepsilon\Big\}.$

For $\omega\in B$ we have

$\sum_{\eta\in E}c(\eta,u_\delta(\eta))\pi_{u_\delta}(\eta)\le\varepsilon+\frac{G_{n_\varepsilon}(u_\delta)}{H_{n_\varepsilon}(u_\delta)}\le\varepsilon+\delta+\inf_{u\in U}\frac{G_{n_\varepsilon}(u)}{H_{n_\varepsilon}(u)}\le\inf_{u\in U}\Big[\sum_{\eta\in E}c(\eta,u(\eta))\pi_u(\eta)\Big]+2\varepsilon+\delta.$

Consequently,

$J_x(V^c)\le\inf_{u\in U}\Big[\sum_{\eta\in E}c(\eta,u(\eta))\pi_u(\eta)\Big]+2\varepsilon+\delta+\|c\|\,P(B^c)\le\inf_{u\in U}\Big[\sum_{\eta\in E}c(\eta,u(\eta))\pi_u(\eta)\Big]+3\varepsilon+\delta,$

which completes the proof.
3. Adaptive control procedure with transition probability estimation. In this section we estimate the transition probability function $p_u(i,j)$. Assume now that the Markov process $X=(x_i)$ is controlled using the Markov control $\tilde u_i$ at time $i$. For $l,l'\in E$ let (cf. (3) and (8))

$G_n^{l,l'}(u)=\sum_{i=0}^n\prod_{j=1}^k 1_{\{|u_j-\tilde u_{ij}|\le h_n\}}1_l(x_i)1_{l'}(x_{i+1})$

and

$F_n^l(u)=\sum_{i=0}^n\prod_{j=1}^k 1_{\{|u_j-\tilde u_{ij}|\le h_n\}}1_l(x_i).$

Assume now that there is $\kappa>0$ such that for $l,l'\in E$,

(13) $\inf_{v\in[0,1]}p_v(l,l')>\kappa.$
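In an implementation, $G_n^{l,l'}$ and $F_n^l$ are just windowed transition-pair and visit counts along the single observed trajectory. A minimal sketch (the helper name and calling convention are hypothetical):

```python
import numpy as np

def transition_counts(xs, controls, u, h, k):
    """Windowed counts: G[l, l'] counts pairs (x_i = l, x_{i+1} = l') and
    F[l] counts visits to l, restricted to the times i whose trial
    control is in the sup-norm h-ball around u."""
    G = np.zeros((k, k))
    F = np.zeros(k)
    for i in range(len(xs) - 1):
        if np.all(np.abs(controls[i] - u) <= h):
            F[xs[i]] += 1
            G[xs[i], xs[i + 1]] += 1
    return G, F
```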
Proposition 3. We have

$\sup_{u\in U}\Big|\frac{G_n^{l,l'}(u)}{F_n^l(u)}-p(l,l',u_l)\Big|\to 0$

in probability as $n\to\infty$, where $p(l,l',v):=p_v(l,l')$.
Proof. Let

$Y_i=1_l(x_i)1_{l'}(x_{i+1})-1_l(x_i)\,p(l,l',\tilde u_{il}).$

Clearly $E[Y_i\mid Y_1,\dots,Y_{i-1}]=0$. Therefore by Proposition 1,

(14) $\sup_{u\in U}\Big|\frac{\sum_{i=0}^n\prod_{j=1}^k 1_{\{|u_j-\tilde u_{ij}|\le h_n\}}Y_i}{F_n(u)}\Big|\to 0$

in probability as $n\to\infty$. Using Proposition 1 again, with

$Y_i'=1_l(x_{i+1})-p(x_i,l,\tilde u_{i,x_i}),$

we obtain

$\sup_{u\in U}\Big|\frac{\sum_{i=0}^n\prod_{j=1}^k 1_{\{|u_j-\tilde u_{ij}|\le h_n\}}Y_i'}{F_n(u)}\Big|\to 0.$

Therefore by (13), for $\kappa>\varepsilon>0$,

$P\Big(\inf_{u\in U}\frac{F_n^l(u)}{F_n(u)}\le\kappa-\varepsilon\Big)\le P\Big(\sup_{u\in U}\frac{|\sum_{i=0}^n\prod_{j=1}^k 1_{\{|u_j-\tilde u_{ij}|\le h_n\}}Y_i'|}{F_n(u)}\ge\varepsilon\Big)\to 0$

as $n\to\infty$, and from (14) we obtain

$\sup_{u\in U}\Big|\frac{\sum_{i=0}^n\prod_{j=1}^k 1_{\{|u_j-\tilde u_{ij}|\le h_n\}}Y_i}{F_n^l(u)}\Big|\to 0$

in probability as $n\to\infty$. Hence
$\sup_{u\in U}\Big|\frac{G_n^{l,l'}(u)}{F_n^l(u)}-\frac{\sum_{i=0}^n\prod_{j=1}^k 1_{\{|u_j-\tilde u_{ij}|\le h_n\}}1_l(x_i)\,p(l,l',\tilde u_{il})}{F_n^l(u)}\Big|\to 0$

in probability as $n\to\infty$, and by continuity of $p(l,l',v)$ with respect to $v$ we finally obtain

$\frac{\sum_{i=0}^n\prod_{j=1}^k 1_{\{|u_j-\tilde u_{ij}|\le h_n\}}1_l(x_i)\,p(l,l',\tilde u_{il})}{F_n^l(u)}\to p(l,l',u_l)$

uniformly in $u\in U$ as $n\to\infty$, which completes the proof.
Our second adaptive procedure consists of the following steps:

1. For a given $\varepsilon>0$ find $n_\varepsilon$ such that for $n\ge n_\varepsilon$ and $l,l'\in E$,

$P\Big(\sup_{u\in U}\Big|\frac{G_n^{l,l'}(u)}{F_n^l(u)}-p(l,l',u_l)\Big|>\varepsilon\Big)\le\frac{\varepsilon}{\|c\|k^2}.$

2. At time $n_\varepsilon$ normalize $G_{n_\varepsilon}^{l,l'}(u)/F_{n_\varepsilon}^l(u)$, i.e. form a new transition matrix

$\tilde p(l,l',u)=\frac{G_{n_\varepsilon}^{l,l'}(u)}{\sum_{r=1}^k G_{n_\varepsilon}^{l,r}(u)}.$

3. Then find an invariant measure $\pi_u^\varepsilon$ for the transition matrix $\tilde p(l,l',u)$ and determine $u_\delta$ such that

$\sum_\eta c(\eta,u_\delta(\eta))\pi_{u_\delta}^\varepsilon(\eta)\le\inf_{u\in U}\sum_\eta c(\eta,u(\eta))\pi_u^\varepsilon(\eta)+\delta.$

4. Starting from time $n_\varepsilon$ use the control function $u_\delta$. Denote the above control procedure by $V^p$.
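Steps 2 and 3 reduce to row-normalizing the count matrix and extracting a stationary distribution. A sketch under the same illustrative conventions (eigenvector extraction is one standard way to obtain the invariant measure; the paper does not prescribe a particular method):

```python
import numpy as np

def normalize_rows(G):
    """Step 2: turn windowed pair counts G[l, l'] into a stochastic matrix."""
    return G / G.sum(axis=1, keepdims=True)

def invariant_measure(P):
    """Step 3: stationary distribution, i.e. left eigenvector of P at 1."""
    w, v = np.linalg.eig(P.T)
    pi = np.real(v[:, np.argmin(np.abs(w - 1.0))])
    return pi / pi.sum()
```

Step 3 then evaluates $\sum_\eta c(\eta,u(\eta))\pi_u^\varepsilon(\eta)$ over candidate controls $u$ and keeps a near-minimizer $u_\delta$.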
Theorem 2. Under (13) we have

$J(V^p)\le\inf_{u\in U}\Big[\sum_{\eta\in E}c(\eta,u(\eta))\pi_u(\eta)\Big]+\|c\|\,\frac{(1+k)\varepsilon}{(1-k\varepsilon)\kappa}+\delta+\varepsilon,$

where $\pi_u$ is the unique invariant measure corresponding to $p(l,l',u_l)$.
Proof. If for each $l,l'\in E$,

$\sup_{u\in U}\Big|\frac{G_n^{l,l'}(u)}{F_n^l(u)}-p(l,l',u_l)\Big|\le\varepsilon,$

then we have

$|\tilde p(l,l',u)-p(l,l',u_l)|=\frac{F_n^l(u)}{\sum_{r=1}^k G_n^{l,r}(u)}\,\Big|\frac{G_n^{l,l'}(u)}{F_n^l(u)}-p(l,l',u_l)\sum_{r=1}^k\frac{G_n^{l,r}(u)}{F_n^l(u)}\Big|$

$\le\Big(\varepsilon+p(l,l',u_l)\Big|1-\sum_{r=1}^k\frac{G_n^{l,r}(u)}{F_n^l(u)}\Big|\Big)\frac{1}{1-k\varepsilon}\le(1+k)\varepsilon\,\frac{1}{1-k\varepsilon}.$
From the Theorem and Corollary 2 of [7], under (13), we see that for $l\in E$,

$\sup_{u\in U}|\pi_u^\varepsilon(l)-\pi_u(l)|\le\frac12\cdot\frac{(1+k)\varepsilon}{(1-k\varepsilon)\kappa}.$

Therefore for $\omega\in\bigcap_{l=1}^k\bigcap_{l'=1}^k B_{ll'}$, where

$B_{ll'}=\Big\{\sup_{u\in U}\Big|\frac{G_n^{l,l'}(u)}{F_n^l(u)}-p(l,l',u_l)\Big|\le\varepsilon\Big\},$

we have

$\sum_{\eta\in E}c(\eta,u_\delta(\eta))\pi_{u_\delta}^\varepsilon(\eta)\le\inf_{u\in U}\sum_{\eta\in E}c(\eta,u(\eta))\pi_u(\eta)+\|c\|\,\frac{(1+k)\varepsilon}{(1-k\varepsilon)\kappa}+\delta.$

Consequently,

$J(V^p)=E\Big[\sum_{\eta\in E}c(\eta,u_\delta(\eta))\pi_{u_\delta}^\varepsilon(\eta)\Big]\le\inf_{u\in U}\sum_{\eta\in E}c(\eta,u(\eta))\pi_u(\eta)+\|c\|\,\frac{(1+k)\varepsilon}{(1-k\varepsilon)\kappa}+\delta+\|c\|\,P\Big(\Omega\setminus\bigcap_{l=1}^k\bigcap_{l'=1}^k B_{ll'}\Big)\le\inf_{u\in U}\sum_{\eta\in E}c(\eta,u(\eta))\pi_u(\eta)+\|c\|\,\frac{(1+k)\varepsilon}{(1-k\varepsilon)\kappa}+\delta+\varepsilon.$