
E. DRABIK (Białystok), L. STETTNER (Warszawa)

ON ADAPTIVE CONTROL OF MARKOV CHAINS USING NONPARAMETRIC ESTIMATION

Abstract. Two adaptive procedures for controlled Markov chains, based on nonparametric window estimation, are presented.

1. Introduction. Assume that on a probability space (Ω, F, P) we are given a discrete time controlled Markov process X = (x_i) with values in a finite state space E = {1, ..., k} and with an unknown transition matrix p^v(i, j) depending on a control parameter v ∈ [0, 1]. Assume furthermore that for i, j ∈ E the mapping [0, 1] ∋ v ↦ p^v(i, j) is continuous.

Our purpose is to minimize the following average cost per unit time functional:

$$J_x(V) = \limsup_{n\to\infty} \frac{1}{n}\, E_x^V\Big\{ \sum_{i=1}^n c(x_i, v_i) \Big\} \tag{1}$$

over all sequences V = (v_i) of [0, 1]-valued σ{x_0, ..., x_i}-measurable random variables, where E_x^V stands for the conditional expected value given that the controlled process (x_i) starts from the state x and the control V is used, and c : E × [0, 1] → ℝ is a continuous function which measures the running cost.

An element u = [u_1, ..., u_k] of the set U = [0, 1]^k will later be interpreted as a Markov control, in the sense that we shall use a control parameter equal to u_j when the state process x_i is in the state j.

1991 Mathematics Subject Classification: 93E20, 93C40, 62M05.

Key words and phrases: adaptive control, controlled Markov chain, estimation.

The work was supported by KBN grant no. 2 P03A 01515.
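To make the setting concrete, here is a minimal Python sketch that simulates a controlled chain under a fixed Markov control u and estimates the functional (1) by a long-run average. The particular transition family p_v, the cost c and all identifiers are our own illustrative stand-ins, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 3  # state space E = {0, ..., k-1}

def p_v(v):
    """Hypothetical transition matrix p^v, continuous (here affine) in v."""
    base = np.full((k, k), 1.0 / k)
    tilt = 0.2 * np.array([[-1.0, 0.0, 1.0], [1.0, -1.0, 0.0], [0.0, 1.0, -1.0]])
    return base + v * tilt  # rows stay nonnegative and sum to 1 for v in [0, 1]

def c(i, v):
    """Hypothetical running cost c(i, v), continuous in v."""
    return (i + 1) * (v - 0.5) ** 2 + 0.1 * i

def average_cost(u, x0=0, n=100_000):
    """Empirical version of J_x(V) under the Markov control u in U = [0,1]^k."""
    x, total = x0, 0.0
    for _ in range(n):
        v = u[x]                        # control parameter used in state x
        total += c(x, v)
        x = rng.choice(k, p=p_v(v)[x])  # one step of the controlled chain
    return total / n

print(average_cost(np.array([0.5, 0.2, 0.8])))
```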


Given a nondecreasing sequence {b_n}_{n∈ℕ} of positive integers such that b_n → ∞ as n → ∞, define the set

$$\Phi(\{b_n\}_{n\in\mathbb{N}}) = \Big\{ (a_{ni}),\ i = 1, \ldots, n,\ n = 1, 2, \ldots :\ a_{ni} \in \{0, 1\},\ \sum_{i=1}^n a_{ni} \ge b_n \Big\}.$$

The following auxiliary result will be used to justify the control procedures introduced in Sections 2 and 3.

Proposition 1. Let (Y_i) be a sequence of real-valued random variables such that E[Y_{i+1} | Y_1, ..., Y_i] = 0 and M = sup_i E{Y_i²} < ∞. Then

$$\sup_{(a_{ni})\in\Phi(\{b_n\})} \Big| \frac{\sum_{i=1}^n a_{ni} Y_i}{\sum_{i=1}^n a_{ni}} \Big| \to 0 \tag{2}$$

in probability as n → ∞.

Proof. Assume, contrary to (2), that

$$P\Big\{ \Big| \frac{\sum_{i=1}^n a_{ni} Y_i}{\sum_{i=1}^n a_{ni}} \Big| \ge \varepsilon \Big\}$$

does not converge to 0 as n → ∞ for some (a_{ni}) ∈ Φ({b_n}). Since the Y_i are martingale differences, the mixed terms in E[(Σ_{i=1}^n a_{ni} Y_i)²] vanish, and a_{ni}² = a_{ni}; hence by the Chebyshev inequality

$$P\Big\{ \Big| \frac{\sum_{i=1}^n a_{ni} Y_i}{\sum_{i=1}^n a_{ni}} \Big| \ge \varepsilon \Big\} \le \frac{E\big[\big(\sum_{i=1}^n a_{ni} Y_i\big)^2\big]}{\varepsilon^2 \big(\sum_{i=1}^n a_{ni}\big)^2} = \frac{\sum_{i=1}^n a_{ni}\, E(Y_i^2)}{\varepsilon^2 \big(\sum_{i=1}^n a_{ni}\big)^2} \le \frac{M}{\varepsilon^2 \sum_{i=1}^n a_{ni}} \le \frac{M}{b_n \varepsilon^2} \to 0,$$

and we have a contradiction.
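Proposition 1 can be checked numerically. In the sketch below (our illustration), Gaussian variables stand in for the martingale differences Y_i, and a random 0–1 selection array with b_n ones plays the role of (a_{ni}); the weighted averages decay roughly like 1/√b_n, consistent with the Chebyshev bound M/(ε²b_n).

```python
import numpy as np

rng = np.random.default_rng(1)

def weighted_average(n, b_n):
    """|sum_i a_ni Y_i| / sum_i a_ni for one selection array with b_n ones."""
    Y = rng.standard_normal(n)  # stand-ins: E[Y_{i+1} | past] = 0, E{Y_i^2} = 1
    a = np.zeros(n)
    a[rng.choice(n, size=b_n, replace=False)] = 1.0
    return abs(a @ Y) / a.sum()

for n, b_n in [(10**3, 10), (10**4, 100), (10**5, 1000)]:
    estimates = [weighted_average(n, b_n) for _ in range(50)]
    print(n, b_n, np.mean(estimates))  # shrinks as b_n grows
```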

Let ũ_i ∈ U, i = 0, 1, ..., be a sequence equidistributed in U. Given a sequence h_n ↘ 0, for u = [u_1, ..., u_k] ∈ U and ũ_{ij} denoting the jth coordinate of ũ_i, define

$$F_n(u) = \sum_{i=0}^n \prod_{j=1}^k 1_{\{|u_j - \tilde u_{ij}| \le h_n\}}. \tag{3}$$

In what follows we shall assume that h_n is chosen such that

$$f_n := \min_{u\in U} F_n(u) \to \infty \tag{4}$$

as n → ∞. In particular, we can choose for ũ_i successively the centers of the cubes with edges of length 1/2^j which cover the set U = [0, 1]^k for j = 0, 1, ..., and consider the sequences ũ_i and h_i of the form

$$\tilde u_0 = \big(\tfrac12, \ldots, \tfrac12\big), \qquad h_0 = \tfrac12,$$
$$\tilde u_1 = \big(\tfrac14, \tfrac14, \ldots, \tfrac14\big), \quad \tilde u_2 = \big(\tfrac34, \tfrac14, \ldots, \tfrac14\big), \quad \ldots, \quad \tilde u_{2^k} = \big(\tfrac34, \tfrac34, \ldots, \tfrac34\big), \qquad h_1 = \ldots = h_{2^k} = \tfrac13,$$
$$\tilde u_{2^k+1} = \big(\tfrac18, \tfrac18, \ldots, \tfrac18\big), \quad \ldots, \quad \tilde u_{2^k+4^k} = \big(\tfrac78, \tfrac78, \ldots, \tfrac78\big), \qquad h_{2^k+1} = \ldots = h_{2^k+4^k} = \tfrac14,$$
$$\tilde u_{2^k+4^k+1} = \big(\tfrac1{16}, \tfrac1{16}, \ldots, \tfrac1{16}\big), \quad \ldots, \quad \tilde u_{2^k+4^k+8^k} = \big(\tfrac{15}{16}, \tfrac{15}{16}, \ldots, \tfrac{15}{16}\big), \qquad h_{2^k+4^k+1} = \ldots = h_{2^k+4^k+8^k} = \tfrac15,$$

and so on.
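A sketch of this construction (illustrative code, not from the paper): the generator below enumerates the cube centers ũ_i level by level, with the window widths h_i given above, and then counts F_n(u) of (3) for a given u.

```python
import itertools
import numpy as np

def test_sequence(k, levels):
    """Yield (u~_i, h_i): centers of cubes with edge 1/2^j and h = 1/(j+2)."""
    for j in range(levels + 1):
        h = 1.0 / (j + 2)                 # h = 1/2, 1/3, 1/4, 1/5, ...
        step = 1.0 / 2 ** j
        centers = [step / 2 + m * step for m in range(2 ** j)]
        for point in itertools.product(centers, repeat=k):
            yield np.array(point), h

def F_n(u, seq):
    """Window count (3): how many u~_i satisfy max_j |u_j - u~_ij| <= h_n."""
    h_n = seq[-1][1]
    return sum(int(np.max(np.abs(u - p)) <= h_n) for p, _ in seq)

seq = list(test_sequence(k=2, levels=3))  # 1 + 4 + 16 + 64 points for k = 2
print(len(seq), F_n(np.array([0.3, 0.7]), seq))
```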

In the theory of adaptive control of Markov processes the number of feasible procedures is very limited (see the papers [2], [4] and [5]). A recursive self-tuning algorithm proposed in [2] is based on asymptotic properties of ordinary differential equations and requires differentiability of the transition operator with respect to an unknown parameter. Other methods, which use a discretized MLE ([4]) or the theory of large deviations ([5]), require the construction of a finite class of ε-optimal controls, which is usually a hard problem.

In this paper we propose an alternative approach, based on the window nonparametric estimation used for multiarmed bandit problems in [1]. Assuming that our model is uniformly ergodic (assumptions (5) and (13)), we are able, although we do not know the transition probabilities, to construct adaptive procedures for which we obtain self-optimality.

The paper consists of three sections. In Section 2 we introduce an adaptive procedure based on nonparametric estimation of the cost functional. In Section 3 another procedure, based on nonparametric estimation of the transition kernel, is considered.

2. Adaptive control with cost estimation. Assume there exists a uniformly positive recurrent state e ∈ E of the Markov process X in the sense that

$$\sup_{u\in U} E_e^u\{\tau^2\} < \infty, \tag{5}$$

where

$$\tau = \inf\{i > 0 : x_i = e\}. \tag{6}$$


Note that the above property holds in particular when

$$\inf_{v\in[0,1]} \inf_{j\in E} p^v(j, e) > 0.$$

Let

$$\tau_1 = \tau, \qquad \tau_{n+1} = \tau_n + \tau \circ \Theta_{\tau_n}, \tag{7}$$

with τ defined in (6) and Θ being the Markov shift operator. In other words, the τ_n are the moments of successive returns to the recurrent state e.

Assume now that in the time interval [τ_i, τ_{i+1}) we use the Markov control ũ_i. For u ∈ U define

$$G_n(u) = \sum_{i=0}^n \prod_{j=1}^k 1_{\{|u_j - \tilde u_{ij}| \le h_n\}} \sum_{r=\tau_i}^{\tau_{i+1}-1} c(x_r, \tilde u_i(x_r)) \tag{8}$$

and

$$H_n(u) = \sum_{i=0}^n \prod_{j=1}^k 1_{\{|u_j - \tilde u_{ij}| \le h_n\}} (\tau_{i+1} - \tau_i). \tag{9}$$

Notice that G_n(u) is the total cost incurred in the time interval [0, τ_{n+1}) during those cycles in which the control ũ_i lies in the closed ball with center u and radius h_n. Similarly, H_n(u) is the total time during which the control from the sequence (ũ_i) lies in the closed ball with center u and radius h_n.
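For illustration, the following sketch (our own code; the transition family, the cost, and the i.i.d. uniform test controls standing in for the equidistributed sequence ũ_i are all assumptions) accumulates G_n(u), H_n(u) and the window count over simulated cycles between returns to e, so that G_n(u)/H_n(u) approximates the stationary cost appearing in (12) below.

```python
import numpy as np

rng = np.random.default_rng(2)
k, e, h_n = 3, 0, 0.25             # states {0,...,k-1}, recurrent state e, window

def p_v(v):                        # illustrative family with p^v(j, e) >= 1/3 > 0
    M = np.full((k, k), 1.0 / k)
    M[:, e] += 0.3 * (1 - v)
    M[:, (e + 1) % k] -= 0.3 * (1 - v)
    return M

def c(i, v):                       # illustrative running cost
    return (i + 1) * (v - 0.5) ** 2

def cycle(u_t):
    """One excursion e -> e under the Markov control u_t: (cost, length)."""
    x, cost, length = e, 0.0, 0
    while True:
        cost += c(x, u_t[x])
        length += 1
        x = rng.choice(k, p=p_v(u_t[x])[x])
        if x == e:
            return cost, length

u = np.array([0.4, 0.6, 0.5])      # point at which (8) and (9) are evaluated
G = H = F = 0.0
for _ in range(5000):
    u_t = rng.random(k)            # i.i.d. uniform stand-in for the sequence u~_i
    g, t = cycle(u_t)
    if np.max(np.abs(u_t - u)) <= h_n:   # indicator prod_j 1{|u_j - u~_ij| <= h_n}
        G, H, F = G + g, H + t, F + 1
print(F, G / H)  # G_n(u)/H_n(u) approximates the stationary cost in (12)
```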

Proposition 2. We have

$$\sup_{u\in U} \Big| \frac{G_n(u)}{F_n(u)} - E_e^u\Big\{ \sum_{i=0}^{\tau-1} c(x_i, u(x_i)) \Big\} \Big| \to 0 \tag{10}$$

and

$$\sup_{u\in U} \Big| \frac{H_n(u)}{F_n(u)} - E_e^u\{\tau\} \Big| \to 0 \tag{11}$$

in probability as n → ∞, and consequently

$$\sup_{u\in U} \Big| \frac{G_n(u)}{H_n(u)} - \sum_{\eta\in E} c(\eta, u(\eta))\, \pi^u(\eta) \Big| \to 0 \tag{12}$$

in probability as n → ∞, where π^u is the unique invariant measure corresponding to the Markov process X with Markov control u.

Proof. Let

$$Y_i = \sum_{r=\tau_{i-1}}^{\tau_i - 1} c(x_r, \tilde u_i(x_r)) - E_e^{\tilde u_i}\Big\{ \sum_{r=0}^{\tau-1} c(x_r, \tilde u_i(x_r)) \Big\},$$

with τ_0 = 0. Clearly E[Y_{i+1} | Y_1, ..., Y_i] = 0, and from the boundedness of c(·, ·) and (5) we have sup_i E_e Y_i² < ∞. Consequently, from Proposition 1,

$$\frac{\Big| \sum_{i=0}^n \prod_{j=1}^k 1_{\{|u_j - \tilde u_{ij}| \le h_n\}} \Big( \sum_{r=\tau_{i-1}}^{\tau_i - 1} c(x_r, \tilde u_i(x_r)) - E_e^{\tilde u_i}\Big\{ \sum_{r=0}^{\tau-1} c(x_r, \tilde u_i(x_r)) \Big\} \Big) \Big|}{F_n(u)}$$

converges to 0 in probability as n → ∞, uniformly in u ∈ U.

Note that under (5), by continuity of p^v(e, j) with respect to v, the mapping

$$U \ni u \mapsto E_e^u\Big\{ \sum_{r=0}^{\tau-1} c(x_r, u(x_r)) \Big\},$$

where U is endowed with the Euclidean norm, is continuous. Therefore, since h_n → 0, we obtain

$$\sup_{u\in U} \Big| \frac{G_n(u)}{F_n(u)} - E_e^u\Big\{ \sum_{r=0}^{\tau-1} c(x_r, u(x_r)) \Big\} \Big| \to 0$$

in probability, which completes the proof of (10). The proof of (11) is similar: we simply let c(·, ·) ≡ 1 in the previous considerations. The convergence (12) follows directly from (10) and (11) upon noticing that

$$\frac{E_e^u\big\{ \sum_{r=0}^{\tau-1} c(x_r, u(x_r)) \big\}}{E_e^u\{\tau\}} = \sum_{\eta\in E} c(\eta, u(\eta))\, \pi^u(\eta),$$

where the existence of a unique invariant measure π^u and its form are guaranteed by assumption (5).

We are now in a position to formulate our first control procedure:

For a given ε > 0 find a positive integer n_ε such that

$$P\Big\{ \sup_{u\in U} \Big| \frac{G_n(u)}{H_n(u)} - \sum_{\eta\in E} c(\eta, u(\eta))\, \pi^u(\eta) \Big| \ge \varepsilon \Big\} \le \frac{\varepsilon}{\|c\|}$$

for n ≥ n_ε, with ‖·‖ standing for the supremum norm.

For the first n_ε cycles, i.e. until time τ_{n_ε+1}, test controls from the sequence (ũ_i), using the Markov control ũ_i in the time interval [τ_i, τ_{i+1}). At time τ_{n_ε+1} find a control u_δ that is δ-optimal for G_{n_ε}(u)/H_{n_ε}(u), i.e. such that

$$\frac{G_{n_\varepsilon}(u_\delta)}{H_{n_\varepsilon}(u_\delta)} \le \inf_{u\in U} \frac{G_{n_\varepsilon}(u)}{H_{n_\varepsilon}(u)} + \delta,$$

and use this control function for each i ≥ τ_{n_ε+1}. Denote the above control procedure by V_c.
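An end-to-end sketch of V_c follows (our code; the transition family, the cost, the finite grid replacing the infimum over U, and the i.i.d. uniform test controls replacing the deterministic sequence ũ_i are all illustrative assumptions). It runs the exploration cycles, then selects a δ-optimal u for G_n(u)/H_n(u).

```python
import itertools
import numpy as np

rng = np.random.default_rng(4)
k, e, h_n = 2, 0, 0.2

def p_v(v):                        # illustrative transition family
    M = np.full((k, k), 0.5)
    M[:, 0] += 0.3 * (v - 0.5)
    M[:, 1] -= 0.3 * (v - 0.5)
    return M

def c(i, v):                       # illustrative cost: minima at v = 0.3 and 0.8
    return (v - 0.3) ** 2 if i == 0 else (v - 0.8) ** 2 + 0.2

def cycle(u_t):                    # one excursion e -> e under Markov control u_t
    x, cost, length = e, 0.0, 0
    while True:
        cost += c(x, u_t[x])
        length += 1
        x = rng.choice(k, p=p_v(u_t[x])[x])
        if x == e:
            return cost, length

# exploration phase: n_eps cycles driven by test controls
data = [(u_t, *cycle(u_t)) for u_t in rng.random((20_000, k))]

# selection phase: a delta-optimal u for G_n(u)/H_n(u) over a grid on U
def ratio(u):
    hits = [(g, t) for u_t, g, t in data if np.max(np.abs(u_t - u)) <= h_n]
    return sum(g for g, _ in hits) / sum(t for _, t in hits) if hits else np.inf

grid = [np.array(g) for g in itertools.product(np.linspace(0.1, 0.9, 9), repeat=k)]
u_delta = min(grid, key=ratio)
print(u_delta, ratio(u_delta))     # control used for all times >= tau_{n_eps + 1}
```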

Theorem 1. We have

$$J_x(V_c) \le \inf_{u\in U} \Big[ \sum_{\eta\in E} c(\eta, u(\eta))\, \pi^u(\eta) \Big] + 3\varepsilon + \delta.$$

Proof. Because of the form of the cost functional (1) and the boundedness of the cost function c, only controls used after time τ_{n_ε+1} have an effect on the value of the cost functional J, and therefore

$$J_x(V_c) = E_x\Big[ \sum_{\eta\in E} c(\eta, u_\delta(\eta))\, \pi^{u_\delta}(\eta) \Big].$$

Let

$$B = \Big\{ \sup_{u\in U} \Big| \frac{G_{n_\varepsilon}(u)}{H_{n_\varepsilon}(u)} - \sum_{\eta\in E} c(\eta, u(\eta))\, \pi^u(\eta) \Big| < \varepsilon \Big\}.$$

For ω ∈ B we have

$$\sum_{\eta\in E} c(\eta, u_\delta(\eta))\, \pi^{u_\delta}(\eta) \le \varepsilon + \frac{G_{n_\varepsilon}(u_\delta)}{H_{n_\varepsilon}(u_\delta)} \le \varepsilon + \delta + \inf_{u\in U} \frac{G_{n_\varepsilon}(u)}{H_{n_\varepsilon}(u)} \le \inf_{u\in U} \Big[ \sum_{\eta\in E} c(\eta, u(\eta))\, \pi^u(\eta) \Big] + 2\varepsilon + \delta.$$

Consequently,

$$J_x(V_c) \le \inf_{u\in U} \Big[ \sum_{\eta\in E} c(\eta, u(\eta))\, \pi^u(\eta) \Big] + 2\varepsilon + \delta + \|c\|\, P(B^c) \le \inf_{u\in U} \Big[ \sum_{\eta\in E} c(\eta, u(\eta))\, \pi^u(\eta) \Big] + 3\varepsilon + \delta,$$

which completes the proof.

3. Adaptive control procedure with transition probability estimation. In this section we estimate the transition probability function p^u(i, j). Assume now that the Markov process X = (x_i) is controlled using the control ũ_i at time i. For l, l′ ∈ E let (cf. (3) and (8))

$$G_n^{l,l'}(u) = \sum_{i=0}^n \prod_{j=1}^k 1_{\{|u_j - \tilde u_{ij}| \le h_n\}}\, 1_l(x_i)\, 1_{l'}(x_{i+1})$$

and

$$F_n^l(u) = \sum_{i=0}^n \prod_{j=1}^k 1_{\{|u_j - \tilde u_{ij}| \le h_n\}}\, 1_l(x_i).$$


Assume now that there is κ > 0 such that for l, l′ ∈ E,

$$\inf_{v\in[0,1]} p^v(l, l') > \kappa. \tag{13}$$

Proposition 3. We have

$$\sup_{u\in U} \Big| \frac{G_n^{l,l'}(u)}{F_n^l(u)} - p(l, l', u_l) \Big| \to 0$$

in probability as n → ∞, where p(l, l′, v) := p^v(l, l′).

Proof. Let

$$Y_i = 1_l(x_i)\, 1_{l'}(x_{i+1}) - 1_l(x_i)\, p(l, l', \tilde u_{il}).$$

Clearly E[Y_i | Y_1, ..., Y_{i−1}] = 0. Therefore by Proposition 1,

$$\sup_{u\in U} \frac{\big| \sum_{i=0}^n \prod_{j=1}^k 1_{\{|u_j - \tilde u_{ij}| \le h_n\}} Y_i \big|}{F_n(u)} \to 0 \tag{14}$$

in probability as n → ∞. Using Proposition 1 again, with Y_i′ = 1_l(x_{i+1}) − p(x_i, l, ũ_{i,x_i}), we obtain

$$\sup_{u\in U} \frac{\big| \sum_{i=0}^n \prod_{j=1}^k 1_{\{|u_j - \tilde u_{ij}| \le h_n\}} Y_i' \big|}{F_n(u)} \to 0.$$

Therefore by (13), for κ > ε > 0,

$$P\Big\{ \inf_{u\in U} \frac{F_n^l(u)}{F_n(u)} \le \kappa - \varepsilon \Big\} \le P\Big\{ \sup_{u\in U} \frac{\big| \sum_{i=0}^n \prod_{j=1}^k 1_{\{|u_j - \tilde u_{ij}| \le h_n\}} Y_i' \big|}{F_n(u)} \ge \varepsilon \Big\} \to 0$$

as n → ∞, and from (14) we obtain

$$\sup_{u\in U} \frac{\big| \sum_{i=0}^n \prod_{j=1}^k 1_{\{|u_j - \tilde u_{ij}| \le h_n\}} Y_i \big|}{F_n^l(u)} \to 0$$

in probability as n → ∞. Hence

$$\sup_{u\in U} \Big| \frac{G_n^{l,l'}(u)}{F_n^l(u)} - \frac{\sum_{i=0}^n \prod_{j=1}^k 1_{\{|u_j - \tilde u_{ij}| \le h_n\}}\, 1_l(x_i)\, p(l, l', \tilde u_{il})}{F_n^l(u)} \Big| \to 0$$

in probability as n → ∞, and by continuity of p(l, l′, v) with respect to v we finally obtain

$$\frac{\sum_{i=0}^n \prod_{j=1}^k 1_{\{|u_j - \tilde u_{ij}| \le h_n\}}\, 1_l(x_i)\, p(l, l', \tilde u_{il})}{F_n^l(u)} \to p(l, l', u_l)$$

uniformly in u ∈ U as n → ∞, which completes the proof.


Our second adaptive procedure consists of the following steps:

1. For a given ε > 0 find n_ε such that for n ≥ n_ε and l, l′ ∈ E,

$$P\Big\{ \sup_{u\in U} \Big| \frac{G_n^{l,l'}(u)}{F_n^l(u)} - p(l, l', u_l) \Big| > \varepsilon \Big\} \le \frac{\varepsilon}{\|c\|\, k^2}.$$

2. At time n_ε normalize G_{n_ε}^{l,l′}(u)/F_{n_ε}^l(u), i.e. form a new transition matrix

$$\tilde p(l, l', u) = \frac{G_{n_\varepsilon}^{l,l'}(u)}{\sum_{r=1}^k G_{n_\varepsilon}^{l,r}(u)}.$$

3. Then find an invariant measure π_ε^u for the transition matrix p̃(l, l′, u) and determine u_δ such that

$$\sum_{\eta} c(\eta, u_\delta(\eta))\, \pi_\varepsilon^{u_\delta}(\eta) \le \inf_{u\in U} \sum_{\eta} c(\eta, u(\eta))\, \pi_\varepsilon^u(\eta) + \delta.$$

4. Starting from time n_ε use the control function u_δ. Denote the above control procedure by V_p.
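A sketch of V_p follows (again our code, under stand-in assumptions: an illustrative transition family and cost, i.i.d. uniform test controls in place of the deterministic sequence ũ_i, and a grid in place of the infimum over U). It forms the window estimates G_n^{l,l′}(u), normalizes them as in step 2, and minimizes the stationary cost of the estimated chain.

```python
import itertools
import numpy as np

rng = np.random.default_rng(5)
k, h_n, n = 2, 0.2, 100_000

def p_v(v):                        # illustrative family satisfying (13)
    M = np.full((k, k), 0.5)
    M[:, 0] += 0.3 * (v - 0.5)
    M[:, 1] -= 0.3 * (v - 0.5)
    return M

def c(i, v):
    return (v - 0.3) ** 2 if i == 0 else (v - 0.8) ** 2 + 0.2

# exploration: at time i the chain is controlled by the test control u~_i
xs = np.zeros(n + 1, dtype=int)
us = rng.random((n, k))
for i in range(n):
    xs[i + 1] = rng.choice(k, p=p_v(us[i, xs[i]])[xs[i]])

def p_tilde(u):
    """Steps 1-2: window counts G_n^{l,l'}(u), normalized row-wise."""
    G = np.zeros((k, k))
    mask = np.max(np.abs(us - u), axis=1) <= h_n
    for l, lp in itertools.product(range(k), repeat=2):
        G[l, lp] = np.sum(mask & (xs[:n] == l) & (xs[1:] == lp))
    return G / G.sum(axis=1, keepdims=True)

def stationary_cost(u):
    """Step 3: cost under the invariant measure of the estimated matrix."""
    P = p_tilde(u)
    w, V = np.linalg.eig(P.T)
    pi = np.real(V[:, np.argmax(np.real(w))])
    pi /= pi.sum()
    return sum(c(l, u[l]) * pi[l] for l in range(k))

grid = [np.array(g) for g in itertools.product(np.linspace(0.1, 0.9, 9), repeat=k)]
u_delta = min(grid, key=stationary_cost)
print(u_delta, stationary_cost(u_delta))  # step 4: use u_delta from n_eps on
```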

Theorem 2. Under (13) we have

$$J(V_p) \le \inf_{u\in U} \Big[ \sum_{\eta\in E} c(\eta, u(\eta))\, \pi^u(\eta) \Big] + \|c\|\, \frac{(1 + k)\varepsilon}{(1 - k\varepsilon)\kappa} + \delta + \varepsilon,$$

where π^u is the unique invariant measure corresponding to p(l, l′, u_l).

Proof. If for each l, l′ ∈ E,

$$\sup_{u\in U} \Big| \frac{G_{n_\varepsilon}^{l,l'}(u)}{F_{n_\varepsilon}^l(u)} - p(l, l', u_l) \Big| \le \varepsilon,$$

then we have

$$|\tilde p(l, l', u) - p(l, l', u_l)| = \frac{F_{n_\varepsilon}^l(u)}{\sum_{r=1}^k G_{n_\varepsilon}^{l,r}(u)} \Big| \frac{G_{n_\varepsilon}^{l,l'}(u)}{F_{n_\varepsilon}^l(u)} - p(l, l', u_l) \sum_{r=1}^k \frac{G_{n_\varepsilon}^{l,r}(u)}{F_{n_\varepsilon}^l(u)} \Big|$$
$$\le \Big( \varepsilon + p(l, l', u_l) \Big| 1 - \sum_{r=1}^k \frac{G_{n_\varepsilon}^{l,r}(u)}{F_{n_\varepsilon}^l(u)} \Big| \Big) \frac{1}{1 - k\varepsilon} \le (1 + k)\varepsilon\, \frac{1}{1 - k\varepsilon}.$$

From the Theorem and Corollary 2 of [7], under (13) we see that for l ∈ E,

$$\sup_{u\in U} |\pi_\varepsilon^u(l) - \pi^u(l)| \le \frac{1}{2} \cdot \frac{(1 + k)\varepsilon}{(1 - k\varepsilon)\kappa}.$$

Therefore for ω ∈ ⋂_{l=1}^k ⋂_{l′=1}^k B_{ll′}, where

$$B_{ll'} = \Big\{ \sup_{u\in U} \Big| \frac{G_{n_\varepsilon}^{l,l'}(u)}{F_{n_\varepsilon}^l(u)} - p(l, l', u_l) \Big| \le \varepsilon \Big\},$$

we have

$$\sum_{\eta\in E} c(\eta, u_\delta(\eta))\, \pi_\varepsilon^{u_\delta}(\eta) \le \inf_{u\in U} \sum_{\eta\in E} c(\eta, u(\eta))\, \pi^u(\eta) + \|c\|\, \frac{(1 + k)\varepsilon}{(1 - k\varepsilon)\kappa} + \delta.$$

Consequently,

$$J(V_p) = E\Big[ \sum_{\eta\in E} c(\eta, u_\delta(\eta))\, \pi_\varepsilon^{u_\delta}(\eta) \Big] \le \inf_{u\in U} \sum_{\eta\in E} c(\eta, u(\eta))\, \pi^u(\eta) + \|c\|\, \frac{(1 + k)\varepsilon}{(1 - k\varepsilon)\kappa} + \delta + \|c\|\, P\Big( \Omega \setminus \bigcap_{l=1}^k \bigcap_{l'=1}^k B_{ll'} \Big)$$
$$\le \inf_{u\in U} \sum_{\eta\in E} c(\eta, u(\eta))\, \pi^u(\eta) + \|c\|\, \frac{(1 + k)\varepsilon}{(1 - k\varepsilon)\kappa} + \delta + \varepsilon,$$

which completes the proof.

Remark. The adaptive procedures introduced in Sections 2 and 3 allow one to determine a nearly optimal control in finite time. Using forcing and an increasing decision horizon, as in Section 3 of [3], it is possible to construct optimal adaptive strategies from both procedures.

References

[1] R. Agrawal, The continuum-armed bandit problem, SIAM J. Control Optim. 33 (1995), 1926–1951.

[2] V. S. Borkar, Recursive self-tuning of finite Markov chains, Appl. Math. (Warsaw) 24 (1996), 169–188.

[3] E. Drabik, On nearly selfoptimizing strategies for multiarmed bandit problems with controlled arms, ibid. 23 (1996), 449–473.

[4] T. Duncan, B. Pasik-Duncan and L. Stettner, Discretized maximum likelihood and almost optimal adaptive control of ergodic Markov models, SIAM J. Control Optim. 36 (1998), 422–446.

[5] —, —, —, Adaptive control of discrete Markov processes by the method of large deviations, in: Proc. 35th IEEE CDC, Kobe, 1996, IEEE, 360–365.

[6] O. Hernández-Lerma and R. Cavazos-Cadena, Density estimation and adaptive control of Markov processes: average and discounted criteria, Acta Appl. Math. 20 (1990), 285–307.

[7] A. Nowak, A generalization of Ueno's inequality for n-step transition probabilities, Appl. Math. (Warsaw) 25 (1998), 295–299.

Ewa Drabik
Faculty of Economics
University of Białystok
Warszawska 63
15-062 Białystok, Poland

Łukasz Stettner
Institute of Mathematics
Polish Academy of Sciences
Śniadeckich 8
00-950 Warszawa, Poland
E-mail: stettner@impan.gov.pl
and
Warsaw School of Management and Marketing

Received on 13.11.1998;

revised version on 27.8.1999
