L. STETTNER (Warszawa)

ERGODIC CONTROL OF PARTIALLY OBSERVED MARKOV PROCESSES WITH EQUIVALENT TRANSITION PROBABILITIES

Abstract. Optimal control with a long run average cost functional of a partially observed Markov process is considered. Under the assumption that the transition probabilities are equivalent, the existence of a solution to the Bellman equation is shown; this solution is then used to construct optimal strategies.
1. Introduction. Let $(\Omega, \mathcal F, P)$ be a probability space and $(x_n)$ a discrete time controlled Markov process on a compact state space $E$, endowed with the Borel $\sigma$-field $\mathcal E$, with transition kernel $P^v(x, dz)$ for $v \in U$, where $(U, \mathcal U)$ is a compact space of control parameters. Assume the only observations of $x_n$ are $\mathbb R^d$-valued random variables $y_1, \dots, y_n$ such that for $Y_n = \sigma\{y_1, \dots, y_n\}$ we have

(1)    $P\{y_{n+1} \in A \mid x_{n+1}, Y_n\} = P\{y_{n+1} \in A \mid x_{n+1}\} = \int_A r(x_{n+1}, y)\, dy$

for $n = 0, 1, \dots$, with $r : E \times \mathbb R^d \to \mathbb R_+$ a measurable function and $A \in \mathcal B(\mathbb R^d)$, the family of Borel subsets of $\mathbb R^d$.
The Markov process $(x_n)$ is controlled by a sequence $(a_n)$ of $Y_n$-measurable $U$-valued random variables. The best mean square approximation of $x_n$ based on the available observation is given by a filtering process $\pi_n$, defined as a measure valued process such that for $A \in \mathcal E$,

(2)    $\pi_n(A) = P\{x_n \in A \mid Y_n\}$ for $n = 1, 2, \dots$, and $\pi_0(A) = \mu(A)$,

where $\mu$ is the initial law of $(x_n)$.
1991 Mathematics Subject Classification: Primary 93E20; Secondary 93E11.
Key words and phrases: stochastic control, partial observation, long run average cost, Bellman equation.
The following lemma gives the most general formula for $\pi_n$. Its proof, unlike those in [5] and [8], which have more restrictive hypotheses, is not based on the reference probability method.
Lemma 1. Under (1), for $n = 0, 1, \dots$ and $A \in \mathcal E$ we have

(3)    $\pi_{n+1}(A) = \dfrac{\int_A r(z_2, y_{n+1}) \int_E P^{a_n}(z_1, dz_2)\, \pi_n(dz_1)}{\int_E r(z_2, y_{n+1}) \int_E P^{a_n}(z_1, dz_2)\, \pi_n(dz_1)}\,.$
Proof. Denote the right hand side of (3) by $M^{a_n}(y_{n+1}, \pi_n)(A)$. Let $F : (\mathbb R^d)^n \to \mathbb R$ be a bounded measurable function, $Y^n = (y_1, \dots, y_n)$ and $C \in \mathcal B(\mathbb R^d)$. By (1), Fubini's theorem and properties of conditional expectations we have

$\int_\Omega M^{a_n}(y_{n+1}, \pi_n)(A)\, \chi_C(y_{n+1})\, F(Y^n)\, dP$
$= \int_\Omega E[M^{a_n}(y_{n+1}, \pi_n)(A)\, \chi_C(y_{n+1}) \mid x_{n+1}, Y_n]\, F(Y^n)\, dP$
$= \int_\Omega \int_C M^{a_n}(y, \pi_n)(A)\, r(x_{n+1}, y)\, dy\, F(Y^n)\, dP$
$= \int_\Omega \int_C M^{a_n}(y, \pi_n)(A)\, E[E[r(x_{n+1}, y) \mid Y_n, x_n] \mid Y_n]\, dy\, F(Y^n)\, dP$
$= \int_\Omega \int_C M^{a_n}(y, \pi_n)(A)\, E\big[\textstyle\int_E r(z, y)\, P^{a_n}(x_n, dz) \,\big|\, Y_n\big]\, dy\, F(Y^n)\, dP$
$= \int_\Omega \int_C M^{a_n}(y, \pi_n)(A) \int_E \int_E r(z, y)\, P^{a_n}(z_1, dz)\, \pi_n(dz_1)\, dy\, F(Y^n)\, dP$
$= \int_\Omega \int_C \int_A r(z_2, y) \int_E P^{a_n}(z_1, dz_2)\, \pi_n(dz_1)\, dy\, F(Y^n)\, dP$
$= \int_\Omega \int_E \int_A \int_C r(z_2, y)\, dy\, P^{a_n}(z_1, dz_2)\, \pi_n(dz_1)\, F(Y^n)\, dP$
$= \int_\Omega E\big[\textstyle\int_A \int_C r(z_2, y)\, dy\, P^{a_n}(x_n, dz_2) \,\big|\, Y_n\big]\, F(Y^n)\, dP$
$= \int_\Omega E\big[E\big[\textstyle\int_C r(x_{n+1}, y)\, dy\, \chi_A(x_{n+1}) \,\big|\, Y_n, x_n\big] \,\big|\, Y_n\big]\, F(Y^n)\, dP$
$= \int_\Omega \int_C r(x_{n+1}, y)\, dy\, \chi_A(x_{n+1})\, F(Y^n)\, dP$
$= \int_\Omega E[\chi_C(y_{n+1}) \mid Y_n, x_{n+1}]\, \chi_A(x_{n+1})\, F(Y^n)\, dP$
$= \int_\Omega \chi_C(y_{n+1})\, \chi_A(x_{n+1})\, F(Y^n)\, dP = \int_\Omega \pi_{n+1}(A)\, \chi_C(y_{n+1})\, F(Y^n)\, dP.$

Therefore, by the definition of conditional expectation, (3) follows.
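On a finite state space the recursion (3) reduces to a matrix prediction step followed by a pointwise Bayes correction. The sketch below is only an illustration under hypothetical data (the two-state kernel and the Gaussian observation density are not from the paper):

```python
import numpy as np

def obs_density(levels, y, sigma=1.0):
    """Hypothetical r(x, y): observation = level of state x plus N(0, sigma^2) noise."""
    return np.exp(-(y - levels) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

def filter_update(pi, P_a, levels, y):
    """One step of recursion (3): predict with the kernel P^{a_n},
    then correct by the observation density r(., y) and normalize."""
    predicted = pi @ P_a                       # int_E P^{a_n}(z_1, .) pi_n(dz_1)
    numer = obs_density(levels, y) * predicted  # numerator of (3)
    return numer / numer.sum()                 # division by the denominator of (3)

# toy example: two states with observation levels 0 and 1
P_a = np.array([[0.7, 0.3],
                [0.4, 0.6]])
pi1 = filter_update(np.array([0.5, 0.5]), P_a, np.array([0.0, 1.0]), y=0.9)
```

An observation near level 1 shifts the posterior mass towards the second state, as expected from the Bayes correction.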
The class of controls $a_n = u(\pi_n)$, where $u$ is a fixed measurable $U$-valued function, is of special interest. Namely, we have
Lemma 2. Under (1), if additionally $a_n = u(\pi_n)$ with $u$ a fixed measurable function from the space $\mathcal P(E)$ of probability measures on $E$, endowed with the topology of weak convergence, into $(U, \mathcal U)$, then $\pi_n$ is a $Y_n$-Markov process with transition operator

(4)    $\Pi^{u(\nu)}(\nu, F) = \int_E \int_{\mathbb R^d} F(M^{u(\nu)}(y, \nu))\, r(z, y)\, dy \int_E P^{u(\nu)}(z_1, dz)\, \nu(dz_1)$

where

(5)    $M^v(y, \nu)(A) = \dfrac{\int_A r(z, y) \int_E P^v(z_1, dz)\, \nu(dz_1)}{\int_E r(z, y) \int_E P^v(z_1, dz)\, \nu(dz_1)}$

for $v \in U$, $\nu \in \mathcal P(E)$ and $F : \mathcal P(E) \to \mathbb R$ bounded measurable.
Proof. By (1) we easily obtain

$E[F(\pi_{n+1}) \mid Y_n]$
$= E[F(M^{u(\pi_n)}(y_{n+1}, \pi_n)) \mid Y_n]$
$= E[E[F(M^{u(\pi_n)}(y_{n+1}, \pi_n)) \mid Y_n, x_{n+1}] \mid Y_n]$
$= E\big[\textstyle\int_{\mathbb R^d} F(M^{u(\pi_n)}(y, \pi_n))\, r(x_{n+1}, y)\, dy \,\big|\, Y_n\big]$
$= E\big[\textstyle\int_{\mathbb R^d} E[F(M^{u(\pi_n)}(y, \pi_n))\, r(x_{n+1}, y) \mid Y_n, x_n]\, dy \,\big|\, Y_n\big]$
$= E\big[\textstyle\int_{\mathbb R^d} \int_E F(M^{u(\pi_n)}(y, \pi_n))\, r(z, y)\, P^{u(\pi_n)}(x_n, dz)\, dy \,\big|\, Y_n\big]$
$= \int_E \int_{\mathbb R^d} F(M^{u(\pi_n)}(y, \pi_n))\, r(z, y)\, dy \int_E P^{u(\pi_n)}(z_1, dz)\, \pi_n(dz_1)$
$= \Pi^{u(\pi_n)}(\pi_n, F).$

Thus $(\pi_n)$ is Markov with transition operator of the form (4).
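On a finite state space, the operator (4) can be evaluated by quadrature in $y$: the next observation has density $\sum_z r(z, y)\,(\nu P^a)(z)$, and $\Pi^a(\nu, F)$ averages $F$ of the posterior (5) against it. A numerical sketch, with a hypothetical Gaussian observation density and toy matrices:

```python
import numpy as np

def Pi_a(nu, P_a, levels, F, ys, dy, sigma=1.0):
    """Pi^a(nu, F) of (4) by Riemann sum: integrate F(M^a(y, nu)) against
    the law of the next observation under initial law nu and kernel P^a."""
    pred = nu @ P_a                                       # law of x_{n+1}
    total = 0.0
    for y in ys:
        rv = np.exp(-(y - levels) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))
        w = rv * pred                                     # unnormalized posterior, cf. (5)
        total += F(w / w.sum()) * w.sum() * dy            # w.sum() = density of y
    return total

P_a = np.array([[0.7, 0.3], [0.4, 0.6]])
levels = np.array([0.0, 1.0])
ys, dy = np.linspace(-5.0, 6.0, 221, retstep=True)
nu = np.array([0.5, 0.5])
mean_post = Pi_a(nu, P_a, levels, lambda m: m[1], ys, dy)  # E[pi_{n+1}({state 1})]
```

As a sanity check, by the tower property the expected posterior mass of a state equals its predicted mass, here $(\nu P^a)(1) = 0.45$, which the quadrature reproduces.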
In this paper we are interested in minimizing the following long run average cost functional:

(6)    $J_\mu((a_n)) = \limsup_{n \to \infty} n^{-1} E_\mu\Big\{\sum_{i=0}^{n-1} c(x_i, a_i)\Big\}$

over all $U$-valued, $Y_n$-adapted processes $a_n$, with $c : E \times U \to \mathbb R_+$ a given bounded measurable cost function.
By the very definition of a filtering process we have

(7)    $J_\mu((a_n)) = \limsup_{n \to \infty} n^{-1} E_\mu\Big\{\sum_{i=0}^{n-1} \int_E c(z, a_i)\, \pi_i(dz)\Big\}.$
The optimal strategies for the cost functional $J_\mu$ are constructed with the use of a suitable Bellman equation, the solution of which is found as a limit of $w_\beta(x) = \vartheta_\beta(x) - \inf_{z \in E} \vartheta_\beta(z)$ as $\beta \to 1$, where $\vartheta_\beta$ is the value function of the $\beta$-discounted cost functional. Since our limit results are based on compactness arguments, obtained via the Ascoli–Arzelà theorem, in Section 2 we show the continuity of $\vartheta_\beta$. Then in Section 3 we prove the uniform boundedness of $w_\beta$. Using the concavity of $w_\beta$, obtained from the concavity of $\vartheta_\beta$, proved in Section 2, we get equicontinuity of $w_\beta$, which allows us to use the Ascoli–Arzelà theorem.
The discrete time ergodic optimal control problem with partial observation was studied in [1], [2], [3], [6], [8], [9]. In [1] and [8] the observation was corrupted with white noise. In addition, in [1] there was a finite state space and a rich observation structure. In [8] the state space was general but there were some restrictions on controls. The papers [2] and [3] contain a general theory but the fundamental example used is a very simple maintenance-replacement model.

In [6] a model with a finite state space and almost steady state transition probabilities was studied. Finally, finite state space semi-Markov decision processes with a completely observable state were considered in [9]. Our paper generalizes [6] in various directions. Namely, we have a general, compact state space. Although the techniques to show the boundedness and the equicontinuity of $w_\beta$ follow in some sense the arguments of [6], by a more detailed estimation we obtain the results under assumptions which are much less restrictive than the corresponding ones in [6], even when $E$ is finite.
2. Discounted control problem. In this section we characterize the value function $\vartheta_\beta$ of the discounted cost functional $J_\mu^\beta$ defined as follows:

(8)    $J_\mu^\beta((a_n)) \overset{\mathrm{def}}{=} E_\mu\Big\{\sum_{i=0}^\infty \beta^i c(x_i, a_i)\Big\} = E_\mu\Big\{\sum_{i=0}^\infty \beta^i \int_E c(z, a_i)\, \pi_i(dz)\Big\}$

with $\beta \in (0, 1)$.
The theorem below provides a complete solution to the discounted partially observed control problem.

Theorem 1. Assume (1) and

(A1) $c : E \times U \to \mathbb R_+$ is continuous,

(H1) for $F \in C(\mathcal P(E))$, the space of continuous functions on $\mathcal P(E)$, if $\mu_n \Rightarrow \mu$, i.e. $\mu_n$ converges weakly in $\mathcal P(E)$ to $\mu$, we have

(9)    $\sup_{a \in U} |\Pi^a(\mu_n, F) - \Pi^a(\mu, F)| \to 0$ as $n \to \infty$,

(H2) for $F \in C(\mathcal P(E))$, if $U \ni a_n \to a$ we have

(10)    $\Pi^{a_n}(\mu, F) \to \Pi^a(\mu, F)$.

Then

(11)    $\vartheta_\beta(\mu) \overset{\mathrm{def}}{=} \inf_{(a_n)} J_\mu^\beta((a_n))$

is a continuous function of $\mu \in \mathcal P(E)$ and is the unique solution to the Bellman equation

(12)    $\vartheta_\beta(\mu) = \inf_{a \in U}\Big[\int_E c(x, a)\, \mu(dx) + \beta \Pi^a(\mu, \vartheta_\beta)\Big].$

There exists a measurable selector $u_\beta : \mathcal P(E) \to (U, \mathcal U)$ for which the infimum on the right hand side of (12) is attained. Moreover, we have

(13)    $\vartheta_\beta(\mu) = J_\mu^\beta((u_\beta(\pi_n))).$

In addition, $\vartheta_\beta$ can be uniformly approximated from below by the sequence

(14)    $\vartheta_\beta^0(\mu) \equiv 0$,    $\vartheta_\beta^{n+1}(\mu) = \inf_{a \in U}\Big[\int_E c(x, a)\, \mu(dx) + \beta \Pi^a(\mu, \vartheta_\beta^n)\Big]$,

and each $\vartheta_\beta^n$ is concave, i.e. for $\mu, \nu \in \mathcal P(E)$ and $\alpha \in [0, 1]$,

(15)    $\vartheta_\beta^n(\alpha\mu + (1-\alpha)\nu) \ge \alpha\vartheta_\beta^n(\mu) + (1-\alpha)\vartheta_\beta^n(\nu).$
Proof. We only point out the main steps since the proof is more or less standard (for details see [4], Thm. 2.2). Define, for $\vartheta \in C(\mathcal P(E))$,

$T\vartheta(\mu) = \inf_{a \in U}\Big[\int_E c(x, a)\, \mu(dx) + \beta \Pi^a(\mu, \vartheta)\Big].$

By (A1) and (H1), $T$ is a contraction on $C(\mathcal P(E))$. Thus, by the Banach fixed point principle there is a unique fixed point $\vartheta_\beta$ of $T$, which is the unique solution to the Bellman equation (12). Since by (A1) and (H2) the map

$U \ni a \to \int_E c(x, a)\, \mu(dx) + \beta \Pi^a(\mu, \vartheta_\beta)$

is continuous, there exists a measurable selector $u_\beta$. The identity (13) is then almost immediate. Since $T$ is monotonic and contractive, $\vartheta_\beta^n$ is increasing and converges to $\vartheta_\beta$. It remains to show the concavity of $\vartheta_\beta^n$. We prove this by induction. Clearly, $\vartheta_\beta^0 \equiv 0$ is concave. Provided $\vartheta_\beta^n$ is concave, by Jensen's lemma we have for $\alpha \in (0, 1)$,

$\Pi^a(\alpha\mu + (1-\alpha)\nu, \vartheta_\beta^n) \ge \alpha\Pi^a(\mu, \vartheta_\beta^n) + (1-\alpha)\Pi^a(\nu, \vartheta_\beta^n)$

and therefore from (14),

$\vartheta_\beta^{n+1}(\alpha\mu + (1-\alpha)\nu) \ge \alpha\vartheta_\beta^{n+1}(\mu) + (1-\alpha)\vartheta_\beta^{n+1}(\nu),$

i.e. $\vartheta_\beta^{n+1}$ is concave. By induction, $\vartheta_\beta^n$ is concave for each $n$. The proof of the theorem is complete.
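For a two-state example the iteration (14) can be carried out numerically by discretizing $\mathcal P(E) = [0, 1]$ and replacing the $y$-integral in $\Pi^a(\mu, \vartheta)$ by a Riemann sum. A sketch under entirely hypothetical data (kernels, costs, a Gaussian observation density):

```python
import numpy as np

# Hypothetical two-state, two-action data (not from the paper).
P = {0: np.array([[0.8, 0.2], [0.3, 0.7]]),
     1: np.array([[0.6, 0.4], [0.5, 0.5]])}
c = np.array([[1.0, 2.0],    # c(x, a): rows = states, columns = actions
              [3.0, 0.5]])
levels, sigma, beta = np.array([0.0, 1.0]), 1.0, 0.9

grid = np.linspace(0.0, 1.0, 101)             # belief = probability of state 1
ys, dy = np.linspace(-4.0, 5.0, 181, retstep=True)
RV = np.exp(-(ys[:, None] - levels[None, :]) ** 2 / (2 * sigma ** 2)) \
     / (sigma * np.sqrt(2 * np.pi))           # r(x, y) on the quadrature grid

def bellman_step(theta):
    """One application of the operator in (14); theta is tabulated on grid
    and evaluated at posteriors by linear interpolation."""
    new = np.empty_like(theta)
    for j, p in enumerate(grid):
        mu = np.array([1.0 - p, p])
        best = np.inf
        for a in (0, 1):
            pred = mu @ P[a]                  # law of the next state under mu, P^a
            W = RV * pred                     # unnormalized posteriors, cf. (5)
            dens = W.sum(axis=1)              # density of the next observation
            post = W[:, 1] / dens             # posterior probability of state 1
            cont = np.sum(np.interp(post, grid, theta) * dens) * dy
            best = min(best, mu @ c[:, a] + beta * cont)
        new[j] = best
    return new

theta = np.zeros_like(grid)
for _ in range(30):
    theta_prev = theta
    theta = bellman_step(theta)
```

Consistently with Theorem 1, the iterates increase monotonically and stay below $\|c\|/(1-\beta)$.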
Below we formulate sufficient conditions for (H1) and (H2).

Proposition 1. Assume

(A2) $r \in C(E \times \mathbb R^d)$,

(A3) for fixed $a \in U$, $P^a(x, \cdot)$ is Feller, i.e. for any $\varphi \in C(E)$, if $x_n \to x$ we have

(16)    $P^a(x_n, \varphi) \to P^a(x, \varphi)$,

(H3) if $U \ni a_n \to a$, then for each $\varphi \in C(E)$,

(17)    $\sup_{x \in E} |P^{a_n}(x, \varphi) - P^a(x, \varphi)| \to 0$,

(A4) for $R(z, \psi) \overset{\mathrm{def}}{=} \int_{\mathbb R^d} r(z, y)\psi(y)\, dy$, where $\psi \in C(\mathbb R^d)$, if $E \ni z_n \to z$ we have

(18)    $R(z_n, \cdot) \Rightarrow R(z, \cdot)$.

Then (H1) and (H2) are satisfied.
Proof. Notice first that from (16) and (17), if $U \ni a_n \to a$ and $\mu_n \Rightarrow \mu$, we have

(19)    $P^{a_n}(\mu_n, \varphi) \overset{\mathrm{def}}{=} \int_E P^{a_n}(x, \varphi)\, \mu_n(dx) \to P^a(\mu, \varphi)$ as $n \to \infty$, for $\varphi \in C(E)$.

Since $U \times \mathcal P(E)$ is compact, to prove (H1) and (H2) it is sufficient to show that

$U \times \mathcal P(E) \ni (a, \mu) \to \Pi^a(\mu, F)$ is continuous for $F \in C(\mathcal P(E))$.

Therefore we shall show that

(20)    $\Pi^{a_n}(\mu_n, F) \to \Pi^a(\mu, F)$

for $U \ni a_n \to a$, $\mathcal P(E) \ni \mu_n \Rightarrow \mu$ and $F \in C(\mathcal P(E))$. We have

(21)    $|\Pi^{a_n}(\mu_n, F) - \Pi^a(\mu, F)|$
$\le \Big|\int_E \int_{\mathbb R^d} (F(M^{a_n}(y, \mu_n)) - F(M^a(y, \mu)))\, r(z, y)\, dy\, P^{a_n}(\mu_n, dz)\Big|$
$\quad + \Big|\int_E \int_{\mathbb R^d} F(M^a(y, \mu))\, r(z, y)\, dy\, (P^{a_n}(\mu_n, dz) - P^a(\mu, dz))\Big| = I_n + II_n.$

From (19), $II_n \to 0$, provided

(22)    $E \ni z \to \int_{\mathbb R^d} F(M^a(y, \mu))\, r(z, y)\, dy \in C(E).$

By (A4), $\mathbb R^d \ni y \to M^a(y, \mu) \in \mathcal P(E)$ is continuous. Then, again by (A4), the map (22) is continuous, and consequently $II_n \to 0$.

If

(23)    $\sup_{z \in E} \Big|\int_{\mathbb R^d} (F(M^{a_n}(y, \mu_n)) - F(M^a(y, \mu)))\, r(z, y)\, dy\Big| \to 0$

then clearly $I_n \to 0$. By (A4), for each $\varepsilon > 0$ there exists a compact set $K \subset \mathbb R^d$ such that for any $z \in E$,

(24)    $R(z, K^c) < \dfrac{\varepsilon}{2\|F\|}\,.$

Therefore

$\Big|\int_{\mathbb R^d} (F(M^{a_n}(y, \mu_n)) - F(M^a(y, \mu)))\, r(z, y)\, dy\Big| \le \int_K |F(M^{a_n}(y, \mu_n)) - F(M^a(y, \mu))|\, r(z, y)\, dy + \varepsilon$

and to obtain (23) it remains to show that

(25)    $M^{a_n}(y, \mu_n)(\varphi) \to M^a(y, \mu)(\varphi)$ for any $\varphi \in C(E)$, uniformly in $y \in K$.

Using the Stone–Weierstrass approximation theorem (see [7], Thm. 9.28, cf. also the proof of Lemma A.1.2 of [8]) and (19), we obtain

$\Big|\int_E r(z, y)\varphi(z) \int_E P^{a_n}(z_1, dz)\, \mu_n(dz_1) - \int_E r(z, y)\varphi(z) \int_E P^a(z_1, dz)\, \mu(dz_1)\Big|$
$= \Big|\int_E r(z, y)\varphi(z)\, (P^{a_n}(\mu_n, dz) - P^a(\mu, dz))\Big| \to 0$

uniformly in $y \in K$. Thus, we have uniform convergence of the numerators and denominators in the formula defining $M^{a_n}$, and consequently convergence of the ratios, from which (25) follows. The proof of Proposition 1 is complete.
Remark 1. (A4) is satisfied when $\sup_{z \in E} r(z, y)$ is integrable.
Define

(26)    $w_\beta(\nu) = \vartheta_\beta(\nu) - \vartheta_\beta(\mu_\beta)$ and $w_\beta^n(\nu) = \vartheta_\beta^n(\nu) - \vartheta_\beta^n(\mu_\beta^n)$,

where $\mu_\beta = \arg\min \vartheta_\beta$ and $\mu_\beta^n = \arg\min \vartheta_\beta^n$. Clearly, $w_\beta$ is a solution to the equation

(27)    $w_\beta(\nu) + (1-\beta)\vartheta_\beta(\mu_\beta) = \inf_{a \in U}\Big[\int_E c(x, a)\, \nu(dx) + \beta \Pi^a(\nu, w_\beta)\Big]$

and $w_\beta^n(\nu) \to w_\beta(\nu)$ uniformly in $\nu \in \mathcal P(E)$. We would like to let $\beta \uparrow 1$ in (27) and thus obtain a solution $w(\nu)$ to the long run average Bellman equation

(28)    $w(\nu) + \gamma = \inf_{a \in U}\Big[\int_E c(x, a)\, \nu(dx) + \Pi^a(\nu, w)\Big].$

Since we wish to apply the Ascoli–Arzelà theorem, we have to show the boundedness and the equicontinuity of $w_\beta$ for $\beta \in (0, 1)$, which are studied successively in the next sections.
3. Boundedness of $w_\beta$. We make the following assumption:

(29)    (A5)    $\inf_{z, z' \in E}\ \inf_{a, a' \in U}\ \inf_{C \in \mathcal E,\ P^a(z, C) > 0} \dfrac{P^{a'}(z', C)}{P^a(z, C)} \overset{\mathrm{def}}{=} \lambda > 0.$
We have

Proposition 2. Under (A5) and the assumptions of Theorem 1, the functions $w_\beta(\nu)$ are uniformly bounded for $\beta \in (0, 1)$, $\nu \in \mathcal P(E)$.
Proof. We improve the proof of Theorem 2 of [6]. Namely, we show by induction the uniform boundedness of $w_\beta^n(\nu)$ for $\nu \in \mathcal P(E)$, $\beta \in (0, 1)$, $n = 0, 1, \dots$ For $n = 0$, $w_\beta^0(\nu) \equiv 0$.

Assume that for any $\beta \in (0, 1)$, $\nu \in \mathcal P(E)$, $w_\beta^n(\nu) \le L$, where $L \ge \|c\|\lambda^{-2}$. Let $a, a' \in U$ be such that for fixed $\nu \in \mathcal P(E)$,

(30)    $w_\beta^{n+1}(\nu) = \int_E c(x, a)\, \nu(dx) - \int_E c(x, a')\, \mu_\beta^{n+1}(dx) + \beta\big[\Pi^a(\nu, \vartheta_\beta^n) - \Pi^{a'}(\mu_\beta^{n+1}, \vartheta_\beta^n)\big].$

For $y \in \mathbb R^d$, define

$m(y)(B) = M^{a'}(y, \mu_\beta^{n+1})(B) - \lambda^2 M^a(y, \nu)(B)$ for any $B \in \mathcal E$.
By (29) we have

$\int_B r(z, y) \int_E P^{a'}(z_1, dz)\, \mu_\beta^{n+1}(dz_1) \ge \lambda \int_B r(z, y) \int_E P^a(z_1, dz)\, \nu(dz_1)$
$= \lambda M^a(y, \nu)(B) \int_E r(z, y) \int_E P^a(z_1, dz)\, \nu(dz_1)$
$\ge \lambda^2 M^a(y, \nu)(B) \int_E r(z, y) \int_E P^{a'}(z_1, dz)\, \mu_\beta^{n+1}(dz_1)$

and therefore $m(y)(B) \ge 0$ for $B \in \mathcal E$.

If $\lambda = 1$ we have a stationary, noncontrolled Markov chain with $P^a(z, C) = \eta(C)$ for any $a \in U$, $z \in E$ and some fixed $\eta \in \mathcal P(E)$, and consequently $w_\beta^n \equiv 0$ for any $n = 0, 1, \dots$ Therefore we restrict ourselves to the case $\lambda < 1$. Then $(1 - \lambda^2)^{-1} m(y) \in \mathcal P(E)$. Since

$M^{a'}(y, \mu_\beta^{n+1}) = \lambda^2 M^a(y, \nu) + (1 - \lambda^2)\big[(1 - \lambda^2)^{-1} m(y)\big],$

by concavity of $\vartheta_\beta^n$ we obtain

(31)    $\vartheta_\beta^n(M^{a'}(y, \mu_\beta^{n+1})) \ge \lambda^2 \vartheta_\beta^n(M^a(y, \nu)) + (1 - \lambda^2)\,\vartheta_\beta^n((1 - \lambda^2)^{-1} m(y))$

and from (30) we have
(32)    $w_\beta^{n+1}(\nu) \le \|c\| + \beta \int_E \int_{\mathbb R^d} \vartheta_\beta^n(M^a(y, \nu))\, r(z, y)\, dy\, \Big[\int_E P^a(z_1, dz)\, \nu(dz_1) - \lambda^2 \int_E P^{a'}(z_1, dz)\, \mu_\beta^{n+1}(dz_1)\Big]$
$\qquad - \beta(1 - \lambda^2) \int_E \int_{\mathbb R^d} \vartheta_\beta^n((1 - \lambda^2)^{-1} m(y))\, r(z, y)\, dy \int_E P^{a'}(z_1, dz)\, \mu_\beta^{n+1}(dz_1)$
$= \|c\| + \beta \int_E \int_{\mathbb R^d} \big(\vartheta_\beta^n(M^a(y, \nu)) - \vartheta_\beta^n(\mu_\beta^n)\big)\, r(z, y)\, dy\, \Big[\int_E P^a(z_1, dz)\, \nu(dz_1) - \lambda^2 \int_E P^{a'}(z_1, dz)\, \mu_\beta^{n+1}(dz_1)\Big]$
$\qquad - \beta(1 - \lambda^2) \int_E \int_{\mathbb R^d} \big(\vartheta_\beta^n((1 - \lambda^2)^{-1} m(y)) - \vartheta_\beta^n(\mu_\beta^n)\big)\, r(z, y)\, dy \int_E P^{a'}(z_1, dz)\, \mu_\beta^{n+1}(dz_1)$
$\le \|c\| + \beta L\, \mathrm{var}\Big[\int_E P^a(z_1, \cdot)\, \nu(dz_1) - \lambda^2 \int_E P^{a'}(z_1, \cdot)\, \mu_\beta^{n+1}(dz_1)\Big].$
By (A5), for any $B \in \mathcal E$,

(33)    $\int_E P^a(z_1, B)\, \nu(dz_1) \ge \lambda^2 \int_E P^{a'}(z_1, B)\, \mu_\beta^{n+1}(dz_1).$

Thus

(34)    $w_\beta^{n+1}(\nu) \le \|c\| + \beta L(1 - \lambda^2) \le L$

and the bound $L$ is independent of $\nu \in \mathcal P(E)$, $\beta \in (0, 1)$. By induction $w_\beta^n(\nu) \le L$ for any $\nu \in \mathcal P(E)$, $n = 0, 1, \dots$, $\beta \in (0, 1)$. Since by the very definition $w_\beta^n(\nu) \ge 0$, and for each $\beta$, $w_\beta^n(\nu) \to w_\beta(\nu)$ as $n \to \infty$, we finally obtain $w_\beta(\nu) \le L$ for $\nu \in \mathcal P(E)$ and $\beta \in (0, 1)$.
Remark 2. One can easily see that in the case of a finite state space $E$, the assumption

(35)    (A5$'$)    $\inf_{z, z' \in E}\ \inf_{a, a' \in U}\ \inf_{x \in E,\ P^a(z, x) > 0} \dfrac{P^{a'}(z', x)}{P^a(z, x)} > 0$

also implies the boundedness of $w_\beta$. Thus Proposition 2 significantly improves Theorem 2 of [6]. This was possible because of the choice of $\mu_\beta^n$ in (32) as the argument of minimum of $\vartheta_\beta^n$.
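In the finite-state case the constant in (A5$'$) is a finite minimum of ratios of transition probabilities, so it can be checked directly. A sketch with hypothetical two-state kernels:

```python
import numpy as np

def lam(kernels):
    """The constant of (35): inf over z, z', a, a' and states x with
    P^a(z, x) > 0 of the ratio P^{a'}(z', x) / P^a(z, x)."""
    lo = np.inf
    for Pa in kernels:                 # transition matrix of control a
        for Pb in kernels:             # transition matrix of control a'
            for row_a in Pa:           # row = initial state z
                for row_b in Pb:       # row = initial state z'
                    mask = row_a > 0
                    lo = min(lo, float(np.min(row_b[mask] / row_a[mask])))
    return lo

# hypothetical kernels for two controls
kernels = [np.array([[0.5, 0.5], [0.4, 0.6]]),
           np.array([[0.6, 0.4], [0.55, 0.45]])]
```

Here `lam(kernels)` equals $0.4/0.6 = 2/3 > 0$, so (A5$'$) holds; a zero entry facing a positive one in another row would force the constant to be $0$, violating the assumption.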
Remark 3. Assumption (A5) says that the transition probabilities for different controls and initial states are mutually equivalent, with Radon–Nikodym density bounded away from 0. In particular, in the case when $P^a(z, C) = \int_C g^a(z, x)\, \eta(dx)$, the assumption

(36)    $\inf_{z, z' \in E}\ \inf_{a, a' \in U}\ \inf_{x \in E,\ g^a(z, x) > 0} \dfrac{g^{a'}(z', x)}{g^a(z, x)} > 0$

is sufficient for (A5) to be satisfied.
4. Main theorem. Before we formulate and prove our main result, we show the equicontinuity of $w_\beta$ for $\beta \in (0, 1)$. For this purpose we need an extra assumption:

(A6) If $\mathcal P(E) \ni \mu_n \Rightarrow \mu \in \mathcal P(E)$ then

$\sup_{a \in U}\ \sup_{C \in \mathcal E} |P^a(\mu_n, C) - P^a(\mu, C)| \to 0$

with

$P^a(\mu, C) \overset{\mathrm{def}}{=} \int_E P^a(x, C)\, \mu(dx).$

We have
Proposition 3. Under (A5), (A6) and the assumptions of Theorem 1, the family of functions $w_\beta$, $\beta \in (0, 1)$, is equicontinuous, i.e.

(37)    $\forall_{\varepsilon > 0}\ \exists_{\delta > 0}\ \forall_{\mu, \mu' \in \mathcal P(E)}\ \varrho(\mu, \mu') < \delta \ \Rightarrow\ \forall_{\beta \in (0,1)}\ |w_\beta(\mu) - w_\beta(\mu')| < \varepsilon$

with $\varrho$ standing for a metric compatible with the weak convergence topology of $\mathcal P(E)$.
Proof. For $\nu, \mu \in \mathcal P(E)$ let

(38)    $\lambda(\nu, \mu) \overset{\mathrm{def}}{=} \inf_{a \in U}\ \inf_{C \in \mathcal E,\ P^a(\mu, C) > 0} \dfrac{P^a(\nu, C)}{P^a(\mu, C)}\,.$

From (A5) and (A6), if $\nu \Rightarrow \mu$, then

(39)    $\lambda(\nu, \mu) \to 1$ and $\lambda(\mu, \nu) \to 1$.

By (27), for $\nu, \mu \in \mathcal P(E)$ we have

(40)    $w_\beta(\nu) - w_\beta(\mu) \le \sup_{a \in U} \int_E c(x, a)(\nu(dx) - \mu(dx)) + \beta \sup_{a \in U}\big(\Pi^a(\nu, w_\beta) - \Pi^a(\mu, w_\beta)\big).$
By analogy with the proof of Proposition 2 define

$m^a(y, \mu, \nu)(B) = M^a(y, \mu)(B) - \lambda(\mu, \nu)\lambda(\nu, \mu)\, M^a(y, \nu)(B)$ for $B \in \mathcal E$.

Clearly, $m^a(y, \mu, \nu)(B) \ge 0$ for $B \in \mathcal E$, and $\lambda(\mu, \nu)\lambda(\nu, \mu) \le 1$. If $\lambda(\mu, \nu)\lambda(\nu, \mu) = 1$, then $w_\beta \equiv 0$ for $\beta \in (0, 1)$, and consequently the equicontinuity property is satisfied. Therefore assume $\lambda^2 = \lambda(\mu, \nu)\lambda(\nu, \mu) < 1$. Then by the concavity of $w_\beta$,

(41)    $w_\beta(M^a(y, \mu)) \ge \lambda^2 w_\beta(M^a(y, \nu)) + (1 - \lambda^2)\, w_\beta((1 - \lambda^2)^{-1} m^a(y, \mu, \nu)).$
From (40),

(42)    $w_\beta(\nu) - w_\beta(\mu) \le \sup_{a \in U} \int_E c(x, a)(\nu(dx) - \mu(dx))$
$\quad + \beta \sup_{a \in U}\Big\{\int_E \int_{\mathbb R^d} w_\beta(M^a(y, \nu))\, r(z, y)\, dy\, (P^a(\nu, dz) - \lambda^2 P^a(\mu, dz))$
$\qquad + \int_E \int_{\mathbb R^d} \big(\lambda^2 w_\beta(M^a(y, \nu)) - w_\beta(M^a(y, \mu))\big)\, r(z, y)\, dy\, P^a(\mu, dz)\Big\}$
$= \mathrm I + \mathrm{II} + \mathrm{III}.$

Now

(43)    $\mathrm{II} \le 2\|w_\beta\| \sup_{a \in U} \sup_{B \in \mathcal E} |P^a(\nu, B) - \lambda^2 P^a(\mu, B)| = 2\|w_\beta\|(1 - \lambda(\mu, \nu)\lambda(\nu, \mu))$

and using (41) and the nonnegativity of $w_\beta$ we have

(44)    $\mathrm{III} \le \sup_{a \in U} \int_E \int_{\mathbb R^d} (\lambda^2 - 1)\, w_\beta((1 - \lambda^2)^{-1} m^a(y, \mu, \nu))\, r(z, y)\, dy\, P^a(\mu, dz) \le 0.$
Interchanging $\nu$ and $\mu$ in (40)–(44) we obtain the same estimates and therefore

(45)    $|w_\beta(\nu) - w_\beta(\mu)| \le \sup_{a \in U} \Big|\int_E c(x, a)(\nu(dx) - \mu(dx))\Big| + 2\|w_\beta\|(1 - \lambda(\mu, \nu)\lambda(\nu, \mu)).$

Since by the Stone–Weierstrass theorem (Thm. 9.28 of [7]) $c(x, a)$ can be uniformly approximated on $E \times U$ by continuous functions of the form $\sum_{i=1}^r c_i(x) d_i(a)$, from (39) we obtain

$\lim_{\nu \Rightarrow \mu}\ \sup_{\beta \in (0,1)} |w_\beta(\nu) - w_\beta(\mu)| = 0.$

Let us comment on the assumption (A6):
Remark 4. (H3) clearly follows from (A6).
Remark 5. In the case of a finite state space $E = \{1, \dots, N\}$, (A6) can be written as

(46)    $\sup_{a \in U} \sum_{k=1}^N \Big|\sum_{i=1}^N (s_i^n - s_i) P^a(i, k)\Big| \to 0$

for $s^n = (s_1^n, \dots, s_N^n) \to s = (s_1, \dots, s_N)$, $0 \le s_i^n \le 1$, $0 \le s_i \le 1$, $\sum s_i^n = 1$, $\sum s_i = 1$, and this is satisfied since

$\sup_{a \in U} \sum_{k=1}^N \Big|\sum_{i=1}^N (s_i^n - s_i) P^a(i, k)\Big| \le \sum_{i=1}^N |s_i^n - s_i| \to 0$ as $s^n \to s$.

Remark 6. Assume $P^a(z, C) = \int_C g^a(z, x)\, \eta(dx)$ for $C \in \mathcal E$ and that the mapping

(47)    $U \times E \times E \ni (a, z, x) \to g^a(z, x)$ is continuous.

Then (A6) is satisfied. In fact, by the Stone–Weierstrass theorem we can approximate $g^a$ uniformly on $U \times E \times E$ by continuous functions of the form $\sum_{i=1}^k b_i(a) c_i(z) d_i(x)$, and

$\sup_{a \in U} \sup_{C \in \mathcal E} |P^a(\mu_n, C) - P^a(\mu, C)|$
$\le \sup_{a \in U} \int_E \Big|\int_E g^a(z, x)(\mu_n(dz) - \mu(dz))\Big|\, \eta(dx)$
$\le \varepsilon + \sum_{i=1}^k \sup_{a \in U} |b_i(a)| \int_E |d_i(x)|\, \eta(dx)\, \Big|\int_E c_i(z)(\mu_n(dz) - \mu(dz))\Big| \to \varepsilon$ as $n \to \infty$.

Since $\varepsilon > 0$ was arbitrary, (A6) follows.
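The estimate in Remark 5 is just the contraction property of a stochastic matrix in the $\ell^1$ (total variation) norm, which is easy to check numerically. A sketch with hypothetical random kernels and belief vectors:

```python
import numpy as np

rng = np.random.default_rng(0)

def lhs46(s_n, s, kernels):
    """Left hand side of (46): sup over a of sum_k |sum_i (s^n_i - s_i) P^a(i, k)|."""
    d = s_n - s
    return max(float(np.abs(d @ Pa).sum()) for Pa in kernels)

def random_simplex(n):
    """A random probability vector (hypothetical test data)."""
    v = rng.random(n)
    return v / v.sum()

N = 4
kernels = [rng.dirichlet(np.ones(N), size=N) for _ in range(3)]   # 3 controls
s_n, s = random_simplex(N), random_simplex(N)
bound = float(np.abs(s_n - s).sum())                              # sum_i |s^n_i - s_i|
```

The value `lhs46(s_n, s, kernels)` never exceeds `bound`, and it vanishes as $s^n \to s$, in agreement with (46).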
Now we can prove our main result:

Theorem 2. Assume (A1)–(A6). Then there exist $w \in C(\mathcal P(E))$ and a constant $\gamma$ which solve the Bellman equation

(48)    $w(\mu) + \gamma = \inf_{a \in U}\Big[\int_E c(x, a)\, \mu(dx) + \Pi^a(\mu, w)\Big].$

Moreover, there exists $u : \mathcal P(E) \to U$ for which the infimum on the right hand side of (48) is attained. The strategy $a_n = u(\pi_n)$ is optimal for $J_\mu$ and

(49)    $J_\mu((u(\pi_n))) = \gamma.$
Proof. By Theorem 1, each $\vartheta_\beta^n$ is concave. Therefore $w_\beta^n$ is concave and $w_\beta$, as the limit of $w_\beta^n$, is also concave. Since by Proposition 2 the $w_\beta$ are uniformly bounded, and by Proposition 3 equicontinuous, by the Ascoli–Arzelà theorem the family $w_\beta$, $\beta \in (0, 1)$, is relatively compact in $C(\mathcal P(E))$. Moreover, $|(1 - \beta)\vartheta_\beta(\mu_\beta)| \le \|c\|$. Therefore one can choose a subsequence $\beta_k \to 1$ such that

$(1 - \beta_k)\vartheta_{\beta_k}(\mu_{\beta_k}) \to \gamma$ and $w_{\beta_k} \to w$ in $C(\mathcal P(E))$ as $k \to \infty$.

Letting $\beta_k \to 1$ in (27) we obtain (48). The remaining assertion of the theorem follows easily from Theorem 3.2.2 of [4].
References

[1] G. B. Di Masi and L. Stettner, On adaptive control of a partially observed Markov chain, Applicationes Math., to appear.
[2] E. Fernandez-Gaucherand, A. Arapostathis and S. J. Marcus, Adaptive control of a partially observed controlled Markov chain, in: Stochastic Theory and Adaptive Control, T. E. Duncan and B. Pasik-Duncan (eds.), Lecture Notes in Control and Inform. Sci. 184, Springer, 1992, 161–171.
[3] —, —, —, On partially observable Markov decision processes with an average cost criterion, Proc. 28th CDC, Tampa, Florida, 1989, 1267–1272.
[4] O. Hernandez-Lerma, Adaptive Markov Control Processes, Springer, New York, 1989.
[5] H. Korezlioglu and G. Mazziotto, Estimation récursive en transmission numérique, Proc. Neuvième Colloque sur le Traitement du Signal et ses Applications, Nice, 1983.
[6] M. Kurano, On the existence of an optimal stationary J-policy in non-discounted Markovian decision processes with incomplete state information, Bull. Math. Statist. 17 (1977), 75–81.
[7] H. L. Royden, Real Analysis, Macmillan, New York, 1968.
[8] W. J. Runggaldier and L. Stettner, Nearly optimal controls for stochastic ergodic problems with partial observation, SIAM J. Control Optim. 31 (1993), 180–218.
[9] K. Wakuta, Semi-Markov decision processes with incomplete state observation—average cost criterion, J. Oper. Res. Soc. Japan 24 (1981), 95–108.
LUKASZ STETTNER
INSTITUTE OF MATHEMATICS POLISH ACADEMY OF SCIENCES P.O. BOX 137
00-950 WARSZAWA, POLAND
E-mail: STETTNER@IMPAN.IMPAN.GOV.PL
Received on 22.6.1992