
Faculty of Mathematics and Informatics
Institute of Mathematics

Dawid Tarłowski

Nonautonomous Dynamical Systems in Stochastic Global Optimization

PhD thesis
written under the supervision of prof. dr hab. Jerzy Ombach

Kraków 2014


Contents

1. Introduction
2. Weak convergence of Borel probability measures
3. Some Equivalences for Global Convergence
4. Main Result and Consequences
5. Weak Convergence of Borel Probability Measures
6. Some Concepts of Dynamical Systems in Metric Spaces
7. Proof of Main Result
8. Grenade Explosion Method
9. The Evolution Strategy (µ/ρ + λ)
10. Simulated Annealing
11. Accelerated Random Search
12. Appendix
References

1. Introduction

Let (A, d) be a separable metric space and let f : A → R be a continuous function attaining its global minimum min f. We assume that min f = 0. Let

A* = {x ∈ A : f(x) = 0}.

In the context of optimization the function f is often called a problem function and the elements of A* are often called the solutions of the global minimization problem. There are many iterative numerical techniques designed for finding an element of A*. Under some assumptions on the function f, such as differentiability, deterministic optimization techniques [24] can be used for solving optimization problems. However, global minima are usually hard to locate. Stochastic optimization techniques [43, 42, 33] usually do not depend on the smoothness of the function. At the same time, if properly configured, these methods can be very effective in finding global minima. There is a great variety of available techniques, among them genetic and evolutionary algorithms [38, 37, 7, 36], Simulated Annealing (SA) [6, 41, 21, 3] and swarm intelligence algorithms like Particle Swarm Optimization (PSO) [12, 11], Artificial Bee Colony (ABC) [17] or Ant Colony Optimization (ACO) [15]. Grenade Explosion Method (GEM) [2, 1, 30] is a technique proposed quite recently. Many algorithms are new variants or combinations of other methods. Accelerated Random Search (ARS) [5] can be viewed as a modification of Pure Random Search (PRS), while Random Multistart algorithms [43] combine global random search and local deterministic techniques.

All the above mentioned iterative optimization techniques (except for some specific modifications; for instance, non-Markovian versions of Random Multistart are presented in [22]), and many other optimization methods, can be represented as discrete-time inhomogeneous Markov processes of the form

(1.1) xt+1 = Tt(xt, yt), for t = 0, 1, 2, . . . ,

where xt represents the sequence of states successively transformed by the algorithm, yt is the sequence of independently sampled points which represents the probability distributions of the algorithm, and Tt stands for the sequence of the deterministic "methods" of the algorithm. The main aim of this thesis is to provide a general theoretical framework for the study of the convergence of such optimization processes under conditions that can be verified in practice. Some applications are presented.
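To make the representation (1.1) concrete, the recursion can be sketched in a few lines of Python. This is an illustrative toy, not part of the thesis: the names (run_recursion, sample_y) are hypothetical, and the chosen Tt is an elitist rule which keeps the better of the current state and the sampled point, so f(xt) is nonincreasing.

```python
import random

def run_recursion(T_of_t, sample_y, x0, steps, seed=0):
    """Iterate the recursion x_{t+1} = T_t(x_t, y_t) from (1.1).

    T_of_t(t) returns the deterministic "method" T_t of step t;
    sample_y(t, rng) draws the independent sample y_t.
    """
    rng = random.Random(seed)
    x = x0
    for t in range(steps):
        y = sample_y(t, rng)       # independently sampled point y_t
        x = T_of_t(t)(x, y)        # deterministic update T_t
    return x

# Hypothetical instance: minimize f(x) = x^2 over A = [-1, 1].
# Every T_t keeps the better of the two points, so f(x_t) is nonincreasing.
f = lambda x: x * x
T_of_t = lambda t: (lambda x, y: x if f(x) <= f(y) else y)
sample_y = lambda t, rng: rng.uniform(-1.0, 1.0)

x_final = run_recursion(T_of_t, sample_y, x0=1.0, steps=2000)
```

Time dependence enters only through T_of_t and sample_y, so genuinely nonautonomous schemes (e.g. shrinking sampling radii) fit the same skeleton.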

We refer to [10] for the general study of processes of the form (1.1). As stated there, every Markov chain on a separable metric space has such a representation. Recursions of the form (1.1) have been studied for various purposes, including optimization, iterated function systems (IFS), fractals, control theory and other applications. Many examples, which correspond to the time-homogeneous situation, are given in [23, 13, 16, 19]. These references focus mainly on the classical problem regarding the convergence of processes (1.1), namely how to prove convergence to the unique stationary distribution.

This thesis is a continuation of papers [25, 26, 34, 27, 28, 29] which deal with the problem of how to prove that the process given by equation (1.1) converges towards A*. The general approach to this problem remains the same: equation (1.1) induces a nonautonomous dynamical system which acts on the metric space M(A) of Borel probability measures on A and is determined by the family of Foias operators

P(Tt,νt) : M(A) ∋ µ → P(Tt,νt)µ ∈ M(A)

corresponding to equation (1.1). They are given by

P(Tt,νt)µ(C) = (µ × νt)(Tt⁻¹(C)), for C ∈ B(A),

where νt is the probability distribution of yt. The goal is to prove the global attractiveness of the set M* = {µ ∈ M(A) : µ(A*) = 1}, which corresponds to the global convergence of the algorithm. It is not assumed that the Foias operators are continuous (the continuity is equivalent to the Feller property, see [23], [19]) and a weaker assumption is used instead. Thus, the dynamical system corresponding to equation (1.1) is in fact a pseudo-dynamical system, see [31]. In the proof of the main result, Theorem 4.3, the Lyapunov function technique is used, and we consider the function given by V(µ) = ∫ f dµ, µ ∈ M. The previous papers work under the assumption

∫_B f(Tt(x, y)) νt(dy) ≤ f(x), x ∈ A,

which is equivalent to

E(f(Xt+1) | Xt = x) ≤ f(x), x ∈ A,

and, in fact, aim at the class of methods for which the sequence f(Xt) is a supermartingale. Under the above inequality the Lyapunov function V satisfies V(µt+1) ≤ V(µt), where µt is the trajectory of the algorithm given by µt+1 = P(Tt,νt)µt. Lyapunov functions arise quite naturally as a tool ensuring stability of the process Xt in various contexts, see for example


Chapter VIII in [4]. The present thesis concerns the case in which the above supermartingale inequality is replaced with a softer condition. Under the assumptions of this thesis the function V is not necessarily monotonically decreasing along trajectories µt, but it satisfies V(µt+1) ≤ V(µt) + εt, where εt → 0. This is not a typical use of a Lyapunov function; however, it can still be shown that under some additional convergence assumptions V goes to zero along trajectories. This approach provides sufficient convergence conditions for a class of methods significantly wider than the class of supermartingales. The well-known Simulated Annealing algorithm, which will be analysed in Section 10, is an example of a non-supermartingale method.
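The non-supermartingale behaviour can be illustrated with a toy Metropolis-type Simulated Annealing step (an illustrative sketch, not the scheme analysed in Section 10; the cooling schedule and all names are assumptions): uphill moves are accepted with probability exp((f(x) − f(y))/c), so f(Xt) may increase, but the allowed increase vanishes as the temperature c goes to 0.

```python
import math
import random

def sa_step(x, f, c, rng, radius=0.5):
    """One toy Simulated Annealing step at temperature c: propose a nearby
    point, accept it always if downhill, with prob. exp(-increase/c) if uphill."""
    y = x + rng.uniform(-radius, radius)
    if f(y) <= f(x) or rng.random() < math.exp((f(x) - f(y)) / c):
        return y
    return x

f = lambda x: x * x
rng = random.Random(3)
x, uphill_moves, best = 2.0, 0, f(2.0)
for t in range(1, 5001):
    c = 1.0 / math.log(t + 1)      # hypothetical cooling schedule c_t -> 0
    x_new = sa_step(x, f, c, rng)
    uphill_moves += f(x_new) > f(x)   # counts the non-monotone steps
    best = min(best, f(x_new))
    x = x_new
```

Here f(Xt) is not a supermartingale (uphill moves do occur), yet the chain still locates the neighbourhood of the minimizer.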

The main result of this thesis is Theorem 4.3, which presents sufficient conditions for the convergence of the process Xt towards A*. The methodology behind it is based on the above mentioned topological approach to stochastic optimization. The weak convergence topology on M is considered in the proof, the Lyapunov function V is used, and it is shown that the probability distributions µt of Xt are weakly convergent towards M*. The classical general convergence results are usually based on classical probability theory [40], [32]. In the case of non-supermartingale type methods, Markov chain theory is sometimes used for the convergence analysis; see for example [38] or [3].

This thesis is organized as follows. Section 2 presents basic ideas of the weak convergence of probability measures and proves Observation 2.1, which will be used in the next section. Section 3 presents the general equivalences between basic types of stochastic convergence of random variables in the context of global optimization. It generalizes the classical results stated in [32]. Section 4 presents and discusses the main result, Theorem 4.3. In particular, it is shown that the results of previous papers, including [27] and [28], are consequences of this general result. At the end of Section 4, Theorem 4.10, an additional result which regards stability type properties of Xt, is presented. Section 5 prepares the necessary tools of the weak convergence of probability measures and Section 6 prepares the tools of dynamical systems theory. Section 7 then presents the proofs of Theorem 4.3 and Theorem 4.10. The next four sections present applications of the theorems from Section 4 to the following optimization methods: Grenade Explosion Method, the Elitist Evolution Strategy, Simulated Annealing and Accelerated Random Search. The results concerning the Grenade Explosion Method and the Evolution Strategy are recalled from paper [28]. Finally, the Appendix recalls basic definitions and facts from probability theory and presents the mathematical notation used in the thesis.

Acknowledgments. I would like to thank Professor Jerzy Ombach for bringing my attention to the subject of this thesis and for the last four years of inspiration and motivation to work.

The project was supported by a National Science Centre grant based on Decision DEC-2013/09/N/ST/04262.


2. Weak convergence of Borel probability measures

In this section we recall some basic facts about weak convergence of Borel probability measures and then prove Observation 2.1, which will be used in the next section. More details on weak convergence can be found, for example, in [9, 14, 39].

Let (S, dS) be a separable metric space, let B(S) denote the sigma-algebra of Borel subsets of S and let M = M(S) denote the space of Borel probability measures on S. A sequence µn ∈ M(S) weakly converges to some µ ∈ M(S) iff any of the following equivalent conditions is satisfied:

• for any bounded continuous function h : S → R

(2.1) ∫_S h dµn → ∫_S h dµ, as n → ∞,

• for any upper semi-continuous function h : S → R bounded from above

(2.2) lim sup_{n→∞} ∫_S h dµn ≤ ∫_S h dµ.

As S is separable, the topology of weak convergence on M(S) is metrizable. One of the accessible metrics is

dM(ν1, ν2) = inf{ε > 0 : ν1(D) ≤ ν2(D(ε)) + ε for any Borel set D},

where D(ε) = {x ∈ S : dS(x, y) < ε for some y ∈ D}. The metric dM is called the Prohorov metric, or sometimes the Lévy–Prohorov metric. It is a simple observation that the mapping

S ∋ s → δs ∈ M(S),

where δs denotes the Dirac measure concentrated on the point s ∈ S, is continuous and injective.

The weak topology on M(S) is separable. If S0 is a countable set dense in S, then the set

M0 = { Σ_{i=1}^n pi δ_{si} : n ∈ N; si ∈ S0; pi ∈ Q ∩ [0, ∞); Σ_{i=1}^n pi = 1 }

is dense in M(S) (and countable). If we assume that the metric space S is compact (thus separable), then M(S) is compact.

Observation 2.1 expresses the stochastic convergence of a sequence of random variables (which are distributed according to some sequence µn ∈ M(S)) to a Borel set D ⊂ S in terms of the weak convergence of their probability distributions towards the set M*(D) of probability distributions concentrated on D.


Observation 2.1. Let D ∈ B(S) and M*(D) = {µ ∈ M | µ(D) = 1}. For any sequence µn ∈ M(S) we have

dM(µn, M*(D)) → 0 ⇐⇒ ∀ε > 0 µn(D(ε)) → 1.

Proof. Assume that dM(µn, M*(D)) → 0, which means that for some sequence mn ∈ M*(D) we have dM(µn, mn) → 0. From the definition of the Prohorov metric we have:

(2.3) dM(µn, mn) ≥ inf{ε > 0 : µn(D(ε)) + ε ≥ mn(D) = 1}.

Fix ε > 0. Because dM(µn, mn) → 0, there is n0 such that for any n > n0 we have µn(D(ε)) + ε ≥ 1. Again by (2.3) and dM(µn, mn) → 0, for any positive natural k there is nk > n0 such that for any n > nk

µn(D(ε)) + ε/k ≥ µn(D(ε/k)) + ε/k ≥ 1.

In consequence, µn(D(ε)) → 1 as n → ∞, which finishes the first part of the proof as ε > 0 was chosen arbitrarily.

Assume now that µn(D(ε)) → 1 for every ε > 0. Fix ε > 0. Without loss of generality we assume that µn(D(ε)) > 0, n ∈ N. Let ¯µn ∈ M*(D(ε)) be defined by

¯µn(C) = µn(C ∩ D(ε)) / µn(D(ε)), C ∈ B(S).

As µn(D(ε)) → 1, it is a simple observation that dM(µn, ¯µn) → 0. Hence, as ε > 0 is chosen arbitrarily, to finish the proof it is enough to find a sequence mn ∈ M*(D) with lim sup_{n→∞} dM(mn, ¯µn) ≤ 2ε. Because the convex combinations of Dirac measures concentrated on elements of D(ε) are dense in M*(D(ε)), there are probability measures rn ∈ M*(D(ε)), n ∈ N, of the form

rn = Σ_{i=1}^{tn} p_i^n δ_{s_i^n}, where s_i^n ∈ D(ε),

such that dM(rn, ¯µn) ≤ ε. For any s_i^n let d_i^n ∈ D be a point with dS(s_i^n, d_i^n) < ε. Now we can define mn = Σ_{i=1}^{tn} p_i^n δ_{d_i^n}. It is a simple observation that dM(rn, mn) ≤ ε; in fact, for any C ∈ B(S), if s_i^n ∈ C, then d_i^n ∈ C(ε) and hence rn(C) ≤ mn(C(ε)) ≤ mn(C(ε)) + ε. We thus have

lim sup_{n→∞} dM(mn, ¯µn) ≤ lim sup_{n→∞} (dM(mn, rn) + dM(rn, ¯µn)) ≤ 2ε.

Letting ε → 0 we finish the proof. □

Remark 2.2. In the case D = {d} we have M*(D) = {δd} and it is easy to see that Observation 2.1 generalizes the well-known equivalence between weak convergence and convergence in probability of random variables towards a one-point distributed limit.


3. Some Equivalences for Global Convergence

This section presents the relations between basic types of stochastic global convergence. Without loss of generality we assume that we deal with the global minimization problem. The presented results are simple but they generalize many observations stated in the literature and will be useful in further sections. In particular, Theorem 3.6 presents the general equivalences between various global convergence modes in the class of optimization methods with the supermartingale property.

Let (A, d) be a separable metric space and f : A → R be a Borel measurable function having its global minimum f* = 0. We denote

(1) A* = {x ∈ A : f(x) = 0},
(2) Aδ = {x ∈ A : f(x) ≤ δ}, where δ > 0,
(3) A(δ) = {x ∈ A : f(x) < δ}, where δ > 0,
(4) A*(ε) = {x ∈ A : d(x, A*) < ε}, where ε > 0 and d(x, A*) = inf_{a∈A*} d(x, a).

Let (Ω, Σ, P) be a probability space and let Xt : Ω → A be a measurable sequence which represents the successive states of the algorithm. The global minimization task usually stands for either generating a sequence xt ∈ A which converges towards the set A* of solutions of the global minimization problem f(x) = 0, or generating a sequence xt ∈ A which satisfies f(xt) → 0.

We will say that a sequence Xt : Ω → A, where t = 0, 1, . . ., stochastically converges to A* ⊂ A, which we will denote by Xt →s A*, iff

∀ε > 0 lim_{t→∞} P(d(Xt, A*) < ε) = 1.

For monotonic methods (methods which satisfy f(Xt+1) ≤ f(Xt)) the stronger condition P(d(Xt, A*) → 0) = 1 is usually considered in the literature.

Recall that f(x) < δ iff x ∈ A(δ). The following condition (the convergence of f(Xt) in probability to 0) is often considered in the context of global minimization:

(3.1) ∀δ > 0 P(Xt ∈ A(δ)) → 1.

Naturally, some algorithms satisfy condition (3.1) and do not satisfy the stronger convergence mode P(f(Xt) → f*) = 1. For an example, see Theorems 1 and 2 in [6], which say that under appropriate assumptions the sequence f(Xt) generated by the Simulated Annealing algorithm does not converge to zero almost surely but still satisfies (3.1). If we apply Observation 3.2 to the results of [6] we obtain that this is also an example of a method which converges stochastically towards A* but not with probability one.

Throughout this chapter we assume that the function f satisfies the following conditions:


A1) ∀ε > 0 ∃δ > 0 A(δ) ⊂ A*(ε),
A2) ∀δ > 0 ∃ε > 0 A*(ε) ⊂ A(δ).

The conditions A1) and A2) are rather natural and imply that

d(xt, A*) → 0 ⇔ f(xt) → 0,

for any sequence xt ∈ A.

Observation 3.1. If for some δ0 > 0 the sublevel set Aδ0 is compact and the function f is continuous on Aδ0, then conditions A1) and A2) are satisfied. In fact, under these assumptions f is uniformly continuous on Aδ0, which implies A2). The function f is also bounded away from the minimal value f* = 0 outside any set A*(ε), which follows from the continuity of f and the compactness of the sets Aδ0 \ A*(ε) for any ε > 0 small enough, and proves A1).
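As a concrete illustration (a sketch, not part of the thesis), take A = [−1, 1] and f(x) = x², so A* = {0}: then A1) holds with the choice δ = ε² and A2) with ε = √δ, which the following grid check confirms. The helper names are illustrative.

```python
import math

# f(x) = x**2 on A = [-1, 1]; A* = {0}, A(delta) = {x : f(x) < delta},
# A*(eps) = {x : d(x, A*) < eps} = {x : |x| < eps}.
def in_A_delta(x, delta):      # x in A(delta)  <=>  f(x) < delta
    return x * x < delta

def in_A_star_eps(x, eps):     # x in A*(eps)   <=>  |x| < eps
    return abs(x) < eps

grid = [i / 1000.0 for i in range(-1000, 1001)]

# A1): given eps, the choice delta = eps**2 gives A(delta) subset A*(eps)
for eps in (0.5, 0.1, 0.01):
    delta = eps ** 2
    assert all(in_A_star_eps(x, eps) for x in grid if in_A_delta(x, delta))

# A2): given delta, the choice eps = sqrt(delta) gives A*(eps) subset A(delta)
for delta in (0.25, 0.01):
    eps = math.sqrt(delta)
    assert all(in_A_delta(x, delta) for x in grid if in_A_star_eps(x, eps))
```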

The observation below states, in particular, that under A1) and A2) both above mentioned interpretations of the global minimization problem are equivalent. Observation 3.2 is simple but generalizes many existing observations. For example, in the case where A* is a singleton, the equivalence from statement (1) was noticed in [6], and the second equivalence was proved in [35]. Under assumptions closely related to A1), A2), the equivalence from statement (2) was noticed in [32].

Observation 3.2. Assume that the function f : A → R satisfies conditions A1) and A2). Then:

(1) The following conditions are equivalent:
(a) Xt converges stochastically to A*,
(b) the probability distributions of Xt converge towards M* = {µ ∈ M(A) | µ(A*) = 1} in the Prohorov metric,
(c) f(Xt) converges in probability to 0,
(d) f(Xt) converges to 0 in distribution.

(2) The following conditions are equivalent:
(a) f(Xt) → 0 almost surely,
(b) d(Xt, A*) → 0 almost surely.

(3) Assume additionally that the measurable functions f(Xt) and d(Xt, A*) are bounded from above by some measurable function Z : Ω → [0, +∞) with E(Z) < ∞. Then the following conditions are equivalent:
(a) E(f(Xt)) → 0,
(b) E(d(Xt, A*)) → 0.
Additionally, under the above boundedness assumption, they are equivalent to conditions 1(a), 1(b), 1(c), 1(d).

Proof. To prove the first statement note that A1) and A2) imply that

∀ε > 0 ∃δ > 0 P(Xt ∈ A*(ε)) ≥ P(Xt ∈ A(δ))

and

∀δ > 0 ∃ε > 0 P(Xt ∈ A(δ)) ≥ P(Xt ∈ A*(ε)).


This proves the equivalence between conditions 1(a) and 1(c). Condition 1(c) means that f(Xt) goes in probability to the constant limit 0. The limit is one-point distributed and, by Observation 2.1 applied to S = R and D = {0}, this is equivalent to the weak convergence of the probability distributions of f(Xt) towards the Dirac measure δ0; hence the equivalence 1(c) ⇔ 1(d) holds true. The equivalence 1(a) ⇔ 1(b) is also a straightforward conclusion of Observation 2.1. To show the second statement it is enough to note that for any sequence xt ∈ A we have d(xt, A*) → 0 ⇔ f(xt) → 0, which follows directly from A1), A2). To see the third statement it is enough to notice that under the boundedness assumption the sequences f(Xt) and d(Xt, A*) are uniformly integrable, and hence the expected value convergence is equivalent to the convergence in probability for both sequences f(Xt) and d(Xt, A*). □

Remark 3.3. Conditions E(f(Xt)) → 0 and E(d(Xt, A*)) → 0, which are, in general, stronger than 1(a), 1(b), 1(c), 1(d), are not equivalent to each other in the general unbounded situation. For example, consider the function f : [0, ∞) ∋ x → x² ∈ [0, ∞) and the sequence of probability distributions

µn = (n² − 1)/n² · δ_{1/n} + 1/n² · δ_{n}.

The sequence Xt, if distributed according to µt, satisfies E(d(Xt, A*)) → 0 but E(f(Xt)) ≥ 1.
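The expectations in this counterexample can be checked directly (a short numerical sketch with illustrative names; here A* = {0}, so d(X, A*) = X):

```python
# Exact expectations under mu_n = ((n^2-1)/n^2) * delta_{1/n} + (1/n^2) * delta_{n}
# for f(x) = x**2 on [0, infinity).
def expectations(n):
    p_small, p_big = (n * n - 1) / n ** 2, 1 / n ** 2
    e_dist = p_small * (1 / n) + p_big * n            # E d(X, A*) = E X
    e_f = p_small * (1 / n) ** 2 + p_big * n ** 2     # E f(X) = E X**2
    return e_dist, e_f

for n in (10, 100, 1000):
    e_dist, e_f = expectations(n)
    assert e_f >= 1.0              # E f(X_n) stays bounded away from 0 ...
assert expectations(1000)[0] < 0.01  # ... while E d(X_n, A*) -> 0
```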

We will now focus on methods which satisfy the following supermartingale inequality:

E(f(Xt+1) | f(Xt), f(Xt−1), . . . , f(X0)) ≤ f(Xt) a. s.

The above inequality follows from the stronger condition

E(f(Xt+1) | Xt, Xt−1, . . . , X0) ≤ f(Xt) a. s.,

which is easier to verify in practice. In particular, if the sequence Xt is a Markov chain, then the above supermartingale-type inequalities follow from the following inequality:

(3.2) E(f(Xt+1) | Xt = x) ≤ f(x), x ∈ A.

Lemma 3.4. Assume that the sequence Xt is a Markov chain with E(f(Xt)) < +∞, t ∈ N. If inequality (3.2) is satisfied, then f(Xt) is a supermartingale.

Proof. Let Σt = σ(f(Xt), . . . , f(X0)), t ∈ N. Since Xt is a Markov chain, we have

E(f(Xt+1) | Xt, . . . , X0) = E(f(Xt+1) | Xt) a. s.

and, since Σt ⊂ σ(Xt, . . . , X0),

E(f(Xt+1) | Σt) = E(E(f(Xt+1) | Xt, . . . , X0) | Σt) = E(E(f(Xt+1) | Xt) | Σt).

From the monotonicity of the conditional expectation it follows that it is enough to show E(f(Xt+1) | Xt) ≤ f(Xt) a.s. We will show that this follows from inequality (3.2). In fact, as E(f(Xt+1) | Xt) is measurable with respect to σ(Xt), there is a Borel function h : A → R such that E(f(Xt+1) | Xt) = h(Xt). Hence we have E(f(Xt+1) | Xt = x) = h(x) ≤ f(x).

This implies E(f(Xt+1) | Xt) ≤ f(Xt), which finishes the proof. □

Remark 3.5. Monotonic algorithms (in which the sequence f(Xt) decreases along time a.s.) possess the supermartingale property. Furthermore, many optimization methods "remember" the best (in the sense of the cost function) point found so far and therefore can be treated as monotonic methods. In particular, the convergence of the sequence X_{τ(t)} is often considered, where

τ(t) = min{i ≤ t : f(Xi) = min_{j=0,...,t} f(Xj)}.

Clearly X_{τ(t)} is measurable, monotonic and satisfies f(X_{τ(t)}) = min_{i=0,...,t} f(Xi).
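The best-so-far construction X_{τ(t)} can be sketched in a few lines (illustrative Python with hypothetical names): the strict comparison keeps the earliest minimizer, matching the "min{i ≤ t : ...}" in the definition of τ(t).

```python
import random

def best_so_far(xs, f):
    """Return the sequence X_{tau(t)}: at each t, the earliest best point so far."""
    best, out = None, []
    for x in xs:
        if best is None or f(x) < f(best):   # strict '<' keeps the earliest minimizer
            best = x
        out.append(best)
    return out

f = lambda x: x * x
rng = random.Random(7)
xs = [rng.uniform(-1.0, 1.0) for _ in range(100)]  # raw, non-monotonic states X_t
ys = best_so_far(xs, f)
vals = [f(y) for y in ys]
```

By construction vals is nonincreasing and its last entry equals min over i ≤ t of f(Xi), so the wrapped sequence is a monotonic method even when the raw one is not.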

Observation 3.2 leads to:

Theorem 3.6. Assume that the function f : A → R satisfies A1), A2). If f(Xt) is a supermartingale, then the following conditions are equivalent:

(1) the probability distributions of Xt converge to M* = {µ ∈ M(A) | µ(A*) = 1} in the weak convergence topology,
(2) Xt → A* stochastically,
(3) d(Xt, A*) → 0 with probability one,
(4) f(Xt) converges to 0 in distribution,
(5) f(Xt) converges to 0 in probability,
(6) f(Xt) converges to 0 with probability one.

If we assume additionally that the measurable functions f(Xt) and d(Xt, A*) are bounded from above by some measurable function Z : Ω → [0, +∞) with E(Z) < ∞, then the above conditions (1)–(6) are equivalent to the following conditions:

(7) E(d(Xt, A*)) → 0,
(8) E(f(Xt)) ↘ 0.

Proof. The equivalences 4) ⇔ 5) ⇔ 6) are a simple consequence of supermartingale properties: to show 4) ⇔ 5) ⇔ 6) it is enough to notice 4) ⇒ 6), which, as the inequality 0 ≤ E(f(Xt+1)) ≤ E(f(Xt)) ≤ E(f(X0)) < ∞, t ∈ N, is satisfied, follows from the fact that an L¹-bounded supermartingale f(Xt) converges almost surely to a measurable limit and the fact that the limiting probability distribution must be unique. Observation 3.2 completes the proof of the first part of the theorem. To prove the second part we can apply the Dominated Convergence Theorem to the random variables d(Xt, A*) and f(Xt), which leads to 3) ⇒ 7) and 4) ⇒ 8). The implications 7) ⇒ 2) and 8) ⇒ 5) are clear, as convergence in mean is stronger than convergence in probability. □

Remark 3.7. The above theorem assumes that f(Xt) is a supermartingale and, in particular, E(f(X0)) < ∞. This assumption is satisfied, for example, in the natural situation in which the starting point X0 = x0 is fixed.


Remark 3.8. If f(Xt) is a monotonic sequence with E f(X0) < ∞, then it is a supermartingale bounded from above by Z = f(X0). Thus, in the monotonic case, condition (8) is equivalent to conditions (1)–(6).

Remark 3.9. For any Borel probability measures µ1 and µ2 on A let

‖µ1 − µ2‖ = sup_{B∈B(A)} |µ1(B) − µ2(B)|

denote the total variation distance. This concept, natural in the analysis of the convergence of Markov chains towards the unique stationary distribution under appropriate irreducibility-type assumptions, is not applicable to the convergence analysis in the presented context. In the standard situation A ⊂ Rⁿ the probability distributions µt of Xt are absolutely continuous with respect to the Lebesgue measure µ and often we have µ(A*) = 0, which leads to ‖µt − m‖ = 1 for any m ∈ M*.

4. Main Result and Consequences

In this section we formulate the main result of this thesis, Theorem 4.3, and then show some of its consequences. In particular, we prove that the main results of the previous papers [27], [28], [29] are consequences of Theorem 4.3.

From now on we assume that the function f : A → [0, ∞) is continuous. Let (B, dB) be a separable metric space. Let the sequence of random variables Xt : Ω → A, t ∈ N, be defined by the following nonautonomous equation:

(4.1) Xt+1 = Tt(Xt, Yt),

where

• Yt : Ω → B, t ∈ N, are random variables,
• Tt : A × B → A, t ∈ N, are Borel measurable,
• the random variables X0, Y0, Y1, . . . are independent.

Remark 4.1. For a given optimization method it is possible to construct various theoretical models (4.1). The results presented in this section provide general sufficient conditions for the convergence of Xt towards A*. They can be more or less convenient to use depending on the choice of a model (4.1).

Remark 4.2. Given a stochastic process Xt of the form (4.1), it is easy to find a separable space B̄ and measurable functions Ȳt : Ω → B̄ and T : A × B̄ → A such that Xt+1 = T(Xt, Ȳt). The easiest way is to put B̄ = B × N (which is separable and metrizable), Ȳt = (Yt, t) and T(x, y, t) = Tt(x, y). More generally, it can be useful to consider B̄ = B × T, where T ⊂ M(A × B, A), and T(x, y, S) = S(x, y), S ∈ T. However, in some cases theoretical representations for stochastic algorithms with Tt changing in time arise naturally. The general results presented in this section will thus concern the general situation of the form (4.1).


Let T = M(A × B, A) denote the topological space of all measurable operators T : A × B → A equipped with the topology of uniform convergence: a sequence {Tn}n∈N ⊂ T converges to a limit T ∈ T iff

sup_{(a,b)∈A×B} d(Tn(a, b), T(a, b)) → 0 as n → ∞.

We will focus mostly on the case where the space A is compact, thus bounded, in which the above topology is induced by the uniform convergence metric.

Let N = M(B) denote the topological space of Borel probability measures on B equipped with the weak convergence topology. The space T × N is endowed with the product topology. Let νt = P_{Yt} denote the distribution of Yt, t = 0, 1, . . .. It is easy to see that the distributions of Xt are determined by the initial distribution µ0 of X0 and the sequence {(Tt, νt)}_{t=0}^∞. The assumptions of the theorems presented in this section describe the relations between the pairs (Tt, νt) and the function f under which the convergence of the process Xt towards A* occurs.

For any δ > 0 we define the sets U(δ) ⊂ T × N and U0(δ) ⊂ T × N as follows:

(T, ν) ∈ U(δ) ⇐⇒def  ∫_B f(T(x, y)) ν(dy) ≤ f(x) for x ∉ A(δ)  and  ∫_B f(T(x, y)) ν(dy) ≤ δ for x ∈ A(δ),

(T, ν) ∈ U0(δ) ⇐⇒def  ∫_B f(T(x, y)) ν(dy) < f(x) for x ∉ A(δ)  and  ∫_B f(T(x, y)) ν(dy) ≤ δ for x ∈ A(δ).

Recall that a function F : S → R given on a metric space S is upper semi-continuous (lower semi-continuous) at x0 ∈ S iff for any sequence xn ∈ S, if xn → x0, then lim sup_{n→∞} F(xn) ≤ F(x0) (respectively, lim inf_{n→∞} F(xn) ≥ F(x0)). Recall that a family of sets {Un}n∈N is called a decreasing family iff Un+1 ⊂ Un, n ∈ N.

To simplify the notation in Theorem 4.3, we adopt the empty sum convention Σ_{i=t}^{t−1} δi := 0, t ∈ N, where δi ∈ R is a sequence. This theorem is the main result of this thesis.

Theorem 4.3. Assume that A is a compact metric space. Let {U0^k}k∈N be a decreasing family of compact sets with U0^k ⊂ U0(1/k) such that the following conditions are satisfied:

(A1) for any k ∈ N, any pair (T, ν) ∈ U0^k and x ∈ A, f ∘ T is upper semi-continuous at (x, y) for ν-almost any y ∈ B,

(B1) ∀t ∈ N (Tt, νt) ∈ U(δt), where δt > 0 is a sequence with δt → 0,


(C1) for any k ∈ N the sequence (Tt, νt) contains a subsequence (T_{t_n^k}, ν_{t_n^k}) ∈ U0^k such that lim_{n→∞} S_n^k = 0, where:

S_n^k = Σ_{i = t_n^k + 1}^{t_{n+1}^k − 1} δi.

Then

∀ε > 0 P(d(Xt, A*) < ε) → 1 and E f(Xt) → 0 as t → ∞.

Remark 4.4. For t ∈ N and x ∈ A, we have

(4.2) ∫_B f(Tt(x, y)) νt(dy) = E(f(Xt+1) | Xt = x).

In fact, since Xt and Yt are independent, we have

E(f(Xt+1) | Xt) = E(f(Tt(Xt, Yt)) | Xt) = ∫_B f(Tt(Xt, y)) νt(dy).

This makes conditions (B1) and (C1) more intuitive. In particular,

(Tt, νt) ∈ U(δ) ⇐⇒def  E(f(Xt+1) | Xt = x) ≤ f(x) for x ∉ A(δ)  and  E(f(Xt+1) | Xt = x) ≤ δ for x ∈ A(δ).

Note that S_n^k = Σ_{i∈C_n^k} δi, where C_n^k is the set of indices i ∈ N between t_n^k and t_{n+1}^k, i.e. between the n-th and the (n + 1)-th visit in the set U0^k. Thus C_n^k can be empty, which would simplify the analysis. The following theorem is a conclusion of this simple observation.

Theorem 4.5. Assume that A is a compact metric space. Let {U0^k}k∈N be a decreasing family of compact sets with U0^k ⊂ U0(1/k) and such that the following conditions are satisfied:

(A1) for any k ∈ N, (T, ν) ∈ U0^k and x ∈ A, f ∘ T is upper semi-continuous at (x, y) for ν-almost any y ∈ B,

(C2) for any t ∈ N, (Tt, νt) belongs to U0^{kt}, where kt is a sequence with kt → ∞.

Then

∀ε > 0 P(d(Xt, A*) < ε) → 1 and E f(Xt) → 0 as t → ∞.

Proof. We will use Theorem 4.3. As U0^{kt} ⊂ U(1/kt), condition (B1) of Theorem 4.3 is satisfied with δt = 1/kt. To prove that condition (C1) follows from condition (C2) it is enough to note that for any k ∈ N almost all elements of the sequence {S_n^k}n∈N are equal to 0, as the family {U0^k}k∈N is decreasing. □


Theorems 4.6 and 4.7 are the main results of paper [28]. Theorem 4.6 is presented here with a strengthened conclusion.

Theorem 4.6. Assume that A is compact and that U0 ⊂ T × N is a compact set such that the following conditions are satisfied:

(A) for any (T, ν) ∈ U0 and x ∈ A, f ∘ T is upper semi-continuous at (x, y) for any y from some set of full measure ν,

(B) for any x ∈ A and t ∈ N,

(4.3) ∫_B f(Tt(x, y)) νt(dy) ≤ f(x),

(C) for any (T, ν) ∈ U0 and x ∈ A \ A*,

(4.4) ∫_B f(T(x, y)) ν(dy) < f(x).

If the sequence (Tt, νt) contains a subsequence (T_{tn}, ν_{tn}) ∈ U0, then d(Xt, A*) → 0 and f(Xt) → 0 almost surely.

From equation (4.2) it follows that conditions (B) and (C) take the following more intuitive form:

E(f(Xt+1) | Xt = x) ≤ f(x) and E(f(Xt+1) | Xt = x) < f(x).

Thus, from Lemma 3.4 it follows that under condition (B) the sequence f(Xt) is a supermartingale.

Proof of Theorem 4.6. We will use Theorem 4.3. To see that condition (B1) is satisfied it is enough to note that (Tt, νt) ∈ ∩_{δ>0} U(δ), t ∈ N. Now, we define the decreasing family of compact sets {U0^k}k∈N by U0^k = U0. The set U0 contains a subsequence (T_{tk}, ν_{tk}) and it is a simple observation that conditions (A1) and (C1) hold true. The assumptions of Theorem 4.3 are thus satisfied, which leads to Xt →s A*. Theorem 3.6 finishes the proof, as condition (B) implies that the sequence f(Xt) is a supermartingale. □

In the monotonic case the compactness of A can be replaced with a softer condition. For any δ > 0 and T : A × B → A let

Tδ = T|_{Aδ×B} : Aδ × B → A, where Aδ = {x ∈ A : f(x) ≤ δ}.

For any U ⊂ T × N and δ > 0 let

(U)δ = {(Tδ, ν) : (T, ν) ∈ U}.

Clearly, if A and U0 ⊂ T × N are compact, then Aδ and (U0)δ are compact for any δ > 0. In the case A = Rⁿ the continuity of f implies that for any δ > 0 the set Aδ is compact iff it is bounded. In this case the compactness of the sets Aδ is equivalent to f(xn) → ∞ for any sequence xn with |xn| → ∞, where |·| is a norm on Rⁿ.


Theorem 4.7. Assume that Aδ is compact for any δ > 0. Let U0 ⊂ T × N be such that conditions (A) and (C) of Theorem 4.6 are satisfied and the set (U0)δ is compact for any δ > 0. Assume additionally:

(B') for any t ∈ N and x ∈ A, y ∈ B,

f(Tt(x, y)) ≤ f(x).

If the sequence ut = (Tt, νt) contains a subsequence {u_{tn} : n = 0, 1, . . .} ⊂ U0, then

P(d(Xt, A*) → 0, t → ∞) = 1.

Proof. Fix x0 ∈ A. The set (U0)_{f(x0)} is compact. If µ0 = δ_{x0}, then supp µ0 = {x0} ⊂ A_{f(x0)}. The set A_{f(x0)} is compact, Tt(A_{f(x0)} × B) ⊂ A_{f(x0)} for any t ∈ N, and A* ⊂ A_{f(x0)}. Thus, under the assumption µ0 = δ_{x0}, we can use Theorem 4.6 with respect to the function f|_{A_{f(x0)}}. In particular, the sequence f(Xt) converges to 0 in probability.

From equation (4.1) it easily follows that there are measurable mappings T^t : A × B^{t+1} → A, t ∈ N, such that Xt+1 = T^t(X0, Y0, . . . , Yt) (the mappings T^t will be defined formally by equation (4.6)). For any x ∈ A, let Xt(x) denote the sequence defined by Xt+1(x) = Tt(Xt(x), Yt) and X0(x) = x. In other words,

(4.5) Xt+1(x) = T^t(x, Y0, . . . , Yt).

Now let µ0 ∈ M be the probability distribution of X0; X0 is independent of the sequence Yt. Fix δ > 0. We have

P(f(Xt+1) < δ) = P(f(T^t(X0, Y0, . . . , Yt)) < δ) = E(1_{f(T^t(X0,Y0,...,Yt)) < δ}).

From Fubini's Theorem,

E(1_{f(T^t(X0,Y0,...,Yt)) < δ}) = ∫_A E(1_{f(T^t(x,Y0,...,Yt)) < δ}) µ0(dx).

Hence, as E(1_{f(T^t(x,Y0,...,Yt)) < δ}) = P(f(Xt+1(x)) < δ), we have

P(f(Xt+1) < δ) = ∫_A P(f(Xt+1(x)) < δ) µ0(dx).

We have shown above that the functions ϕt(x) := P(f(Xt(x)) < δ) satisfy ϕt(x) → 1 for any x ∈ A. We can thus use the Dominated Convergence Theorem to obtain

P(f(Xt+1) < δ) = ∫_A P(f(Xt+1(x)) < δ) µ0(dx) → 1 as t → ∞.

As δ > 0 has been chosen arbitrarily, f(Xt) goes to 0 in probability. Theorem 3.6 completes the proof. □


Condition (C) of Theorem 4.6 implies that, given Xt = x, the optimization method is able to reach a region with values smaller than f(x) in a single step. This condition is replaced with the n-step inequality condition (C') in the next theorem.

For any T1: A × B^k → A, where k ∈ N, and T2: A × B → A, let

T2 ◦ T1: A × B^{k+1} ∋ (a, b1, . . . , bk+1) −→ T2(T1(a, b1, . . . , bk), bk+1) ∈ A.

For any s ∈ N \ {0} and (T0, . . . , Ts) ∈ T^{s+1} we define

(4.6) Ts,...,0 = Ts ◦ . . . ◦ T0: A × B^{s+1} → A.

We will write T^0 := T0 and T^t := Tt ◦ . . . ◦ T0, t ∈ N \ {0}. We have:

Xt+1 = T^t(X0, Y0, . . . , Yt).

Theorems 4.8 and 4.9 are the main results of [29]. Theorem 4.8 is presented here with a strengthened thesis.

Theorem 4.8. Assume that A is compact. Let U0 ⊂ T × N be a compact set such that:

(A') for any (T, ν) ∈ U0 and x ∈ A, T is continuous at (x, y) for any y from some set of full measure ν,

(B) for any t ∈ N and x ∈ A

(4.7) ∫_B f(Tt(x, y)) νt(dy) ≤ f(x),

(C') there is s ∈ N \ {0} such that for any {(T̄i, ν̄i) : i = 1, . . . , s} ⊂ U0 and x ∈ A \ A?

(4.8) ∫_{B^s} f(T̄^s(x, y1, . . . , ys)) (ν̄s × · · · × ν̄1)(dys, . . . , dy1) < f(x),

where T̄^s = T̄s,...,1 is defined by (4.6).

If there is a subsequence utk = (Ttk, νtk) such that (utk, . . . , utk+(s−1)) ∈ U0 × · · · × U0, then

P(d(Xt, A?) → 0) = 1 and P(f(Xt) → 0) = 1.

Condition (C') means that E(f(Xtk+s) | Xtk = x) < f(x) for x ∈ A \ A?, where tk is a subsequence which satisfies the assumptions of the theorem.
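For intuition, this expected-decrease condition can be checked numerically for a concrete one-dimensional operator. The Python sketch below is illustrative only: the elitist Gaussian-mutation operator T, the quadratic problem function and all parameter values are assumptions for the example, not objects from the text. It estimates E(f(Xt+s) | Xt = x) by Monte Carlo and confirms a strict decrease at a non-optimal point.

```python
import random

def f(x):
    """Illustrative problem function with A* = {0} and min f = 0."""
    return x * x

def T(x, y):
    """One elitist step: accept the mutated point x + y only if it improves f."""
    z = x + y
    return z if f(z) < f(x) else x

def expected_f_after_steps(x, s, trials=20000):
    """Monte Carlo estimate of E(f(X_{t+s}) | X_t = x), mutations ~ N(0, 0.5^2)."""
    total = 0.0
    for _ in range(trials):
        z = x
        for _ in range(s):
            z = T(z, random.gauss(0.0, 0.5))
        total += f(z)
    return total / trials

random.seed(1)
x0 = 1.0
est = expected_f_after_steps(x0, s=3)
print(est < f(x0))  # strict expected decrease at the non-optimal point x0
```

Because the operator is elitist, f never increases along a trajectory, so condition (B) holds automatically; the strict inequality reflects the positive probability of improvement under Gaussian mutation.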

Condition (A') is stronger than (A) and this influences the applicability of the results. For instance, consider the simple Pure Random Search method (PRS), which at every step t samples a candidate point yt from some distribution ν, and then the operator T chooses from {xt, yt} the point with the smaller value of f. Note that T is continuous except on the level sets lc = f^{−1}({c}), which can be of positive ν measure, but the function f ◦ T(x, y) = min{f(x), f(y)} is continuous everywhere.
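A minimal sketch of this PRS selection mechanism (Python; the quadratic problem function and the uniform sampling distribution ν are illustrative assumptions):

```python
import random

def prs(f, sample, steps, x0):
    """Pure Random Search: propose y ~ nu, keep the better of {x, y}."""
    x = x0
    for _ in range(steps):
        y = sample()
        # The operator T chooses from {x, y} the point with the smaller
        # f-value, so f(T(x, y)) = min{f(x), f(y)} and f(x_t) is non-increasing.
        if f(y) < f(x):
            x = y
    return x

# Illustrative assumptions: f(x) = x^2 on A = [-1, 1], nu = Uniform(-1, 1).
random.seed(0)
best = prs(lambda x: x * x, lambda: random.uniform(-1.0, 1.0), 2000, 1.0)
print(best)  # a point near A* = {0} with overwhelming probability
```

The selection step is exactly the discontinuous operator discussed above: T itself jumps across level sets, while f ◦ T is continuous.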

Proof of Theorem 4.8. The proof is based on a technical construction. We will work under the assumptions and notation of Theorem 4.8 and the proof will use Theorem 4.6. Let p1: U0 → T be the projection onto the first coordinate.

Let

T0 = p1(U0) ∪ {Tt : t ∈ N} and B̃ = B^s × T0^s.

Note that B̃, equipped with the product metric, is a separable metric space. Let

T: A × B^s × T^s ∋ (x, b1, . . . , bs, T1, . . . , Ts) −→ Ts,...,1(x, b1, . . . , bs) ∈ A,

where Ts,...,1 is defined as in equation (4.6).

Without loss of generality we assume that the subsequence (Ttk, νtk) ∈ U0 satisfies tk+1 ≥ tk + s. Define a0 = 0 and

at+1 = at + s if at ∈ {tk : k ∈ N},  at+1 = at + 1 if at ∉ {tk : k ∈ N}.

Let

IA: A × B ∋ (a, b) −→ a ∈ A.
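The recursion for at simply advances by s at the distinguished indices tk and by 1 elsewhere, so each s-step block starting at some tk is traversed in one jump. A direct transcription (Python; the particular values of s and tk are illustrative and chosen so that tk+1 ≥ tk + s):

```python
def block_schedule(tks, s, steps):
    """Generate a_0, a_1, ...: advance by s at the distinguished indices tks
    (where an s-step block of the original chain is grouped), else by 1."""
    tk_set = set(tks)
    a, out = 0, [0]
    for _ in range(steps):
        a = a + s if a in tk_set else a + 1
        out.append(a)
    return out

# Illustrative choice satisfying t_{k+1} >= t_k + s for s = 2.
print(block_schedule([3, 5, 9], s=2, steps=8))  # → [0, 1, 2, 3, 5, 7, 8, 9, 11]
```

The spacing assumption tk+1 ≥ tk + s guarantees that a block started at tk never overlaps the next distinguished index.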

The subsequence Xat satisfies

Xa(t+1) = T(Xat, Ŷat, T̂at),

where

T̂at := (Tat, IA, . . . , IA) (s components), Ŷat := (Yat, . . . , Yat) (s copies) for at ∉ {tk : k ∈ N},

and

T̂at := (Tat, . . . , Tat+(s−1)), Ŷat := (Yat, . . . , Yat+(s−1)) for at ∈ {tk : k ∈ N}.

The sequences X̄t = Xat, Ȳt = (Ŷat, T̂at) satisfy

(4.9) X̄t+1 = T(X̄t, Ŷat, T̂at) = T(X̄t, Ȳt).

We will show that equation (4.9) satisfies the assumptions of Theorem 4.6. Note that Ȳt: Ω → B^s × T0^s is an independent sequence. Define

M(B^s × T0^s) ⊃ Ũ0² = {(νs × · · · × ν1) × (δTs × · · · × δT1) : (Ti, νi) ∈ U0, i = 1, . . . , s}.

The set Ũ0² is compact as a continuous image of the compact set U0 × · · · × U0 and hence the set Ũ0 = {T} × Ũ0² is compact. Furthermore, directly from the construction (and assumption (C') of Theorem 4.8), the sequence ν̄t = PȲt contains a subsequence ν̄bk (bk is an increasing sequence of natural numbers) of the form

ν̄bk = (ν(tk+s−1) × · · · × νtk) × (δT(tk+s−1) × · · · × δTtk) ∈ Ũ0².

It remains to show that the sequence X̄t and the set Ũ0 satisfy conditions (A), (B), (C) of Theorem 4.6. Conditions (B) and (C) of Theorem 4.6 follow immediately from assumptions (B), (C') of Theorem 4.8, the construction of the algorithm X̄t and Fubini's Theorem. Now we will show that condition (A') of Theorem 4.8 implies that for any x ∈ A and any ν̄ ∈ Ũ0² ⊂ M(B^s × T0^s) the map T is continuous at (x, y) for ν̄-almost any y = (b, T) ∈ B̃ = B^s × T0^s, which proves condition (A) of Theorem 4.6. In fact, if (T1^i, . . . , Ts^i) → (T1, . . . , Ts) uniformly and b^i = (b1^i, . . . , bs^i) → (b1, . . . , bs) = b, then Ts,...,1^i → Ts,...,1 uniformly and, as T(x, b^i, T^i) = Ts,...,1^i(x, b^i), it is enough to see that Ts,...,1(x, b^i) → Ts,...,1(x, b), which is true for any b from some set of full measure ν1 × · · · × νs. Thus Theorem 4.6 can be applied to the sequence X̄t = Xat, which implies that E(f(Xat)) ↘ 0. From condition (B) it follows that the sequence E(f(Xt)) is monotonic and thus we have E(f(Xt)) ↘ 0. Theorem 3.6 completes the proof. □

As before, the assumption of the compactness of A can be relaxed.

Theorem 4.9. Assume that Aδ is compact for any δ > 0. Let U0 ⊂ T × N be such that (U0)δ is compact for any δ > 0 and the conditions (A') and (C') of Theorem 4.8 are satisfied. Assume that

(B') for any (T, ν) ∈ U and x ∈ A, y ∈ B

f(T(x, y)) ≤ f(x).

Let ut = (Tt, νt). If for any t ∈ N there is t0 ≥ t such that for any i ≤ s we have ut0+i ∈ U0, then

P(d(Xt, A?) → 0, t → ∞) = 1.

Proof. Theorem 4.9 follows from Theorem 4.8. To see this it is enough to repeat the argument from the proof of Theorem 4.7. □

The results presented so far concern the attractiveness of the set A?. The theorem below is an additional result which concerns stability-type properties of A?. This result will be proved in Section 7.

Theorem 4.10. Let Xt be a process defined by (4.1), f: A → R be continuous and let A be compact.

(1) Under the assumptions of Theorem 4.3 we have:

∀ε > 0 ∃δ > 0 ∃t0 ∀t > t0: P(d(Xt, A?) < δ) ≥ 1 − δ =⇒ P(d(Xt+s, A?) < ε) ≥ 1 − ε, s ∈ N.

(2) Under condition (B) of Theorem 4.6 we have:

∀ε > 0 ∃δ > 0: P(d(X0, A?) < δ) ≥ 1 − δ =⇒ P(d(Xt, A?) < ε) ≥ 1 − ε, t ∈ N.

(3) Under condition (B') of Theorem 4.7 we have:

∀ε > 0 ∃δ > 0: P(d(Xt, A?) < ε) ≥ P(d(X0, A?) < δ), t ∈ N.


Remark 4.11. In the context of global optimization the sequence Xt represents the successive states of the algorithm. To simplify the formulation of the presented results we assumed that it takes values in the domain A, which is often not satisfied in practice. For example, Xt can represent a population of individuals and then Xt = (Xt1, . . . , Xtk) ∈ A^k for some k ≥ 1. We can still apply the theorems to the global minimization problem given on the set A^k with respect to functions like

max_{i=1,...,k} f(xi)  or  Σ_{i=1}^{k} f(xi).
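Both liftings can be transcribed directly. In the Python sketch below the problem function f and the population are illustrative assumptions; either lifted function vanishes exactly when every individual of the population is a global minimizer.

```python
def lift_max(f, population):
    """F(x1, ..., xk) = max_i f(xi); vanishes exactly on (A*)^k."""
    return max(f(x) for x in population)

def lift_sum(f, population):
    """F(x1, ..., xk) = sum_i f(xi); also vanishes exactly on (A*)^k."""
    return sum(f(x) for x in population)

f = lambda x: x * x  # illustrative problem function with A* = {0}
pop = [0.0, 0.5, -0.2]
print(lift_max(f, pop))            # → 0.25
print(round(lift_sum(f, pop), 2))  # → 0.29
```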

Remark 4.12. Deterministic parameters of a nonautonomous optimization method determine the mappings Tt and the distributions νt, see Section 8 or Section 10 for some practical examples. Their values can change over time – for example, the algorithm can gradually move from the global search phase to the local search phase. However, some optimization schemes, like evolutionary algorithms, use self-adaptive mechanisms, see Section 9. In this case the non-deterministic parameters of the algorithm (self-adaptive parameters) take their values in some space C (in practice C is a subset of R^l) and then we can assume that Xt = (X̂t, Ct): Ω → A^k × C. If C is compact, then we can apply the theorems to the set A^k × C, considering for example the function

A^k × C ∋ (x, c) −→ Σ_{i=1}^{k} f(xi) ∈ R.

In some cases, if C is not a compact space, it is still possible to consider a compactification of C and then apply the theorems. However, the presented approach to the convergence analysis of self-adaptive strategies ignores the stochastic mechanism of the self-adaptive parameters and provides sufficient convergence conditions which depend only on the set C of values of the self-adaptive parameters.

Remark 4.13. Some authors consider the current best (in the sense of the cost function) iterate convergence towards global minima, which is equivalent to analyzing the condition P(min_{i=0,1,...,t} f(Xi) ↘ 0) = 1. To analyze this convergence mode (which is weaker than E(f(Xt)) → 0) we can consider the sequence X̄t = (Xt, X̂t), where X̂t = Xkt and kt is the smallest natural number with f(Xkt) = min_{i=0,1,...,t} f(Xi). More formally,

X̄t+1 = (Tt(Xt, Yt), T(Tt(Xt, Yt), X̂t)),

where T: A × A → A chooses the point with the smaller value of f. The best iterate convergence of Xt is equivalent to the convergence of X̄t to A × A?, which is the set of global minima of the function f̄(x1, x2) = f(x2).
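The augmented chain X̄t can be sketched as follows (Python). The base method, a blind uniform sampler, and the problem function are illustrative assumptions, not part of the construction:

```python
import random

def best_iterate_chain(f, step, x0, steps):
    """Run Xbar_t = (X_t, Xhat_t), where Xhat_t is the best iterate so far:
    Xbar_{t+1} = (T_t(X_t, Y_t), T(T_t(X_t, Y_t), Xhat_t)) and
    T: A x A -> A returns the point with the smaller f-value."""
    x = xhat = x0
    for t in range(steps):
        x = step(x, t)                        # X_{t+1} = T_t(X_t, Y_t)
        xhat = x if f(x) < f(xhat) else xhat  # keep the better point
    return x, xhat

# Illustrative base method: blind uniform sampling on A = [-1, 1].
random.seed(2)
f = lambda x: x * x
x, xhat = best_iterate_chain(f, lambda x, t: random.uniform(-1, 1), 1.0, 500)
print(f(xhat) <= f(x))  # the best-so-far value never exceeds the current one
```

Convergence of the second coordinate to A? is exactly the best-iterate convergence described above, even when the first coordinate does not converge at all.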


5. Weak Convergence of Borel Probability Measures

First we will present some well-known properties of weak convergence which we will use in the remainder of this section. Let S1 and S2 be separable metric spaces. If h: S1 → S2 is a Borel function, then for any µ ∈ M(S1), µh^{−1} denotes the Borel probability measure on S2 defined by µh^{−1}(C) = µ(h^{−1}(C)) for any C ∈ B(S2). For µ ∈ M(S1) and ν ∈ M(S2), µ × ν denotes the product of the measures µ and ν, which is uniquely characterized by (µ × ν)(C × D) = µ(C) · ν(D) for all C ∈ B(S1), D ∈ B(S2).

As S1 and S2 are separable, we have B(S1 × S2) = B(S1) ⊗ B(S2) = σ(A1 × A2 : A1 ∈ B(S1), A2 ∈ B(S2)). For any x ∈ S1 and a Borel set D ⊂ S1 × S2 the section Dx = {y ∈ S2 : (x, y) ∈ D} ⊂ S2 is Borel and we have

(5.1) (µ × ν)(D) = ∫_{S1} ν(Dx) µ(dx).

For any h: S1 → S2, let Dh = {x ∈ S1 : h is not continuous at x}.

If S2 = R, then we write Dhu = {x ∈ S1 : h is not upper semi-continuous at x}.

Lemma 5.1 (Theorem 2.8 in [9]). Let µn, νn be sequences of Borel probability measures on separable metric spaces S1, S2 respectively, with µn → µ and νn → ν for some µ ∈ M(S1) and ν ∈ M(S2). Then µn × νn → µ × ν.

Lemma 5.2 (Theorem 2.7 in [9]). Assume that S1, S2 are metric spaces, µ ∈ M (S1) and T : S1 → S2 is measurable with µ(DT) = 0. Then, for any sequence µn of Borel probability measures on S1, if µn→ µ, then µnT−1 → µT−1.

The following lemma generalizes the weak convergence condition given by (2.2).

Lemma 5.3. Assume that µn is a sequence of Borel probability measures on a metric space S with µn → µ for some µ ∈ M(S). Then for any measurable function h: S → R which is bounded from above, if µ(Dhu) = 0, then

lim sup_{n→∞} ∫_S h dµn ≤ ∫_S h dµ.

Proof. For any x0 ∈ S define ĥ(x0) = lim_{ε→0+} sup_{x∈B(x0,ε)} h(x). It is a simple observation that h is upper semi-continuous at x0 iff h(x0) = ĥ(x0), and that the function ĥ is itself upper semi-continuous on S. Since h ≤ ĥ, µn → µ and µ(Dhu) = 0, we have:

lim sup_{n→∞} ∫_S h dµn ≤ lim sup_{n→∞} ∫_S ĥ dµn ≤ ∫_S ĥ dµ = ∫_{S \ Dhu} ĥ dµ = ∫_{S \ Dhu} h dµ = ∫_S h dµ.
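The envelope ĥ from the proof can also be approximated numerically on a grid: ĥ(x0) is the limiting supremum of h over shrinking balls around x0. In the Python sketch below the step function h (for which Dhu = {0}) and the grid parameters are illustrative choices.

```python
def usc_envelope(h, x0, eps=1e-6, grid=1000):
    """Grid approximation of hhat(x0) = lim_{eps->0+} sup_{|x-x0|<eps} h(x)."""
    pts = [x0 + eps * (2.0 * i / grid - 1.0) for i in range(grid + 1)]
    return max(h(x) for x in pts)

h = lambda x: 1.0 if x < 0 else 0.0  # upper semi-continuous except at x = 0

print(usc_envelope(h, -1.0))  # → 1.0 (h is continuous here, so hhat = h)
print(usc_envelope(h, 1.0))   # → 0.0
print(usc_envelope(h, 0.0))   # → 1.0 (hhat(0) = 1 > h(0) = 0, so 0 lies in Dhu)
```

The three evaluations illustrate the two facts used in the proof: ĥ agrees with h wherever h is upper semi-continuous, and dominates it on Dhu.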


