
Faculty of Mathematics and Informatics
Institute of Mathematics

Dawid Tarłowski

Nonautonomous Dynamical Systems in Stochastic Global Optimization

PhD thesis
written under the supervision of prof. dr hab. Jerzy Ombach

Kraków 2014


Contents

1. Introduction
2. Weak convergence of Borel probability measures
3. Some Equivalences for Global Convergence
4. Main Result and Consequences
5. Weak Convergence of Borel Probability Measures
6. Some Concepts of Dynamical Systems in Metric Spaces
7. Proof of Main Result
8. Grenade Explosion Method
9. The Evolution Strategy (µ/ρ + λ)
10. Simulated Annealing
11. Accelerated Random Search
12. Appendix
References

1. Introduction

Let (A, d) be a separable metric space and let f : A → R be a continuous function attaining its global minimum min f. We assume that min f = 0. Let

A* = {x ∈ A : f(x) = 0}.

In the context of optimization the function f is often called a problem function and the elements of A* are often called the solutions of the global minimization problem. There are many iterative numerical techniques designed for finding an element of A*. Under some assumptions on the function f, such as differentiability, deterministic optimization techniques [24] can be used for solving optimization problems. However, global minima are usually hard to locate. Stochastic optimization techniques [43, 42, 33] usually do not depend on the smoothness of the function. At the same time, if properly configured, these methods can be very effective in finding global minima. There is a great variety of available techniques, among them genetic and evolutionary algorithms [38, 37, 7, 36], Simulated Annealing (SA) [6, 41, 21, 3] and swarm intelligence algorithms like Particle Swarm Optimization (PSO) [12, 11], Artificial Bee Colony (ABC) [17] or Ant Colony Optimization (ACO) [15]. Grenade Explosion Method (GEM) [2, 1, 30] is a technique proposed quite recently. Many algorithms are new variants or combinations of other methods. Accelerated Random Search (ARS) [5] can be viewed as a modification of Pure Random Search (PRS), while Random Multistart algorithms [43] combine global random search and local deterministic techniques.

All the above mentioned iterative optimization techniques (except for some specific modifications; for instance, non-Markovian versions of Random Multistart are presented in [22]), and many other optimization methods, can be represented as discrete-time inhomogeneous Markov processes of the form

(1.1) xt+1 = Tt(xt, yt), for t = 0, 1, 2, . . . ,

where xt represents the sequence of states successively transformed by the algorithm, yt is the sequence of independently sampled points which represents the probability distributions of the algorithm, and Tt stands for the sequence of the deterministic "methods" of the algorithm. The main aim of this thesis is to provide a general theoretical framework for the study of the convergence of such optimization processes under conditions that can be verified in practice. Some applications are presented.
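To make the representation (1.1) concrete, the recursion can be sketched in a few lines of Python. This is an illustrative toy, not part of the thesis: the names (run_recursion, sample_y) are hypothetical, and the chosen Tt is an elitist rule which keeps the better of the current state and the sampled point, so f(xt) is nonincreasing.

```python
import random

def run_recursion(T_of_t, sample_y, x0, steps, seed=0):
    """Iterate the recursion x_{t+1} = T_t(x_t, y_t) from (1.1).

    T_of_t(t) returns the deterministic "method" T_t of step t;
    sample_y(t, rng) draws the independent sample y_t.
    """
    rng = random.Random(seed)
    x = x0
    for t in range(steps):
        y = sample_y(t, rng)       # independently sampled point y_t
        x = T_of_t(t)(x, y)        # deterministic update T_t
    return x

# Hypothetical instance: minimize f(x) = x^2 over A = [-1, 1].
# Every T_t keeps the better of the two points, so f(x_t) is nonincreasing.
f = lambda x: x * x
T_of_t = lambda t: (lambda x, y: x if f(x) <= f(y) else y)
sample_y = lambda t, rng: rng.uniform(-1.0, 1.0)

x_final = run_recursion(T_of_t, sample_y, x0=1.0, steps=2000)
```

Time dependence enters only through T_of_t and sample_y, so genuinely nonautonomous schemes (e.g. shrinking sampling radii) fit the same skeleton.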

We refer to [10] for the general study of processes of the form (1.1). As stated there, every Markov chain on a separable metric space has such a representation. Recursions of the form (1.1) have been studied for various purposes, including optimization, iterated function systems (IFS), fractals, control theory and other applications. Many examples, which correspond to the time-homogeneous situation, are given in [23, 13, 16, 19]. These references focus mainly on the classical problem regarding the convergence of processes (1.1), namely how to prove convergence to the unique stationary distribution.

This thesis is a continuation of papers [25, 26, 34, 27, 28, 29] which deal with the problem of how to prove that the process given by equation (1.1) converges towards A*. The general approach to this problem remains the same: equation (1.1) induces a nonautonomous dynamical system which acts on the metric space M(A) of Borel probability measures on A and is determined by the family of Foias operators

P(Tt,νt) : M(A) ∋ µ → P(Tt,νt)µ ∈ M(A)

corresponding to equation (1.1). They are given by

P(Tt,νt)µ(C) = (µ × νt)(Tt⁻¹(C)), for C ∈ B(A),

where νt is the probability distribution of yt. The goal is to prove the global attractiveness of the set M* = {µ ∈ M(A) : µ(A*) = 1}, which corresponds to the global convergence of the algorithm. It is not assumed that the Foias operators are continuous (the continuity is equivalent to the Feller property, see [23], [19]) and a weaker assumption is used instead. Thus, the dynamical system corresponding to equation (1.1) is in fact a pseudo-dynamical system, see [31]. In the proof of the main result, Theorem 4.3, the Lyapunov function technique is used, and we consider the function given by V(µ) = ∫ f dµ, µ ∈ M. The previous papers work under the assumption

∫_B f(Tt(x, y)) νt(dy) ≤ f(x), x ∈ A,

which is equivalent to

E(f(Xt+1) | Xt = x) ≤ f(x), x ∈ A,

and, in fact, aim at the class of methods for which the sequence f(Xt) is a supermartingale. Under the above inequality the Lyapunov function V satisfies V(µt+1) ≤ V(µt), where µt is the trajectory of the algorithm given by µt+1 = P(Tt,νt)µt. Lyapunov functions arise quite naturally as a tool ensuring stability of the process Xt in various contexts, see for example


Chapter VIII in [4]. The present thesis concerns the case in which the above supermartingale inequality is replaced with a softer condition. Under the assumptions of this thesis the function V is not necessarily monotonically decreasing along trajectories µt, but it satisfies V(µt+1) ≤ V(µt) + εt, where εt → 0. This is not a typical use of a Lyapunov function; however, it can still be shown that under some additional convergence assumptions V goes to zero along trajectories. This approach provides sufficient convergence conditions for a class of methods significantly wider than the class of supermartingales. The well-known Simulated Annealing algorithm, which will be analysed in Section 10, is an example of a non-supermartingale method.
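The non-supermartingale behaviour can be illustrated with a toy Metropolis-type Simulated Annealing step (an illustrative sketch, not the scheme analysed in Section 10; the cooling schedule and all names are assumptions): uphill moves are accepted with probability exp((f(x) − f(y))/c), so f(Xt) may increase, but the allowed increase vanishes as the temperature c goes to 0.

```python
import math
import random

def sa_step(x, f, c, rng, radius=0.5):
    """One toy Simulated Annealing step at temperature c: propose a nearby
    point, accept it always if downhill, with prob. exp(-increase/c) if uphill."""
    y = x + rng.uniform(-radius, radius)
    if f(y) <= f(x) or rng.random() < math.exp((f(x) - f(y)) / c):
        return y
    return x

f = lambda x: x * x
rng = random.Random(3)
x, uphill_moves, best = 2.0, 0, f(2.0)
for t in range(1, 5001):
    c = 1.0 / math.log(t + 1)      # hypothetical cooling schedule c_t -> 0
    x_new = sa_step(x, f, c, rng)
    uphill_moves += f(x_new) > f(x)   # counts the non-monotone steps
    best = min(best, f(x_new))
    x = x_new
```

Here f(Xt) is not a supermartingale (uphill moves do occur), yet the chain still locates the neighbourhood of the minimizer.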

The main result of this thesis is Theorem 4.3, which presents sufficient conditions for the convergence of the process Xt towards A*. The methodology behind it is based on the above mentioned topological approach to stochastic optimization. The weak convergence topology on M is considered in the proof, the Lyapunov function V is used, and it is shown that the probability distributions µt of Xt are weakly convergent towards M*. The classical general convergence results are usually based on classical probability theory [40], [32]. In the case of non-supermartingale type methods, Markov chain theory is sometimes used for the convergence analysis; see for example [38] or [3].

This thesis is organized as follows. Section 2 presents basic ideas of the weak convergence of probability measures and proves Observation 2.1, which will be used in the next section. Section 3 presents the general equivalences between basic types of stochastic convergence of random variables in the context of global optimization. It generalizes the classical results stated in [32]. Section 4 presents and discusses the main result, Theorem 4.3. In particular, it is shown that the results of previous papers, including [27] and [28], are consequences of this general result. At the end of Section 4, Theorem 4.10, an additional result which regards stability type properties of Xt, is presented. Section 5 prepares the necessary tools of the weak convergence of probability measures and Section 6 prepares the tools of dynamical systems theory. Section 7 then presents the proofs of Theorem 4.3 and Theorem 4.10. The next four sections present applications of the theorems from Section 4 to the following optimization methods: Grenade Explosion Method, the Elitist Evolution Strategy, Simulated Annealing and Accelerated Random Search. The results concerning the Grenade Explosion Method and the Evolution Strategy are recalled from paper [28]. Finally, the Appendix recalls basic definitions and facts from probability theory and presents the mathematical notation used in the thesis.

Acknowledgments. I would like to thank Professor Jerzy Ombach for bringing my attention to the subject of this thesis and for the last four years of inspiration and motivation to work.

The project was supported by a National Science Centre grant based on Decision DEC-2013/09/N/ST/04262.


2. Weak convergence of Borel probability measures

In this section we recall some basic facts about weak convergence of Borel probability measures and then prove Observation 2.1, which will be used in the next section. More details on weak convergence can be found, for example, in [9, 14, 39].

Let (S, dS) be a separable metric space, let B(S) denote the sigma-algebra of Borel subsets of S and let M = M(S) denote the space of Borel probability measures on S. A sequence µn ∈ M(S) weakly converges to some µ ∈ M(S) iff any of the following equivalent conditions is satisfied:

• for any bounded continuous function h : S → R

(2.1) ∫_S h dµn → ∫_S h dµ, as n → ∞,

• for any upper semi-continuous function h : S → R bounded from above

(2.2) lim sup_{n→∞} ∫_S h dµn ≤ ∫_S h dµ.

As S is separable, the topology of weak convergence on M(S) is metrizable. One of the accessible metrics is

dM(ν1, ν2) = inf{ε > 0 : ν1(D) ≤ ν2(D(ε)) + ε for any Borel set D},

where D(ε) = {x ∈ S : dS(x, y) < ε for some y ∈ D}. The metric dM is called the Prohorov metric, or sometimes the Lévy–Prohorov metric. It is a simple observation that the mapping

S ∋ s → δs ∈ M(S),

where δs denotes the Dirac measure concentrated on the point s ∈ S, is continuous and injective.

The weak topology on M(S) is separable. If S0 is a countable set dense in S, then the set

M0 = { Σ_{i=1}^n pi δ_{si} : n ∈ N; si ∈ S0; pi ∈ Q ∩ [0, ∞); Σ_{i=1}^n pi = 1 }

is dense in M(S) (and countable). If we assume that the metric space S is compact (thus separable), then M(S) is compact.

Observation 2.1 expresses the stochastic convergence of a sequence of random variables (which are distributed according to some sequence µn ∈ M(S)) to a Borel set D ⊂ S in terms of the weak convergence of their probability distributions towards the set M*(D) of probability distributions concentrated on D.


Observation 2.1. Let D ∈ B(S) and M*(D) = {µ ∈ M | µ(D) = 1}. For any sequence µn ∈ M(S) we have

dM(µn, M*(D)) → 0 ⇐⇒ ∀ε > 0 µn(D(ε)) → 1.

Proof. Assume that dM(µn, M*(D)) → 0, which means that for some sequence mn ∈ M*(D) we have dM(µn, mn) → 0. From the definition of the Prohorov metric we have:

(2.3) dM(µn, mn) ≥ inf{ε > 0 : µn(D(ε)) + ε ≥ mn(D) = 1}.

Fix ε > 0. Because dM(µn, mn) → 0, there is n0 such that for any n > n0 we have µn(D(ε)) + ε ≥ 1. Again by (2.3) and dM(µn, mn) → 0, for any positive natural k there is nk > n0 such that for any n > nk

µn(D(ε)) + ε/k ≥ µn(D(ε/k)) + ε/k ≥ 1.

In consequence, µn(D(ε)) → 1 as n → ∞, which finishes the first part of the proof as ε > 0 was chosen arbitrarily.

Assume now that µn(D(ε)) → 1 for every ε > 0. Fix ε > 0. Without loss of generality we assume that µn(D(ε)) > 0, n ∈ N. Let ¯µn ∈ M*(D(ε)) be defined by

¯µn(C) = µn(C ∩ D(ε)) / µn(D(ε)), C ∈ B(S).

As µn(D(ε)) → 1, it is a simple observation that dM(µn, ¯µn) → 0. Hence, as ε > 0 is chosen arbitrarily, to finish the proof it is enough to find a sequence mn ∈ M*(D) with lim sup_{n→∞} dM(mn, ¯µn) ≤ 2ε. Because the convex combinations of Dirac measures concentrated on elements of D(ε) are dense in M*(D(ε)), there are probability measures rn ∈ M*(D(ε)), n ∈ N, of the form

rn = Σ_{i=1}^{tn} p_i^n δ_{s_i^n}, where s_i^n ∈ D(ε),

such that dM(rn, ¯µn) ≤ ε. For any s_i^n let d_i^n ∈ D be a point with dS(s_i^n, d_i^n) < ε. Now we can define mn = Σ_{i=1}^{tn} p_i^n δ_{d_i^n}. It is a simple observation that dM(rn, mn) ≤ ε; in fact, for any C ∈ B(S), if s_i^n ∈ C, then d_i^n ∈ C(ε) and hence rn(C) ≤ mn(C(ε)) ≤ mn(C(ε)) + ε. We thus have

lim sup_{n→∞} dM(mn, ¯µn) ≤ lim sup_{n→∞} (dM(mn, rn) + dM(rn, ¯µn)) ≤ 2ε.

Letting ε → 0 we finish the proof. □

Remark 2.2. In the case D = {d} we have M*(D) = {δd} and it is easy to see that Observation 2.1 generalizes the well-known equivalence between weak convergence and convergence in probability of random variables towards a one-point distributed limit.


3. Some Equivalences for Global Convergence

This section presents the relations between basic types of stochastic global convergence. Without loss of generality we assume that we deal with the global minimization problem. The presented results are simple but they generalize many observations stated in the literature and will be useful in further sections. In particular, Theorem 3.6 presents the general equivalences between various global convergence modes in the class of optimization methods with the supermartingale property.

Let (A, d) be a separable metric space and f : A → R be a Borel measurable function having its global minimum f* = 0. We denote

(1) A* = {x ∈ A : f(x) = 0},
(2) Aδ = {x ∈ A : f(x) ≤ δ}, where δ > 0,
(3) A(δ) = {x ∈ A : f(x) < δ}, where δ > 0,
(4) A*(ε) = {x ∈ A : d(x, A*) < ε}, where ε > 0 and d(x, A*) = inf_{a∈A*} d(x, a).

Let (Ω, Σ, P) be a probability space and let Xt : Ω → A be a measurable sequence which represents the successive states of the algorithm. The global minimization task usually stands for either generating a sequence xt ∈ A which converges towards the set A* of solutions of the global minimization problem f(x) = 0, or generating a sequence xt ∈ A which satisfies f(xt) → 0.

We will say that a sequence Xt : Ω → A, where t = 0, 1, . . ., stochastically converges to A* ⊂ A, which we will denote by Xt →s A*, iff

∀ε > 0 lim_{t→∞} P(d(Xt, A*) < ε) = 1.

For monotonic methods (methods which satisfy f(Xt+1) ≤ f(Xt)) the stronger condition P(d(Xt, A*) → 0) = 1 is usually considered in the literature.

Recall that f(x) < δ iff x ∈ A(δ). The following condition (the convergence of f(Xt) in probability to 0) is often considered in the context of global minimization:

(3.1) ∀δ > 0 P(Xt ∈ A(δ)) → 1.

Naturally, some algorithms satisfy condition (3.1) and do not satisfy the stronger convergence mode P(f(Xt) → f*) = 1. For an example, see Theorems 1 and 2 in [6], which say that under appropriate assumptions the sequence f(Xt) generated by the Simulated Annealing algorithm does not converge to zero almost surely but still satisfies (3.1). If we apply Observation 3.2 to the results of [6] we obtain that this is also an example of a method which converges stochastically towards A* but not with probability one.

Throughout this chapter we assume that the function f satisfies the following conditions:


A1) ∀ε > 0 ∃δ > 0 A(δ) ⊂ A*(ε),
A2) ∀δ > 0 ∃ε > 0 A*(ε) ⊂ A(δ).

The conditions A1) and A2) are rather natural and imply that

d(xt, A*) → 0 ⇔ f(xt) → 0,

for any sequence xt ∈ A.

Observation 3.1. If for some δ0 > 0 the sublevel set Aδ0 is compact and the function f is continuous on Aδ0, then conditions A1) and A2) are satisfied. In fact, under these assumptions f is uniformly continuous on Aδ0, which implies A2). The function f is also bounded away from the minimal value f* = 0 outside any set A*(ε), which follows from the continuity of f and the compactness of the sets Aδ0 \ A*(ε) for any ε > 0 small enough, and proves A1).
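As a concrete illustration (a sketch, not part of the thesis), take A = [−1, 1] and f(x) = x², so A* = {0}: then A1) holds with the choice δ = ε² and A2) with ε = √δ, which the following grid check confirms. The helper names are illustrative.

```python
import math

# f(x) = x**2 on A = [-1, 1]; A* = {0}, A(delta) = {x : f(x) < delta},
# A*(eps) = {x : d(x, A*) < eps} = {x : |x| < eps}.
def in_A_delta(x, delta):      # x in A(delta)  <=>  f(x) < delta
    return x * x < delta

def in_A_star_eps(x, eps):     # x in A*(eps)   <=>  |x| < eps
    return abs(x) < eps

grid = [i / 1000.0 for i in range(-1000, 1001)]

# A1): given eps, the choice delta = eps**2 gives A(delta) subset A*(eps)
for eps in (0.5, 0.1, 0.01):
    delta = eps ** 2
    assert all(in_A_star_eps(x, eps) for x in grid if in_A_delta(x, delta))

# A2): given delta, the choice eps = sqrt(delta) gives A*(eps) subset A(delta)
for delta in (0.25, 0.01):
    eps = math.sqrt(delta)
    assert all(in_A_delta(x, delta) for x in grid if in_A_star_eps(x, eps))
```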

The observation below states, in particular, that under A1) and A2) both above mentioned interpretations of the global minimization problem are equivalent. Observation 3.2 is simple but generalizes many existing observations. For example, in the case where A* is a singleton, the equivalence from statement (1) was noticed in [6], and the second equivalence was proved in [35]. Under assumptions closely related to A1), A2), the equivalence from statement (2) was noticed in [32].

Observation 3.2. Assume that the function f : A → R satisfies conditions A1) and A2). Then:

(1) The following conditions are equivalent:
(a) Xt converges stochastically to A*,
(b) the probability distributions of Xt converge towards M* = {µ ∈ M(A) | µ(A*) = 1} in the Prohorov metric,
(c) f(Xt) converges in probability to 0,
(d) f(Xt) converges to 0 in distribution.

(2) The following conditions are equivalent:
(a) f(Xt) → 0 almost surely,
(b) d(Xt, A*) → 0 almost surely.

(3) Assume additionally that the measurable functions f(Xt) and d(Xt, A*) are bounded from above by some measurable function Z : Ω → [0, +∞) with E(Z) < ∞. Then the following conditions are equivalent:
(a) E(f(Xt)) → 0,
(b) E(d(Xt, A*)) → 0.
Additionally, under the above boundedness assumption, they are equivalent to conditions 1(a), 1(b), 1(c), 1(d).

Proof. To prove the first statement note that A1) and A2) imply that

∀ε > 0 ∃δ > 0 P(Xt ∈ A*(ε)) ≥ P(Xt ∈ A(δ))

and

∀δ > 0 ∃ε > 0 P(Xt ∈ A(δ)) ≥ P(Xt ∈ A*(ε)).


This proves the equivalence between conditions 1(a) and 1(c). Condition 1(c) means that f(Xt) goes in probability to the constant limit 0. The limit is one-point distributed and, by Observation 2.1 applied to S = R and D = {0}, this is equivalent to the weak convergence of the probability distributions of f(Xt) towards the Dirac measure δ0; hence the equivalence 1(c) ⇔ 1(d) holds true. The equivalence 1(a) ⇔ 1(b) is also a straightforward conclusion of Observation 2.1. To show the second statement it is enough to note that for any sequence xt ∈ A we have d(xt, A*) → 0 ⇔ f(xt) → 0, which follows directly from A1), A2). To see the third statement it is enough to notice that under the boundedness assumption the sequences f(Xt) and d(Xt, A*) are uniformly integrable, and hence the expected value convergence is equivalent to the convergence in probability for both sequences f(Xt) and d(Xt, A*). □

Remark 3.3. Conditions E(f(Xt)) → 0 and E(d(Xt, A*)) → 0, which are, in general, stronger than 1(a), 1(b), 1(c), 1(d), are not equivalent to each other in the general unbounded situation. For example, consider the function f : [0, ∞) ∋ x → x² ∈ [0, ∞) and the sequence of probability distributions

µn = (n² − 1)/n² · δ_{1/n} + 1/n² · δ_{n}.

The sequence Xt, if distributed according to µt, satisfies E(d(Xt, A*)) → 0 but E(f(Xt)) ≥ 1.
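The expectations in this counterexample can be checked directly (a short numerical sketch with illustrative names; here A* = {0}, so d(X, A*) = X):

```python
# Exact expectations under mu_n = ((n^2-1)/n^2) * delta_{1/n} + (1/n^2) * delta_{n}
# for f(x) = x**2 on [0, infinity).
def expectations(n):
    p_small, p_big = (n * n - 1) / n ** 2, 1 / n ** 2
    e_dist = p_small * (1 / n) + p_big * n            # E d(X, A*) = E X
    e_f = p_small * (1 / n) ** 2 + p_big * n ** 2     # E f(X) = E X**2
    return e_dist, e_f

for n in (10, 100, 1000):
    e_dist, e_f = expectations(n)
    assert e_f >= 1.0              # E f(X_n) stays bounded away from 0 ...
assert expectations(1000)[0] < 0.01  # ... while E d(X_n, A*) -> 0
```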

We will now focus on methods which satisfy the following supermartingale inequality:

E(f(Xt+1) | f(Xt), f(Xt−1), . . . , f(X0)) ≤ f(Xt) a. s.

The above inequality follows from the stronger condition

E(f(Xt+1) | Xt, Xt−1, . . . , X0) ≤ f(Xt) a. s.,

which is easier to verify in practice. In particular, if the sequence Xt is a Markov chain, then the above supermartingale-type inequalities follow from the following inequality:

(3.2) E(f(Xt+1) | Xt = x) ≤ f(x), x ∈ A.

Lemma 3.4. Assume that the sequence Xt is a Markov chain with E(f(Xt)) < +∞, t ∈ N. If inequality (3.2) is satisfied, then f(Xt) is a supermartingale.

Proof. Let Σt = σ(f(Xt), . . . , f(X0)), t ∈ N. Since Xt is a Markov chain, we have

E(f(Xt+1) | Xt, . . . , X0) = E(f(Xt+1) | Xt) a. s.

and, since Σt ⊂ σ(Xt, . . . , X0),

E(f(Xt+1) | Σt) = E(E(f(Xt+1) | Xt, . . . , X0) | Σt) = E(E(f(Xt+1) | Xt) | Σt).

From the monotonicity of the conditional expectation it follows that it is enough to show E(f(Xt+1) | Xt) ≤ f(Xt) a.s. We will show that this follows from inequality (3.2). In fact, as E(f(Xt+1) | Xt) is measurable with respect to σ(Xt), there is a Borel function h : A → R such that E(f(Xt+1) | Xt) = h(Xt). Hence we have E(f(Xt+1) | Xt = x) = h(x) ≤ f(x).

This implies E(f(Xt+1) | Xt) ≤ f(Xt), which finishes the proof. □

Remark 3.5. Monotonic algorithms (in which the sequence f(Xt) decreases along time a.s.) possess the supermartingale property. Furthermore, many optimization methods "remember" the best (in the sense of the cost function) point found so far and therefore can be treated as monotonic methods. In particular, the convergence of the sequence X_{τ(t)} is often considered, where

τ(t) = min{i ≤ t : f(Xi) = min_{j=0,...,t} f(Xj)}.

Clearly X_{τ(t)} is measurable, monotonic and satisfies f(X_{τ(t)}) = min_{i=0,...,t} f(Xi).
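The best-so-far construction X_{τ(t)} can be sketched in a few lines (illustrative Python with hypothetical names): the strict comparison keeps the earliest minimizer, matching the "min{i ≤ t : ...}" in the definition of τ(t).

```python
import random

def best_so_far(xs, f):
    """Return the sequence X_{tau(t)}: at each t, the earliest best point so far."""
    best, out = None, []
    for x in xs:
        if best is None or f(x) < f(best):   # strict '<' keeps the earliest minimizer
            best = x
        out.append(best)
    return out

f = lambda x: x * x
rng = random.Random(7)
xs = [rng.uniform(-1.0, 1.0) for _ in range(100)]  # raw, non-monotonic states X_t
ys = best_so_far(xs, f)
vals = [f(y) for y in ys]
```

By construction vals is nonincreasing and its last entry equals min over i ≤ t of f(Xi), so the wrapped sequence is a monotonic method even when the raw one is not.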

Observation 3.2 leads to:

Theorem 3.6. Assume that the function f : A → R satisfies A1), A2). If f(Xt) is a supermartingale, then the following conditions are equivalent:

(1) the probability distributions of Xt converge to M* = {µ ∈ M(A) | µ(A*) = 1} in the weak convergence topology,
(2) Xt → A* stochastically,
(3) d(Xt, A*) → 0 with probability one,
(4) f(Xt) converges to 0 in distribution,
(5) f(Xt) converges to 0 in probability,
(6) f(Xt) converges to 0 with probability one.

If we assume additionally that the measurable functions f(Xt) and d(Xt, A*) are bounded from above by some measurable function Z : Ω → [0, +∞) with E(Z) < ∞, then the above conditions (1)–(6) are equivalent to the following conditions:

(7) E(d(Xt, A*)) → 0,
(8) E(f(Xt)) ↘ 0.

Proof. The equivalences 4) ⇔ 5) ⇔ 6) are a simple consequence of supermartingale properties: to show 4) ⇔ 5) ⇔ 6) it is enough to notice 4) ⇒ 6), which, as the inequality 0 ≤ E(f(Xt+1)) ≤ E(f(Xt)) ≤ E(f(X0)) < ∞, t ∈ N, is satisfied, follows from the fact that an L¹-bounded supermartingale f(Xt) converges almost surely to a measurable limit and the fact that the limiting probability distribution must be unique. Observation 3.2 completes the proof of the first part of the theorem. To prove the second part we can apply the Dominated Convergence Theorem to the random variables d(Xt, A*) and f(Xt), which leads to 3) ⇒ 7) and 4) ⇒ 8). The implications 7) ⇒ 2) and 8) ⇒ 5) are clear, as convergence in mean is stronger than convergence in probability. □

Remark 3.7. The above theorem assumes that f(Xt) is a supermartingale and, in particular, E(f(X0)) < ∞. This assumption is satisfied, for example, in the natural situation in which the starting point X0 = x0 is fixed.


Remark 3.8. If f(Xt) is a monotonic sequence with E f(X0) < ∞, then it is a supermartingale bounded from above by Z = f(X0). Thus, in the monotonic case, condition (8) is equivalent to conditions (1)–(6).

Remark 3.9. For any Borel probability measures µ1 and µ2 on A let

‖µ1 − µ2‖ = sup_{B∈B(A)} |µ1(B) − µ2(B)|

denote the total variation distance. This concept, natural in the analysis of the convergence of Markov chains towards the unique stationary distribution under appropriate irreducibility-type assumptions, is not applicable to the convergence analysis in the presented context. In the standard situation A ⊂ Rⁿ the probability distributions µt of Xt are absolutely continuous with respect to the Lebesgue measure µ and often we have µ(A*) = 0, which leads to ‖µt − m‖ = 1 for any m ∈ M*.

4. Main Result and Consequences

In this section we formulate the main result of this thesis, Theorem 4.3, and then show some of its consequences. In particular, we prove that the main results of the previous papers [27], [28], [29] are consequences of Theorem 4.3.

From now on we assume that the function f : A → [0, ∞) is continuous. Let (B, dB) be a separable metric space. Let the sequence of random variables Xt : Ω → A, t ∈ N, be defined by the following nonautonomous equation:

(4.1) Xt+1 = Tt(Xt, Yt),

where

• Yt : Ω → B, t ∈ N, are random variables,
• Tt : A × B → A, t ∈ N, are Borel measurable,
• the random variables X0, Y0, Y1, . . . are independent.

Remark 4.1. For a given optimization method it is possible to construct various theoretical models (4.1). The results presented in this section provide general sufficient conditions for the convergence of Xt towards A*. They can be more or less convenient to use depending on the choice of a model (4.1).

Remark 4.2. Given a stochastic process Xt of the form (4.1), it is easy to find a separable space B̄ and measurable functions Ȳt : Ω → B̄ and T : A × B̄ → A such that Xt+1 = T(Xt, Ȳt). The easiest way is to put B̄ = B × N (which is separable and metrizable), Ȳt = (Yt, t) and T(x, y, t) = Tt(x, y). More generally, it can be useful to consider B̄ = B × T, where T ⊂ M(A × B, A), and T(x, y, S) = S(x, y), S ∈ T. However, in some cases theoretical representations for stochastic algorithms with Tt changing in time arise naturally. The general results presented in this section will thus concern the general situation of the form (4.1).


Let T = M(A × B, A) denote the topological space of all measurable operators T : A × B → A equipped with the topology of uniform convergence: a sequence {Tn}n∈N ⊂ T converges to a limit T ∈ T iff

sup_{(a,b)∈A×B} d(Tn(a, b), T(a, b)) → 0 as n → ∞.

We will focus mostly on the case where the space A is compact, thus bounded, in which the above topology is induced by the uniform convergence metric.

Let N = M(B) denote the topological space of Borel probability measures on B equipped with the weak convergence topology. The space T × N is endowed with the product topology. Let νt = P_{Yt} denote the distribution of Yt, t = 0, 1, . . .. It is easy to see that the distributions of Xt are determined by the initial distribution µ0 of X0 and the sequence {(Tt, νt)}_{t=0}^∞. The assumptions of the theorems presented in this section describe the relations between the pairs (Tt, νt) and the function f under which the convergence of the process Xt towards A* occurs.

For any δ > 0 we define the sets U(δ) ⊂ T × N and U0(δ) ⊂ T × N as follows:

(T, ν) ∈ U(δ) ⇐⇒def  ∫_B f(T(x, y)) ν(dy) ≤ f(x) for x ∉ A(δ)  and  ∫_B f(T(x, y)) ν(dy) ≤ δ for x ∈ A(δ),

(T, ν) ∈ U0(δ) ⇐⇒def  ∫_B f(T(x, y)) ν(dy) < f(x) for x ∉ A(δ)  and  ∫_B f(T(x, y)) ν(dy) ≤ δ for x ∈ A(δ).

Recall that a function F : S → R given on a metric space S is upper semi-continuous (lower semi-continuous) at x0 ∈ S iff for any sequence xn ∈ S, if xn → x0, then lim sup_{n→∞} F(xn) ≤ F(x0) (respectively, lim inf_{n→∞} F(xn) ≥ F(x0)). Recall that a family of sets {Un}n∈N is called a decreasing family iff Un+1 ⊂ Un, n ∈ N.

To simplify the notation in Theorem 4.3, we adopt the empty sum convention Σ_{i=t}^{t−1} δi := 0, t ∈ N, where δi ∈ R is a sequence. This theorem is the main result of this thesis.

Theorem 4.3. Assume that A is a compact metric space. Let {U0^k}k∈N be a decreasing family of compact sets with U0^k ⊂ U0(1/k) such that the following conditions are satisfied:

(A1) for any k ∈ N, any pair (T, ν) ∈ U0^k and x ∈ A, f ∘ T is upper semi-continuous at (x, y) for ν-almost any y ∈ B,

(B1) ∀t ∈ N (Tt, νt) ∈ U(δt), where δt > 0 is a sequence with δt → 0,


(C1) for any k ∈ N the sequence (Tt, νt) contains a subsequence (T_{t_n^k}, ν_{t_n^k}) ∈ U0^k such that lim_{n→∞} S_n^k = 0, where:

S_n^k = Σ_{i = t_n^k + 1}^{t_{n+1}^k − 1} δi.

Then

∀ε > 0 P(d(Xt, A*) < ε) → 1 and E f(Xt) → 0 as t → ∞.

Remark 4.4. For t ∈ N and x ∈ A, we have

(4.2) ∫_B f(Tt(x, y)) νt(dy) = E(f(Xt+1) | Xt = x).

In fact, since Xt and Yt are independent, we have

E(f(Xt+1) | Xt) = E(f(Tt(Xt, Yt)) | Xt) = ∫_B f(Tt(Xt, y)) νt(dy).

This makes conditions (B1) and (C1) more intuitive. In particular,

(Tt, νt) ∈ U(δ) ⇐⇒def  E(f(Xt+1) | Xt = x) ≤ f(x) for x ∉ A(δ)  and  E(f(Xt+1) | Xt = x) ≤ δ for x ∈ A(δ).

Note that S_n^k = Σ_{i∈C_n^k} δi, where C_n^k is the set of indices i ∈ N between t_n^k and t_{n+1}^k, i.e. between the n-th and the (n + 1)-th visit in the set U0^k. Thus C_n^k can be empty, which would simplify the analysis. The following theorem is a conclusion of this simple observation.

Theorem 4.5. Assume that A is a compact metric space. Let {U0^k}k∈N be a decreasing family of compact sets with U0^k ⊂ U0(1/k) and such that the following conditions are satisfied:

(A1) for any k ∈ N, (T, ν) ∈ U0^k and x ∈ A, f ∘ T is upper semi-continuous at (x, y) for ν-almost any y ∈ B,

(C2) for any t ∈ N, (Tt, νt) belongs to U0^{kt}, where kt is a sequence with kt → ∞.

Then

∀ε > 0 P(d(Xt, A*) < ε) → 1 and E f(Xt) → 0 as t → ∞.

Proof. We will use Theorem 4.3. As U0^{kt} ⊂ U(1/kt), condition (B1) of Theorem 4.3 is satisfied with δt = 1/kt. To prove that condition (C1) follows from condition (C2) it is enough to note that for any k ∈ N almost all elements of the sequence {S_n^k}n∈N are equal to 0, as the family {U0^k}k∈N is decreasing. □


Theorems 4.6 and 4.7 are the main results of paper [28]. Theorem 4.6 is presented here with a strengthened conclusion.

Theorem 4.6. Assume that A is compact and that U0 ⊂ T × N is a compact set such that the following conditions are satisfied:

(A) for any (T, ν) ∈ U0 and x ∈ A, f ∘ T is upper semi-continuous at (x, y) for any y from some set of full measure ν,

(B) for any x ∈ A and t ∈ N,

(4.3) ∫_B f(Tt(x, y)) νt(dy) ≤ f(x),

(C) for any (T, ν) ∈ U0 and x ∈ A \ A*,

(4.4) ∫_B f(T(x, y)) ν(dy) < f(x).

If the sequence (Tt, νt) contains a subsequence (T_{tn}, ν_{tn}) ∈ U0, then d(Xt, A*) → 0 and f(Xt) → 0 almost surely.

From equation (4.2) it follows that conditions (B) and (C) take the following more intuitive form:

E(f(Xt+1) | Xt = x) ≤ f(x) and E(f(Xt+1) | Xt = x) < f(x).

Thus, from Lemma 3.4 it follows that under condition (B) the sequence f(Xt) is a supermartingale.

Proof of Theorem 4.6. We will use Theorem 4.3. To see that condition (B1) is satisfied it is enough to note that (Tt, νt) ∈ ∩_{δ>0} U(δ), t ∈ N. Now, we define the decreasing family of compact sets {U0^k}k∈N by U0^k = U0. The set U0 contains a subsequence (T_{tk}, ν_{tk}) and it is a simple observation that conditions (A1) and (C1) hold true. The assumptions of Theorem 4.3 are thus satisfied, which leads to Xt →s A*. Theorem 3.6 finishes the proof, as condition (B) implies that the sequence f(Xt) is a supermartingale. □

In the monotonic case the compactness of A can be replaced with a softer condition. For any δ > 0 and T : A × B → A let

Tδ = T|_{Aδ×B} : Aδ × B → A, where Aδ = {x ∈ A : f(x) ≤ δ}.

For any U ⊂ T × N and δ > 0 let

(U)δ = {(Tδ, ν) : (T, ν) ∈ U}.

Clearly, if A and U0 ⊂ T × N are compact, then Aδ and (U0)δ are compact for any δ > 0. In the case A = Rⁿ the continuity of f implies that for any δ > 0 the set Aδ is compact iff it is bounded. In this case the compactness of the sets Aδ is equivalent to f(xn) → ∞ for any sequence xn with |xn| → ∞, where |·| is a norm on Rⁿ.


Theorem 4.7. Assume that Aδ is compact for any δ > 0. Let U0 ⊂ T × N be such that conditions (A) and (C) of Theorem 4.6 are satisfied and the set (U0)δ is compact for any δ > 0. Assume additionally:

(B') for any t ∈ N and x ∈ A, y ∈ B,

f(Tt(x, y)) ≤ f(x).

If the sequence ut = (Tt, νt) contains a subsequence {u_{tn} : n = 0, 1, . . .} ⊂ U0, then

P(d(Xt, A*) → 0, t → ∞) = 1.

Proof. Fix x0 ∈ A. The set (U0)_{f(x0)} is compact. If µ0 = δ_{x0}, then supp µ0 = {x0} ⊂ A_{f(x0)}. The set A_{f(x0)} is compact, Tt(A_{f(x0)} × B) ⊂ A_{f(x0)} for any t ∈ N, and A* ⊂ A_{f(x0)}. Thus, under the assumption µ0 = δ_{x0}, we can use Theorem 4.6 with respect to the function f|_{A_{f(x0)}}. In particular, the sequence f(Xt) converges to 0 in probability.

From equation (4.1) it easily follows that there are measurable mappings T^t : A × B^{t+1} → A, t ∈ N, such that Xt+1 = T^t(X0, Y0, . . . , Yt) (the mappings T^t will be defined formally by equation (4.6)). For any x ∈ A, let Xt(x) denote the sequence defined by Xt+1(x) = Tt(Xt(x), Yt) and X0(x) = x. In other words,

(4.5) Xt+1(x) = T^t(x, Y0, . . . , Yt).

Now let µ0 ∈ M be the probability distribution of X0; X0 is independent of the sequence Yt. Fix δ > 0. We have

P(f(Xt+1) < δ) = P(f(T^t(X0, Y0, . . . , Yt)) < δ) = E(1_{f(T^t(X0,Y0,...,Yt)) < δ}).

From Fubini's Theorem,

E(1_{f(T^t(X0,Y0,...,Yt)) < δ}) = ∫_A E(1_{f(T^t(x,Y0,...,Yt)) < δ}) µ0(dx).

Hence, as E(1_{f(T^t(x,Y0,...,Yt)) < δ}) = P(f(Xt+1(x)) < δ), we have

P(f(Xt+1) < δ) = ∫_A P(f(Xt+1(x)) < δ) µ0(dx).

We have shown above that the functions ϕt(x) := P(f(Xt(x)) < δ) satisfy ϕt(x) → 1 for any x ∈ A. We can thus use the Dominated Convergence Theorem to obtain

P(f(Xt+1) < δ) = ∫_A P(f(Xt+1(x)) < δ) µ0(dx) → 1 as t → ∞.

As δ > 0 has been chosen arbitrarily, f(Xt) goes to 0 in probability. Theorem 3.6 completes the proof. □


Condition (C) of Theorem 4.6 implies that, given Xt = x, the optimization method is able to reach a region with values smaller than f(x) in a single step. This condition is replaced with the n-step inequality condition (C') in the next theorem.

For any T1: A × B^k → A, where k ∈ N, and T2: A × B → A, let

T2 ◦ T1: A × B^{k+1} ∋ (a, b1, . . . , bk+1) −→ T2(T1(a, b1, . . . , bk), bk+1) ∈ A.

For any s ∈ N \ {0} and (T0, . . . , Ts) ∈ T^{s+1} we define

(4.6) Ts,...,0 = Ts ◦ . . . ◦ T0: A × B^{s+1} → A.

We will write T^0 := T0 and T^t := Tt ◦ . . . ◦ T0, t ∈ N \ {0}. We have:

Xt+1 = T^t(X0, Y0, . . . , Yt).

Theorems 4.8 and 4.9 are the main results of [29]. Theorem 4.8 is presented here with a strengthened thesis.

Theorem 4.8. Assume that A is compact. Let U0 ⊂ T × N be a compact set such that:

(A') for any (T, ν) ∈ U0 and x ∈ A, T is continuous at (x, y) for any y from some set of full measure ν,

(B) for any t ∈ N and x ∈ A

(4.7) ∫_B f(Tt(x, y)) νt(dy) ≤ f(x),

(C') there is s ∈ N \ {0} such that for any {(T̄i, ν̄i) : i = 1, . . . , s} ⊂ U0 and x ∈ A \ A?

(4.8) ∫_{B^s} f(T̄^s(x, y1, . . . , ys)) (ν̄s × · · · × ν̄1)(dys, . . . , dy1) < f(x),

where T̄^s = T̄s,...,1 is defined by (4.6).

If there is a subsequence utk = (Ttk, νtk) such that (utk, . . . , utk+(s−1)) ∈ U0 × · · · × U0, then

P(d(Xt, A?) → 0) = 1 and P(f(Xt) → 0) = 1.

Condition (C') means that E(f(Xtk+s) | Xtk = x) < f(x) for x ∈ A \ A?, where tk is a subsequence which satisfies the assumptions of the theorem.
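For intuition, this expected-decrease condition can be checked numerically for a concrete one-dimensional operator. The Python sketch below is illustrative only: the elitist Gaussian-mutation operator T, the quadratic problem function and all parameter values are assumptions for the example, not objects from the text. It estimates E(f(Xt+s) | Xt = x) by Monte Carlo and confirms a strict decrease at a non-optimal point.

```python
import random

def f(x):
    """Illustrative problem function with A* = {0} and min f = 0."""
    return x * x

def T(x, y):
    """One elitist step: accept the mutated point x + y only if it improves f."""
    z = x + y
    return z if f(z) < f(x) else x

def expected_f_after_steps(x, s, trials=20000):
    """Monte Carlo estimate of E(f(X_{t+s}) | X_t = x), mutations ~ N(0, 0.5^2)."""
    total = 0.0
    for _ in range(trials):
        z = x
        for _ in range(s):
            z = T(z, random.gauss(0.0, 0.5))
        total += f(z)
    return total / trials

random.seed(1)
x0 = 1.0
est = expected_f_after_steps(x0, s=3)
print(est < f(x0))  # strict expected decrease at the non-optimal point x0
```

Because the operator is elitist, f never increases along a trajectory, so condition (B) holds automatically; the strict inequality reflects the positive probability of improvement under Gaussian mutation.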

Condition (A') is stronger than (A) and this influences the applicability of the results. For instance, consider the simple Pure Random Search method (PRS), which at every step t samples a candidate point yt from some distribution ν, and then the operator T chooses from {xt, yt} the point with the smaller value of f. Note that T is continuous except on the level sets lc = f^{−1}({c}), which can be of positive ν measure, but the function f ◦ T(x, y) = min{f(x), f(y)} is continuous everywhere.
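A minimal sketch of this PRS selection mechanism (Python; the quadratic problem function and the uniform sampling distribution ν are illustrative assumptions):

```python
import random

def prs(f, sample, steps, x0):
    """Pure Random Search: propose y ~ nu, keep the better of {x, y}."""
    x = x0
    for _ in range(steps):
        y = sample()
        # The operator T chooses from {x, y} the point with the smaller
        # f-value, so f(T(x, y)) = min{f(x), f(y)} and f(x_t) is non-increasing.
        if f(y) < f(x):
            x = y
    return x

# Illustrative assumptions: f(x) = x^2 on A = [-1, 1], nu = Uniform(-1, 1).
random.seed(0)
best = prs(lambda x: x * x, lambda: random.uniform(-1.0, 1.0), 2000, 1.0)
print(best)  # a point near A* = {0} with overwhelming probability
```

The selection step is exactly the discontinuous operator discussed above: T itself jumps across level sets, while f ◦ T is continuous.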

Proof of Theorem 4.8. The proof is based on a technical construction. We will work under the assumptions and notation of Theorem 4.8 and the proof will use Theorem 4.6. Let p1: U0 → T be the projection onto the first coordinate.

Let

T0 = p1(U0) ∪ {Tt : t ∈ N} and B̃ = B^s × T0^s.

Note that B̃, equipped with the product metric, is a separable metric space. Let

T: A × B^s × T^s ∋ (x, b1, . . . , bs, T1, . . . , Ts) −→ Ts,...,1(x, b1, . . . , bs) ∈ A,

where Ts,...,1 is defined as in equation (4.6).

Without loss of generality we assume that the subsequence (Ttk, νtk) ∈ U0 satisfies tk+1 ≥ tk + s. Define a0 = 0 and

at+1 = at + s if at ∈ {tk : k ∈ N},  at+1 = at + 1 if at ∉ {tk : k ∈ N}.

Let

IA: A × B ∋ (a, b) −→ a ∈ A.
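The recursion for at simply advances by s at the distinguished indices tk and by 1 elsewhere, so each s-step block starting at some tk is traversed in one jump. A direct transcription (Python; the particular values of s and tk are illustrative and chosen so that tk+1 ≥ tk + s):

```python
def block_schedule(tks, s, steps):
    """Generate a_0, a_1, ...: advance by s at the distinguished indices tks
    (where an s-step block of the original chain is grouped), else by 1."""
    tk_set = set(tks)
    a, out = 0, [0]
    for _ in range(steps):
        a = a + s if a in tk_set else a + 1
        out.append(a)
    return out

# Illustrative choice satisfying t_{k+1} >= t_k + s for s = 2.
print(block_schedule([3, 5, 9], s=2, steps=8))  # → [0, 1, 2, 3, 5, 7, 8, 9, 11]
```

The spacing assumption tk+1 ≥ tk + s guarantees that a block started at tk never overlaps the next distinguished index.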

The subsequence Xat satisfies

Xa(t+1) = T(Xat, Ŷat, T̂at),

where

T̂at := (Tat, IA, . . . , IA) (s components), Ŷat := (Yat, . . . , Yat) (s copies) for at ∉ {tk : k ∈ N},

and

T̂at := (Tat, . . . , Tat+(s−1)), Ŷat := (Yat, . . . , Yat+(s−1)) for at ∈ {tk : k ∈ N}.

The sequences X̄t = Xat, Ȳt = (Ŷat, T̂at) satisfy

(4.9) X̄t+1 = T(X̄t, Ŷat, T̂at) = T(X̄t, Ȳt).

We will show that equation (4.9) satisfies the assumptions of Theorem 4.6. Note that Ȳt: Ω → B^s × T0^s is an independent sequence. Define

M(B^s × T0^s) ⊃ Ũ0² = {(νs × · · · × ν1) × (δTs × · · · × δT1) : (Ti, νi) ∈ U0, i = 1, . . . , s}.

The set Ũ0² is compact as a continuous image of the compact set U0 × · · · × U0 and hence the set Ũ0 = {T} × Ũ0² is compact. Furthermore, directly from the construction (and assumption (C') of Theorem 4.8), the sequence ν̄t = PȲt contains a subsequence ν̄bk (bk is an increasing sequence of natural numbers) of the form

ν̄bk = (ν(tk+s−1) × · · · × νtk) × (δT(tk+s−1) × · · · × δTtk) ∈ Ũ0².

It remains to show that the sequence X̄t and the set Ũ0 satisfy conditions (A), (B), (C) of Theorem 4.6. Conditions (B) and (C) of Theorem 4.6 follow immediately from assumptions (B), (C') of Theorem 4.8, the construction of the algorithm X̄t and Fubini's Theorem. Now we will show that condition (A') of Theorem 4.8 implies that for any x ∈ A and any ν̄ ∈ Ũ0² ⊂ M(B^s × T0^s) the map T is continuous at (x, y) for ν̄-almost any y = (b, T) ∈ B̃ = B^s × T0^s, which proves condition (A) of Theorem 4.6. In fact, if (T1^i, . . . , Ts^i) → (T1, . . . , Ts) uniformly and b^i = (b1^i, . . . , bs^i) → (b1, . . . , bs) = b, then Ts,...,1^i → Ts,...,1 uniformly and, as T(x, b^i, T^i) = Ts,...,1^i(x, b^i), it is enough to see that Ts,...,1(x, b^i) → Ts,...,1(x, b), which is true for any b from some set of full measure ν1 × · · · × νs. Thus Theorem 4.6 can be applied to the sequence X̄t = Xat, which implies that E(f(Xat)) ↘ 0. From condition (B) it follows that the sequence E(f(Xt)) is monotonic and thus we have E(f(Xt)) ↘ 0. Theorem 3.6 completes the proof. □

As before, the assumption of the compactness of A can be relaxed.

Theorem 4.9. Assume that Aδ is compact for any δ > 0. Let U0 ⊂ T × N be such that (U0)δ is compact for any δ > 0 and the conditions (A') and (C') of Theorem 4.8 are satisfied. Assume that

(B') for any (T, ν) ∈ U and x ∈ A, y ∈ B

f(T(x, y)) ≤ f(x).

Let ut = (Tt, νt). If for any t ∈ N there is t0 ≥ t such that for any i ≤ s we have ut0+i ∈ U0, then

P(d(Xt, A?) → 0, t → ∞) = 1.

Proof. Theorem 4.9 follows from Theorem 4.8. To see this it is enough to repeat the argument from the proof of Theorem 4.7. □

The results presented so far concern the attractiveness of the set A?. The theorem below is an additional result which concerns stability-type properties of A?. This result will be proved in Section 7.

Theorem 4.10. Let Xt be a process defined by (4.1), f: A → R be continuous and let A be compact.

(1) Under the assumptions of Theorem 4.3 we have:

∀ε > 0 ∃δ > 0 ∃t0 ∀t > t0: P(d(Xt, A?) < δ) ≥ 1 − δ =⇒ P(d(Xt+s, A?) < ε) ≥ 1 − ε, s ∈ N.

(2) Under condition (B) of Theorem 4.6 we have:

∀ε > 0 ∃δ > 0: P(d(X0, A?) < δ) ≥ 1 − δ =⇒ P(d(Xt, A?) < ε) ≥ 1 − ε, t ∈ N.

(3) Under condition (B') of Theorem 4.7 we have:

∀ε > 0 ∃δ > 0: P(d(Xt, A?) < ε) ≥ P(d(X0, A?) < δ), t ∈ N.


Remark 4.11. In the context of global optimization the sequence Xt represents the successive states of the algorithm. To simplify the formulation of the presented results we assumed that it takes values in the domain A, which is often not satisfied in practice. For example, Xt can represent a population of individuals and then Xt = (Xt1, . . . , Xtk) ∈ A^k for some k ≥ 1. We can still apply the theorems to the global minimization problem given on the set A^k with respect to functions like

max_{i=1,...,k} f(xi)  or  Σ_{i=1}^{k} f(xi).
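Both liftings can be transcribed directly. In the Python sketch below the problem function f and the population are illustrative assumptions; either lifted function vanishes exactly when every individual of the population is a global minimizer.

```python
def lift_max(f, population):
    """F(x1, ..., xk) = max_i f(xi); vanishes exactly on (A*)^k."""
    return max(f(x) for x in population)

def lift_sum(f, population):
    """F(x1, ..., xk) = sum_i f(xi); also vanishes exactly on (A*)^k."""
    return sum(f(x) for x in population)

f = lambda x: x * x  # illustrative problem function with A* = {0}
pop = [0.0, 0.5, -0.2]
print(lift_max(f, pop))            # → 0.25
print(round(lift_sum(f, pop), 2))  # → 0.29
```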

Remark 4.12. Deterministic parameters of a nonautonomous optimization method determine the mappings Tt and the distributions νt, see Section 8 or Section 10 for some practical examples. Their values can change over time – for example, the algorithm can gradually move from the global search phase to the local search phase. However, some optimization schemes, like evolutionary algorithms, use self-adaptive mechanisms, see Section 9. In this case the non-deterministic parameters of the algorithm (self-adaptive parameters) take their values in some space C (in practice C is a subset of R^l) and then we can assume that Xt = (X̂t, Ct): Ω → A^k × C. If C is compact, then we can apply the theorems to the set A^k × C, considering for example the function

A^k × C ∋ (x, c) −→ Σ_{i=1}^{k} f(xi) ∈ R.

In some cases, if C is not a compact space, it is still possible to consider a compactification of C and then apply the theorems. However, the presented approach to the convergence analysis of self-adaptive strategies ignores the stochastic mechanism of the self-adaptive parameters and provides sufficient convergence conditions which depend only on the set C of values of the self-adaptive parameters.

Remark 4.13. Some authors consider the current best (in the sense of the cost function) iterate convergence towards global minima, which is equivalent to analyzing the condition P(min_{i=0,1,...,t} f(Xi) ↘ 0) = 1. To analyze this convergence mode (which is weaker than E(f(Xt)) → 0) we can consider the sequence X̄t = (Xt, X̂t), where X̂t = Xkt and kt is the smallest natural number with f(Xkt) = min_{i=0,1,...,t} f(Xi). More formally,

X̄t+1 = (Tt(Xt, Yt), T(Tt(Xt, Yt), X̂t)),

where T: A × A → A chooses the point with the smaller value of f. The best iterate convergence of Xt is equivalent to the convergence of X̄t to A × A?, which is the set of global minima of the function f̄(x1, x2) = f(x2).
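The augmented chain X̄t can be sketched as follows (Python). The base method, a blind uniform sampler, and the problem function are illustrative assumptions, not part of the construction:

```python
import random

def best_iterate_chain(f, step, x0, steps):
    """Run Xbar_t = (X_t, Xhat_t), where Xhat_t is the best iterate so far:
    Xbar_{t+1} = (T_t(X_t, Y_t), T(T_t(X_t, Y_t), Xhat_t)) and
    T: A x A -> A returns the point with the smaller f-value."""
    x = xhat = x0
    for t in range(steps):
        x = step(x, t)                        # X_{t+1} = T_t(X_t, Y_t)
        xhat = x if f(x) < f(xhat) else xhat  # keep the better point
    return x, xhat

# Illustrative base method: blind uniform sampling on A = [-1, 1].
random.seed(2)
f = lambda x: x * x
x, xhat = best_iterate_chain(f, lambda x, t: random.uniform(-1, 1), 1.0, 500)
print(f(xhat) <= f(x))  # the best-so-far value never exceeds the current one
```

Convergence of the second coordinate to A? is exactly the best-iterate convergence described above, even when the first coordinate does not converge at all.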


5. Weak Convergence of Borel Probability Measures

First we will present some well-known properties of weak convergence which we will use in the remainder of this section. Let S1 and S2 be separable metric spaces. If h: S1 → S2 is a Borel function, then for any µ ∈ M(S1), µh^{−1} denotes the Borel probability measure on S2 defined by µh^{−1}(C) = µ(h^{−1}(C)) for any C ∈ B(S2). For µ ∈ M(S1) and ν ∈ M(S2), µ × ν denotes the product of the measures µ and ν, which is uniquely characterized by (µ × ν)(C × D) = µ(C) · ν(D) for all C ∈ B(S1), D ∈ B(S2).

As S1 and S2 are separable, we have B(S1 × S2) = B(S1) ⊗ B(S2) = σ(A1 × A2 : A1 ∈ B(S1), A2 ∈ B(S2)). For any x ∈ S1 and a Borel set D ⊂ S1 × S2 the section Dx = {y ∈ S2 : (x, y) ∈ D} ⊂ S2 is Borel and we have

(5.1) (µ × ν)(D) = ∫_{S1} ν(Dx) µ(dx).

For any h: S1 → S2, let Dh = {x ∈ S1 : h is not continuous at x}.

If S2 = R, then we write Dhu = {x ∈ S1 : h is not upper semi-continuous at x}.

Lemma 5.1 (Theorem 2.8 in [9]). Let µn, νn be sequences of Borel probability measures on separable metric spaces S1, S2 respectively, with µn → µ and νn → ν for some µ ∈ M(S1) and ν ∈ M(S2). Then µn × νn → µ × ν.

Lemma 5.2 (Theorem 2.7 in [9]). Assume that S1, S2 are metric spaces, µ ∈ M (S1) and T : S1 → S2 is measurable with µ(DT) = 0. Then, for any sequence µn of Borel probability measures on S1, if µn→ µ, then µnT−1 → µT−1.

The following lemma generalizes the weak convergence condition given by (2.2).

Lemma 5.3. Assume that µn is a sequence of Borel probability measures on a metric space S with µn → µ for some µ ∈ M(S). Then for any measurable function h: S → R which is bounded from above, if µ(Dhu) = 0, then

lim sup_{n→∞} ∫_S h dµn ≤ ∫_S h dµ.

Proof. For any x0 ∈ S define ĥ(x0) = lim_{ε→0+} sup_{x∈B(x0,ε)} h(x). It is a simple observation that h is upper semi-continuous at x0 iff h(x0) = ĥ(x0), and that the function ĥ is itself upper semi-continuous on S. Since h ≤ ĥ, µn → µ and µ(Dhu) = 0, we have:

lim sup_{n→∞} ∫_S h dµn ≤ lim sup_{n→∞} ∫_S ĥ dµn ≤ ∫_S ĥ dµ = ∫_{S \ Dhu} ĥ dµ = ∫_{S \ Dhu} h dµ = ∫_S h dµ.
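The envelope ĥ from the proof can also be approximated numerically on a grid: ĥ(x0) is the limiting supremum of h over shrinking balls around x0. In the Python sketch below the step function h (for which Dhu = {0}) and the grid parameters are illustrative choices.

```python
def usc_envelope(h, x0, eps=1e-6, grid=1000):
    """Grid approximation of hhat(x0) = lim_{eps->0+} sup_{|x-x0|<eps} h(x)."""
    pts = [x0 + eps * (2.0 * i / grid - 1.0) for i in range(grid + 1)]
    return max(h(x) for x in pts)

h = lambda x: 1.0 if x < 0 else 0.0  # upper semi-continuous except at x = 0

print(usc_envelope(h, -1.0))  # → 1.0 (h is continuous here, so hhat = h)
print(usc_envelope(h, 1.0))   # → 0.0
print(usc_envelope(h, 0.0))   # → 1.0 (hhat(0) = 1 > h(0) = 0, so 0 lies in Dhu)
```

The three evaluations illustrate the two facts used in the proof: ĥ agrees with h wherever h is upper semi-continuous, and dominates it on Dhu.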


